This documentation corresponds to an older version of the product, is no longer updated, and may contain outdated information.
Please access the latest versions from https://cisco-tailf.gitbook.io/nso-docs and update your bookmarks.

NSO Alarms

Overview

NSO generates alarms for serious problems that must be remedied. Alarms are available over all northbound interfaces and exist under the path /alarms. NSO alarms are managed like any other alarms by the general NSO Alarm Manager; see the section on the Alarm Manager to understand the general alarm mechanisms.

The NSO Alarm Manager also presents a northbound SNMP view: alarms can be retrieved as an alarm table, and alarm state changes are reported as SNMP notifications. See the "NSO Northbound" documentation on how to configure the SNMP agent.

This is also illustrated by the example in /examples.ncs/getting-started/using-ncs/5-snmp-alarm-northbound.

Alarm type structure

 alarm-type
     ha-alarm
         certificate-expiration
         ha-node-down-alarm
             ha-primary-down
             ha-secondary-down
     ncs-cluster-alarm
         cluster-subscriber-failure
     ncs-dev-manager-alarm
         abort-error
         bad-user-input
         commit-through-queue-blocked
         commit-through-queue-failed
         commit-through-queue-failed-transiently
         commit-through-queue-rollback-failed
         configuration-error
         connection-failure
         final-commit-error
         missing-transaction-id
         ned-live-tree-connection-failure
         out-of-sync
         revision-error
     ncs-package-alarm
         package-load-failure
         package-operation-failure
     ncs-service-manager-alarm
         service-activation-failure
     ncs-snmp-notification-receiver-alarm
         receiver-configuration-error
     time-violation-alarm
         transaction-lock-time-violation
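
The tree can be read as an identity hierarchy in which only the leaves appear in actual alarms (the base identities in the table below are described as never reported). The following Python sketch transcribes the tree into a lookup table and shows two checks: whether one identity derives from another, similar in spirit to the YANG 1.1 derived-from() XPath function, and whether an identity is a reportable leaf. It is an illustrative model, not an NSO API; CHILDREN, PARENT, is_derived_from, and is_reportable are hypothetical names.

```python
# Hypothetical model of the alarm type tree above: base identity -> direct sub-identities.
CHILDREN = {
    "alarm-type": [
        "ha-alarm", "ncs-cluster-alarm", "ncs-dev-manager-alarm",
        "ncs-package-alarm", "ncs-service-manager-alarm",
        "ncs-snmp-notification-receiver-alarm", "time-violation-alarm",
    ],
    "ha-alarm": ["certificate-expiration", "ha-node-down-alarm"],
    "ha-node-down-alarm": ["ha-primary-down", "ha-secondary-down"],
    "ncs-cluster-alarm": ["cluster-subscriber-failure"],
    "ncs-dev-manager-alarm": [
        "abort-error", "bad-user-input", "commit-through-queue-blocked",
        "commit-through-queue-failed", "commit-through-queue-failed-transiently",
        "commit-through-queue-rollback-failed", "configuration-error",
        "connection-failure", "final-commit-error", "missing-transaction-id",
        "ned-live-tree-connection-failure", "out-of-sync", "revision-error",
    ],
    "ncs-package-alarm": ["package-load-failure", "package-operation-failure"],
    "ncs-service-manager-alarm": ["service-activation-failure"],
    "ncs-snmp-notification-receiver-alarm": ["receiver-configuration-error"],
    "time-violation-alarm": ["transaction-lock-time-violation"],
}

# Invert the tree so each identity maps to its direct base identity.
PARENT = {child: base for base, kids in CHILDREN.items() for child in kids}


def is_derived_from(identity: str, base: str) -> bool:
    """True if `identity` is a strict descendant of `base` in the tree."""
    node = PARENT.get(identity)
    while node is not None:
        if node == base:
            return True
        node = PARENT.get(node)
    return False


def is_reportable(identity: str) -> bool:
    """True for leaf identities; base identities are never reported in alarms."""
    return identity in PARENT and identity not in CHILDREN


if __name__ == "__main__":
    print(is_derived_from("ha-primary-down", "ha-alarm"))  # True
    print(is_reportable("ha-node-down-alarm"))             # False
```

With this transcription, 22 of the identities in the tree are reportable leaves.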

Alarm type descriptions

Table 2. Alarm type descriptions (alphabetically)
Alarm Identity
abort-error
Initial Perceived Severity
major
Description
An error happened while aborting or reverting a transaction. The device's configuration is likely to be inconsistent with the NCS CDB.
Recommended Action
Inspect the configuration difference with compare-config and, if there are conflicts, resolve them with sync-from or sync-to.
Alarm message(s)
  • Device {dev} is locked

  • Device {dev} is southbound locked

  • abort error

Clear condition(s)
If NCS achieves sync with the device, or receives a transaction ID for a NETCONF session towards the device, the alarm is cleared.
Alarm Identity
alarm-type
Description
Base identity for alarm types. A unique identification of the fault, not including the managed object. Alarm types are used to identify whether alarms indicate the same problem, for lookup into external alarm documentation, etc. Different managed object types and instances can share alarm types. If the same managed object reports the same alarm type, it is considered to be the same alarm. The alarm type is a simplification of the different X.733 and 3GPP alarm IRP alarm correlation mechanisms, and it allows for hierarchical extensions. A 'specific-problem' can be used in addition to the alarm type in order to have different alarm types based on information not known at design time, such as values in textual SNMP notification varbinds.
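
A minimal sketch of the identification rule described above, assuming an alarm is keyed by its managed object, its alarm type, and an optional specific-problem string: a repeated report with the same key updates the existing alarm instead of creating a new one. AlarmKey, Alarm, and AlarmList are hypothetical names, and the managed-object path format used in the example is only an assumption; this is not how NSO stores its alarm list internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmKey:
    """Hypothetical key: what makes two reports 'the same alarm' in this sketch."""
    managed_object: str
    alarm_type: str
    specific_problem: str = ""  # optional, e.g. derived from an SNMP varbind


@dataclass
class Alarm:
    key: AlarmKey
    severity: str
    text: str
    changes: int = 0  # number of state changes after the first report


class AlarmList:
    def __init__(self) -> None:
        self.alarms = {}  # AlarmKey -> Alarm

    def report(self, key: AlarmKey, severity: str, text: str) -> Alarm:
        alarm = self.alarms.get(key)
        if alarm is None:
            # New combination of managed object, alarm type and specific problem.
            alarm = self.alarms[key] = Alarm(key, severity, text)
        else:
            # Same managed object and same alarm type: update, do not duplicate.
            alarm.severity, alarm.text = severity, text
            alarm.changes += 1
        return alarm
```

For example, reporting connection-failure twice for the same device yields one alarm with one recorded change, while adding a specific-problem value produces a second, separate alarm.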
Alarm Identity
bad-user-input
Initial Perceived Severity
critical
Description
Invalid input from the user. NCS cannot recognize the parameters needed to connect to the device.
Recommended Action
Verify that the user-supplied input is correct.
Alarm message(s)
  • Resource {resource} doesn't exist

Clear condition(s)
This alarm is not cleared.
Alarm Identity
certificate-expiration
Description
The certificate is nearing its expiry or has already expired. The severity depends on the time left to expiry; it ranges from warning to critical.
Recommended Action
Replace the certificate.
Alarm message(s)
  • Certificate expires in less than {days} day(s)/Certificate has expired.

Clear condition(s)
This alarm is cleared when the certificate is no longer loaded.
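
A sketch of how the severity of certificate-expiration could be derived from the time left, assuming two hypothetical thresholds (WARN_DAYS and CRITICAL_DAYS); the real NSO cut-over points are not given in this table. The returned messages follow the alarm message formats listed above.

```python
import math
from datetime import datetime, timedelta, timezone

# Hypothetical cut-over points; the actual NSO thresholds are not documented here.
WARN_DAYS = 30
CRITICAL_DAYS = 7


def certificate_alarm(not_after: datetime, now: datetime):
    """Return (severity, message) for a certificate, or None if no alarm is due."""
    days_left = (not_after - now) / timedelta(days=1)
    if days_left <= 0:
        return ("critical", "Certificate has expired.")
    if days_left > WARN_DAYS:
        return None
    severity = "critical" if days_left <= CRITICAL_DAYS else "warning"
    # floor + 1 keeps the "less than" wording true for whole-day values too.
    days = math.floor(days_left) + 1
    return (severity, f"Certificate expires in less than {days} day(s)")


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(certificate_alarm(now + timedelta(days=10), now))
```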
Alarm Identity
cluster-subscriber-failure
Initial Perceived Severity
critical
Description
Failure to establish a notification subscription towards a remote node.
Recommended Action
Verify IP connectivity between cluster nodes.
Alarm message(s)
  • Failed to establish netconf notification subscription to node ~s, stream ~s

  • Commit queue items with remote nodes will not receive required event notifications.

Clear condition(s)
This alarm is cleared if NCS succeeds in establishing a subscription towards the remote node, or when the subscription is explicitly stopped.
Alarm Identity
commit-through-queue-blocked
Initial Perceived Severity
warning
Description
A commit was queued behind a queue item that is waiting to connect to one of its devices. This is potentially dangerous, since a single unreachable device can fill up the commit queue indefinitely.
Alarm message(s)
  • Commit queue item ~p is blocked because item ~p cannot connect to ~s

Clear condition(s)
An alarm raised due to a transient error will be cleared when NCS is able to reconnect to the device.
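
The danger described for commit-through-queue-blocked can be illustrated with a small model. The sketch assumes that queue items touching overlapping sets of devices are applied in queue order, and that each item records which device, if any, it is still trying to reach (the hypothetical unreachable field). It produces messages in the format listed above, with ~p and ~s filled in; QueueItem and blocked_messages are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class QueueItem:
    item_id: int
    devices: FrozenSet[str]
    # Hypothetical field: device this item is still trying to connect to, if any.
    unreachable: Optional[str] = None


def blocked_messages(queue: List[QueueItem]) -> List[str]:
    """Report items queued behind an earlier item that cannot reach a shared device.

    Assumes items with overlapping device sets run in queue order. Only direct
    blocking is detected; an item stuck behind another blocked item is not reported.
    """
    messages = []
    for pos, item in enumerate(queue):
        for earlier in queue[:pos]:
            if earlier.unreachable and earlier.devices & item.devices:
                messages.append(
                    f"Commit queue item {item.item_id} is blocked because "
                    f"item {earlier.item_id} cannot connect to {earlier.unreachable}"
                )
                break
    return messages
```

In this model, one item stuck on an unreachable device holds back every later item that shares any of its devices, which is why the queue can fill up.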
Alarm Identity
commit-through-queue-failed
Initial Perceived Severity
critical
Description
A queued commit failed.
Recommended Action
Resolve with rollback if possible.
Alarm message(s)
  • Failed to authenticate towards device {device}: {reason}

  • Device {dev} is locked

  • {Reason}

  • Device {dev} is southbound locked

  • Commit queue item {CqId} rollback invoked

  • Commit queue item {CqId} has failed: Operation failed because: inconsistent database

  • Remote commit queue item ~p cannot be unlocked: cluster node not configured correctly

Clear condition(s)
This alarm is not cleared.
Alarm Identity
commit-through-queue-failed-transiently
Initial Perceived Severity
critical
Description
A queued commit failed because it exhausted its retry attempts on transient errors.
Recommended Action
Resolve with rollback if possible.
Alarm message(s)
  • Failed to connect to device {dev}: {reason}

  • Connection to {dev} timed out

  • Failed to authenticate towards device {device}: {reason}

  • The configuration database is locked for device {dev}: {reason}

  • the configuration database is locked by session {id} {identification}

  • {Dev}: Device is locked in a {Op} operation by session {session-id}

  • resource denied

  • Commit queue item {CqId} rollback invoked

  • Commit queue item {CqId} has failed: Operation failed because: inconsistent database

  • Remote commit queue item ~p cannot be unlocked: cluster node not configured correctly

Clear condition(s)
This alarm is not cleared.
Alarm Identity
commit-through-queue-rollback-failed
Initial Perceived Severity
critical
Description
Rollback of a commit-queue item failed.
Recommended Action
Investigate the status of the device and resolve the situation by issuing the appropriate action, such as a service redeploy or a sync operation.
Alarm message(s)
  • {Reason}

Clear condition(s)
This alarm is not cleared.
Alarm Identity
configuration-error
Initial Perceived Severity
critical
Description
Invalid configuration of an NCS managed device; NCS cannot recognize the parameters needed to connect to the device.
Recommended Action
Verify that the configuration parameters defined in the tailf-ncs-devices.yang submodule are consistent for this device.
Alarm message(s)
  • Failed to resolve IP address for {dev}

  • the configuration database is locked by session {id} {identification}

  • {Reason}

  • Resource {resource} doesn't exist

Clear condition(s)
The alarm is cleared when NCS reads the configuration parameters for the device, and is raised again if the parameters are invalid.
Alarm Identity
connection-failure
Initial Perceived Severity
major
Description
NCS failed to connect to a managed device before the timeout expired.
Recommended Action
Verify the address, port, and authentication, and check that the device is up and running. If the error occurs intermittently, increase connect-timeout.
Alarm message(s)
  • The connection to {dev} was closed

  • Failed to connect to device {dev}: {reason}

Clear condition(s)
If NCS successfully reconnects to the device, the alarm is cleared.
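
A minimal sketch of the raise/clear cycle described for connection-failure, assuming one alarm per device: a failed attempt raises (or updates) the alarm, and a successful reconnect clears it. The ConnectionAlarmTracker class and its on_connect method are hypothetical; the stored message follows the "Failed to connect to device {dev}: {reason}" format listed above.

```python
class ConnectionAlarmTracker:
    """Hypothetical model of the connection-failure raise/clear cycle."""

    def __init__(self):
        self.active = {}  # device name -> message of the currently raised alarm

    def on_connect(self, device, ok, reason=""):
        """Record a connection attempt; return "raised", "cleared" or None."""
        if ok:
            # A successful reconnect clears an outstanding alarm, if any.
            return "cleared" if self.active.pop(device, None) is not None else None
        already_raised = device in self.active
        # A repeated failure only updates the message of the existing alarm.
        self.active[device] = f"Failed to connect to device {device}: {reason}"
        return None if already_raised else "raised"
```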
Alarm Identity
final-commit-error
Initial Perceived Severity
critical
Description
A managed device validated a configuration change but failed to commit it. When this happens, NCS and the device are out of sync.
Recommended Action
Reconcile by comparing the configurations and then running sync-from or sync-to.
Alarm message(s)
  • The connection to {dev} was closed

  • External error in the NED implementation for device {dev}: {reason}

  • Internal error in the NED NCS framework affecting device {dev}: {reason}

Clear condition(s)
If NCS achieves sync with a device, the alarm is cleared.
Alarm Identity
ha-alarm
Description
Base type for all alarms related to high availability. This is never reported; the sub-identities for the specific high availability alarms are used in the alarms.
Alarm Identity
ha-node-down-alarm
Description
Base type for all alarms related to nodes going down in high availability. This is never reported; the sub-identities for the specific node down alarms are used in the alarms.
Alarm Identity
ha-primary-down
Initial Perceived Severity
critical
Description
The node lost the connection to the primary node.
Recommended Action
Make sure the HA cluster is operational, investigate why the primary went down, and bring it up again.
Alarm message(s)
  • Lost connection to primary due to: Primary closed connection

  • Lost connection to primary due to: Tick timeout

  • Lost connection to primary due to: code {Code}

Clear condition(s)
This alarm is never automatically cleared and has to be cleared manually when the HA cluster has been restored.
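
Several entries in this table, including ha-primary-down, are never cleared automatically. The sketch below transcribes those entries from the "Clear condition(s)" rows into a set and filters a list of raised alarm types down to the ones NSO will not clear by itself. The set reflects only this table; NOT_AUTO_CLEARED and needs_manual_clear are hypothetical names.

```python
# Transcribed from this table's "Clear condition(s)" rows: alarm types that
# are never cleared automatically (hypothetical helper, not an NSO API).
NOT_AUTO_CLEARED = frozenset({
    "bad-user-input",
    "commit-through-queue-failed",
    "commit-through-queue-failed-transiently",
    "commit-through-queue-rollback-failed",
    "ha-primary-down",
    "package-operation-failure",
})


def needs_manual_clear(alarm_types):
    """Return the distinct raised alarm types that will not clear on their own, sorted."""
    return sorted({t for t in alarm_types if t in NOT_AUTO_CLEARED})
```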
Alarm Identity
ha-secondary-down
Initial Perceived Severity
critical
Description
The node lost the connection to a secondary node.
Recommended Action
Investigate why the secondary node went down, fix the connectivity issue, and reconnect the secondary to the HA cluster.
Alarm message(s)
  • Lost connection to secondary

Clear condition(s)
This alarm is cleared when the secondary node is reconnected to the HA cluster.
Alarm Identity
missing-transaction-id
Initial Perceived Severity
warning
Description
A device announced in its NETCONF hello message that it supports the transaction-id as defined in http://tail-f.com/yang/netconf-monitoring. However, when NCS tries to read the transaction-id, no data is returned, so the NCS check-sync feature will not work. This is usually caused by misconfigured NACM rules on the managed device.
Recommended Action
Verify the NACM rules on the concerned device.
Alarm message(s)
  • {Reason}

Clear condition(s)
If NCS successfully reads a transaction ID where it had previously failed to do so, the alarm is cleared.
Alarm Identity
ncs-cluster-alarm
Description
Base type for all alarms related to the cluster. This is never reported; the sub-identities for the specific cluster alarms are used in the alarms.
Alarm Identity
ncs-dev-manager-alarm
Description
Base type for all alarms related to the device manager. This is never reported; the sub-identities for the specific device alarms are used in the alarms.
Alarm Identity
ncs-package-alarm
Description
Base type for all alarms related to packages. This is never reported; the sub-identities for the specific package alarms are used in the alarms.
Alarm Identity
ncs-service-manager-alarm
Description
Base type for all alarms related to the service manager. This is never reported; the sub-identities for the specific service alarms are used in the alarms.
Alarm Identity
ncs-snmp-notification-receiver-alarm
Description
Base type for SNMP notification receiver alarms. This is never reported; the sub-identities for specific SNMP notification receiver alarms are used in the alarms.
Alarm Identity
ned-live-tree-connection-failure
Initial Perceived Severity
major
Description
NCS failed to connect to a managed device using one of the optional live-status-protocol NEDs.
Recommended Action
Verify the configuration of the optional NEDs. If the error occurs intermittently, increase connect-timeout.
Alarm message(s)
  • The connection to {dev} was closed

  • Failed to connect to device {dev}: {reason}

Clear condition(s)
If NCS successfully reconnects to the managed device, the alarm is cleared.
Alarm Identity
out-of-sync
Initial Perceived Severity
major
Description
A managed device is out of sync with NCS. Usually this means that the device has been configured out of band from the NCS point of view.
Recommended Action
Inspect the difference with compare-config and reconcile by invoking sync-from or sync-to.
Alarm message(s)
  • Device {dev} is out of sync

  • Out of sync due to no-networking or failed commit-queue commits.

  • got: ~s expected: ~s.

Clear condition(s)
If NCS achieves sync with a device, the alarm is cleared.
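
The out-of-sync and missing-transaction-id entries both stem from comparing transaction IDs. A hedged sketch of that comparison, assuming a hypothetical read_tid callback that returns the device's current transaction ID, or None when the device returns no data: a missing ID corresponds to missing-transaction-id, and a mismatch to out-of-sync with a message in the "got: ~s expected: ~s." format. The check_sync function and the missing-transaction-id message text are invented for illustration.

```python
from typing import Callable, Optional, Tuple


def check_sync(
    device: str,
    expected_tid: str,
    read_tid: Callable[[str], Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Hypothetical check: return (alarm type, message) if the device needs attention."""
    got = read_tid(device)
    if got is None:
        # The device advertised transaction-id support but returned no data.
        return ("missing-transaction-id", f"no transaction id returned by {device}")
    if got != expected_tid:
        # The device changed since NCS last recorded its transaction ID.
        return ("out-of-sync", f"got: {got} expected: {expected_tid}.")
    return None
```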
Alarm Identity
package-load-failure
Initial Perceived Severity
critical
Description
NCS failed to load a package.
Recommended Action
Check the package for the reason.
Alarm message(s)
  • failed to open file {file}: {str}

  • Specific to the concerned package.

Clear condition(s)
If NCS successfully loads a package for which an alarm was previously raised, the alarm is cleared.
Alarm Identity
package-operation-failure
Initial Perceived Severity
critical
Description
A package has some problem with its operation.
Recommended Action
Check the package for the reason.
Clear condition(s)
This alarm is not cleared.
Alarm Identity
receiver-configuration-error
Initial Perceived Severity
major
Description
The snmp-notification-receiver could not set up its configuration, either at startup or when reconfigured. SNMP notifications will now be missed.
Recommended Action
Check the error message and change the configuration accordingly.
Alarm message(s)
  • Configuration has errors.

Clear condition(s)
This alarm is cleared when NCS is configured to successfully receive SNMP notifications.
Alarm Identity
revision-error
Initial Perceived Severity
major
Description
A managed device arrived with a known module, but with a revision that is too new.
Recommended Action
Upgrade the device NED using the new YANG revision in order to use the new features in the device.
Alarm message(s)
  • The device has YANG module revisions not supported by NCS. Use the /devices/device/check-yang-modules action to check which modules that are not compatible.

Clear condition(s)
If all device YANG modules are supported by NCS, the alarm is cleared.
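
A sketch of the revision-error condition, assuming the device's modules and the NED's supported modules are each available as a mapping from YANG module name to revision date. YANG revision dates use the YYYY-MM-DD format, so plain string comparison orders them chronologically. Modules the NED does not know at all are ignored, since this alarm concerns known modules with a too-new revision. too_new_modules and revision_alarm_active are hypothetical names, not NSO functions.

```python
def too_new_modules(device_modules, ned_modules):
    """Hypothetical helper: known modules whose device revision is newer than the NED's.

    Both arguments map YANG module name -> revision date "YYYY-MM-DD"; that
    format sorts chronologically as plain strings.
    """
    return sorted(
        name
        for name, revision in device_modules.items()
        if name in ned_modules and revision > ned_modules[name]
    )


def revision_alarm_active(device_modules, ned_modules):
    # Raised while any known module is too new; cleared once none are.
    return bool(too_new_modules(device_modules, ned_modules))
```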
Alarm Identity
service-activation-failure
Initial Perceived Severity
critical
Description
A service failed during re-deploy.
Recommended Action
Take corrective action, then re-deploy the service.
Alarm message(s)
  • Multiple device errors: {str}

Clear condition(s)
If the service is successfully redeployed, the alarm is cleared.
Alarm Identity
time-violation-alarm
Description
Base type for all alarms related to time violations. This is never reported; the sub-identities for the specific time violation alarms are used in the alarms.
Alarm Identity
transaction-lock-time-violation
Initial Perceived Severity
warning
Description
The transaction lock time exceeded its threshold, and the transaction might be stuck in the critical section. This threshold is configured in /ncs-config/transaction-lock-time-violation-alarm/timeout.
Recommended Action
Investigate whether the transaction is stuck and, if so, interrupt it by closing the user session it is attached to.
Alarm message(s)
  • Transaction lock time exceeded threshold.

Clear condition(s)
This alarm is cleared when the transaction has finished.
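
A sketch of the transaction-lock-time-violation check, assuming the time the lock is taken is recorded and compared against the configured threshold while the lock is held. The sketch uses seconds and an injectable clock for testing; the unit of the real /ncs-config/transaction-lock-time-violation-alarm/timeout setting is not stated in this table. LockTimeWatch is a hypothetical name.

```python
import time


class LockTimeWatch:
    """Hypothetical model of the transaction-lock-time-violation check (seconds)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.acquired_at = None

    def acquire(self):
        self.acquired_at = self.clock()

    def release(self):
        # The transaction has finished: the lock is released and the alarm clears.
        self.acquired_at = None

    def violated(self):
        """True while the lock has been held longer than the threshold."""
        if self.acquired_at is None:
            return False
        return self.clock() - self.acquired_at > self.timeout_s
```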