Monitoring and Alerting

Internal Monitoring

Service monitoring

The platform uses both the systemd and monit daemons to monitor all essential services. Since Sipwise C5 runs in an active/standby mode, not all services run on both nodes at all times: some of them run only on the active node and are stopped on the standby node. The following commands show the most critical services on the platform:

  • ngcp-service summary - to get the list of services and their current status,

  • systemctl status - to get a tree of the services running,

  • systemctl list-units - to get a list of the service states,

  • monit summary - to get the list of services known to monit and their current status,

  • monit status - to get the list of services known to monit with detailed status.

When you perform a stop/start/monitor/unmonitor operation on a service, monit also applies it to the services that depend on that service. Hence, if you stop or unmonitor a service, all services that depend on it will be stopped or unmonitored as well.

For example, the monit stop mysql operation will stop kamailio, sbc, asterisk, prosody and some other services. The recommended way to operate on services, however, is via the ngcp-service wrapper, which abstracts the underlying process monitoring implementation.
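The cascade described above can be pictured as a walk over a dependency graph. The following is only a minimal sketch: the DEPENDENTS map is hypothetical and contains just the mysql example from the text, while the real dependencies are defined in the platform's monit configuration and will differ.

```python
from collections import deque

# Hypothetical reverse-dependency map: service -> services that depend on it.
# Only the mysql entry mirrors the example in the text; it is not the real
# monit configuration of the platform.
DEPENDENTS = {
    "mysql": ["kamailio", "sbc", "asterisk", "prosody"],
    "kamailio": [],
    "sbc": [],
    "asterisk": [],
    "prosody": [],
}


def affected_services(service, dependents=DEPENDENTS):
    """Return the service plus everything that transitively depends on it."""
    seen = [service]
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    # Services that a stop or unmonitor of mysql would touch in this sketch.
    print(affected_services("mysql"))
```

A breadth-first walk is used so that direct dependents are listed before indirect ones; any traversal order would give the same set.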

If a service fails for any reason, either the systemd or the monit daemon will quickly restart it. When that happens, the daemon will send a notification email to the address specified in the config.yml file under the general.adminmail key. It will also send warning emails to this address under certain abnormal conditions, such as high memory consumption (more than 75% in use) or high CPU load.

In order for monit to be able to send emails to the specified address, the local MTA (exim4) must be configured correctly. The CE edition’s handbook contains more information about this in the Installation chapter.

System monitoring backend

The platform uses the Prometheus monitoring backend on new installations and on upgraded systems that have been migrated.

The platform uses various monitoring backend services to monitor many aspects of the system, including CPU, memory, swap, disk, filesystem, network, processes, NTP, Nginx, Redis and MySQL.

The gathered information is stored in VictoriaMetrics, which is a long-term storage backend for Prometheus.

NOTE: Both VictoriaMetrics and Prometheus can act as the Prometheus server implementation, and are mutually exclusive in their execution.

Sipwise C5 specific monitoring via ngcp-witnessd

The platform uses the internal ngcp-witnessd service to monitor Sipwise C5 specific metrics or system metrics currently not tracked by the monitoring backend (via Prometheus exporters), including HA status, MTA, Kamailio, SIP and MySQL.

The gathered information is stored in VictoriaMetrics in the ngcp namespace on its time-series database.

Some of the data gathering can be disabled through the config.yml file (most of it is enabled by default). The affected data points will then either be missing from the database or be initialized with a stub value. This cascades into other subsystems that use this monitoring information, such as Grafana dashboards or SNMP OIDs. The enable/disable flags can be found in the witnessd.gather section.

Monitoring data in the monitoring backend

The platform uses VictoriaMetrics as a long-term Prometheus time series database to store most of the metrics collected in the system.

On a Sipwise C5 each node stores its own metrics and those of its peer node; in addition, on CARRIER systems the management nodes store the metrics for all the nodes in the cluster. On new installations and migrated ones this is done with Prometheus instances on each peer, and a VictoriaMetrics instance on the management node which uses its Prometheus federation and scraping support.

The monitoring data is used by various components of the platform, including ngcp-collective-check, ngcp-snmp-agent and by the statistics dashboard powered by Grafana.

The monitoring data can also be accessed directly by various means. On new installations this can be done with the promtool command-line tool, with the HTTP API using curl (or other HTTP fetchers), or with the NGCP::Prometheus::HTTP Perl module.
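As an illustration of the HTTP API route, the sketch below runs an instant query against the standard Prometheus /api/v1/query endpoint. The base URL (host and port) is an assumption and must be replaced with the address of the monitoring backend on the node being queried; nothing here is specific to the NGCP::Prometheus::HTTP module.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of the Prometheus-compatible server; adjust as needed.
BASE_URL = "http://localhost:9090"


def build_query_url(promql, base_url=BASE_URL):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def instant_query(promql, base_url=BASE_URL, timeout=5):
    """Run an instant query and return the decoded 'result' list."""
    url = build_query_url(promql, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError("query failed: %r" % payload.get("error"))
    return payload["data"]["result"]


if __name__ == "__main__":
    # Requires a reachable server at BASE_URL.
    for sample in instant_query("up"):
        print(sample["metric"], sample["value"])
```

The same URL produced by build_query_url can be passed to curl when a one-off check from the shell is preferred.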

Monitoring metrics

See appendices-main:appendices-main.adoc#prometheus-monitoring-metrics for detailed information about the list of ngcp namespaced metrics stored in the Prometheus monitoring database.

PromQL

See https://prometheus.io/docs/prometheus/latest/querying/basics/ for information about PromQL, the query language used by Prometheus.

To get the list of all metrics for a specific namespace, the following query can be used, with namespace replaced by the actual namespace name (for example ngcp): {__name__=~"^namespace_.+"}
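The namespace query can also be generated programmatically. The sketch below builds the selector for a given namespace and extracts the metric names from an instant-query result list in the standard Prometheus response layout; the function names are illustrative and the sample data in the usage part is made up.

```python
import re

# Prometheus metric names are limited to this character set.
_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def namespace_selector(namespace):
    """Return a PromQL selector matching every metric named '<namespace>_...'."""
    if not _NAME_RE.fullmatch(namespace):
        raise ValueError("not a valid metric name prefix: %r" % namespace)
    return '{__name__=~"^%s_.+"}' % namespace


def metric_names(result):
    """Collect the sorted unique metric names from an instant-query result list."""
    return sorted({sample["metric"]["__name__"] for sample in result})


if __name__ == "__main__":
    print(namespace_selector("ngcp"))
    # Made-up result list in the shape returned by /api/v1/query.
    fake_result = [
        {"metric": {"__name__": "ngcp_example_b"}, "value": [0, "1"]},
        {"metric": {"__name__": "ngcp_example_a"}, "value": [0, "2"]},
    ]
    print(metric_names(fake_result))
```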

Statistics Dashboard

The platform’s administration interface (described in basicconfiguration:basicconfiguration.adoc#administrative-configuration) provides a graphical overview based on Grafana of the most important system health indicators, such as memory usage, load averages and disk usage. VoIP statistics, such as the number of concurrent active calls and the number of provisioned and registered subscribers, are also present.

External Monitoring Using SNMP

Overview and Initial Setup

The Sipwise C5 exports a variety of cluster health data and statistics over the standard SNMP interface. By default, the SNMP interface can only be accessed locally. To make the SNMP data available to an external system, the config.yml file needs to be edited and the list of allowed community names and allowed hosts/IP ranges must be populated. This list can be found under the snmpd.communities key and consists of one or more hashes of name and sources key/values. The name value is the allowed community name, while sources is a list of IP addresses or IP blocks from which requests are allowed.

The SNMP notifications (or traps) can be configured in a similar way to be sent to an external system, by populating the snmpd.trap_communities key with name and targets key/values. The name value is the community name that will be used when sending the trap, while targets is a list of IP addresses to which the trap is sent.

The public communities with the localhost source and target are used for local testing of SNMP functionality. It is recommended that you leave these entries in place. Other legal sources can be formed as single IP addresses or IP blocks in IP/prefix notation, for example 192.168.115.0/24. Other targets can be formed as single IP addresses.
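Before adding entries to these lists, their format can be checked with a few lines of code. The sketch below uses the Python standard library to test whether a value is a single IP address or an IP/prefix block (for sources) or a single IP address (for targets). It is only an illustration of the formats named above: the function names are hypothetical, the localhost entries are outside its scope, and it does not claim to reproduce the validation done by the platform or by snmpd.

```python
import ipaddress


def valid_source(value):
    """True for a single IP address or an IP block in IP/prefix notation."""
    try:
        # strict=False also accepts blocks written with host bits set;
        # whether snmpd accepts those is not covered here.
        ipaddress.ip_network(value, strict=False)
        return True
    except ValueError:
        return False


def valid_target(value):
    """True for a single IP address (no prefix)."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    for candidate in ("192.168.115.0/24", "192.168.115.10", "not-an-ip"):
        print(candidate, valid_source(candidate), valid_target(candidate))
```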

The origin of the SNMP notifications for the SIPWISE MIBs can also be configured with the snmpagent.traps.origin setting. The supported modes are:

  • legacy: The node triggering the condition and its peer (if available) will emit the trap; in addition, the management node pair (if distinct) will also emit the trap. This was the original behavior.

  • mgmt: Only the active management node will emit the trap. This is the current default.

  • distributed: Only the node triggering the condition will emit the trap. For cluster-wide conditions (those that are not node-specific), this mode is equivalent to the mgmt mode.
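The three modes can be summarized as a small decision function. This is a sketch of the semantics as described in the list above, not the agent's implementation; the function name, its parameters and the node names in the usage part are hypothetical. The behavior of legacy mode for cluster-wide conditions is not specified in the text, so the sketch does not model it.

```python
def trap_emitters(mode, trigger, peer, mgmt_pair, active_mgmt, cluster_wide=False):
    """Return the nodes that emit a trap under the given origin mode.

    trigger      node on which the condition was detected
    peer         its HA peer, or None if not available
    mgmt_pair    the management node pair
    active_mgmt  the currently active management node
    cluster_wide True for conditions that are not node-specific
    """
    if mode == "mgmt" or (mode == "distributed" and cluster_wide):
        return [active_mgmt]
    if mode == "distributed":
        return [trigger]
    if mode == "legacy":
        nodes = [trigger]
        if peer is not None:
            nodes.append(peer)
        # The management pair only adds nodes when it is distinct.
        for node in mgmt_pair:
            if node not in nodes:
                nodes.append(node)
        return nodes
    raise ValueError("unknown trap origin mode: %r" % mode)


if __name__ == "__main__":
    # Hypothetical node names for illustration only.
    mgmt = ["web01a", "web01b"]
    for mode in ("legacy", "mgmt", "distributed"):
        print(mode, trap_emitters(mode, "prx01a", "prx01b", mgmt, "web01a"))
```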

The Sipwise C5 supports two types of SNMP traps: event-based traps, sent whenever a state changes, with a single trap per tracked state; and alarm-based traps, sent when problematic conditions arise or clear, with a different trap per state group, where the trap severity is included as part of the trap itself (in addition to the usual convention of documenting it in the MIB). These two types can be enabled or disabled independently (both are enabled by default), depending on the type of monitoring intended.

The event-based traps have been supported for longer, while the alarm-based traps were introduced in mr9.5, so if there are interoperability requirements with various Sipwise C5 versions, event-based traps have better availability, although with different properties.

To locally check if SNMP is working correctly, execute the command snmpwalk -v2c -cpublic localhost . (note the trailing dot). This will generate a long list of raw SNMP OIDs and their values, provided that the default SNMP community key has been left in place. Alternatively the ngcp-systems-tests program checks whether several expected SNMP OIDs are present.

To locally check if SNMP notifications (or traps) are working correctly, enable the snmptrapd daemon, which will be configured by default to catch the traps sent by the localhost SNMP agent. The traps will show up on /var/log/ngcp/snmp-trap.log, and a couple of traps can be generated by running ngcp-service restart snmpd. Even though traps generated by ngcp-snmp-agent are logged on /var/log/ngcp/snmp-agent.log, because the service responsible for sending out the traps is snmpd, checking with snmptrapd is always an additional safety check in case of problems.

To get information from SNMP tables, you can use the command snmptable -v2c -cpublic -Ci localhost TABLE-OID, where TABLE-OID could be for example procTable.

When using snmptable you might want to use the -S option in less (either when calling it or typing it on its prompt) to get proper tabular output that does not fold on terminal end.

SNMP version 1 and version 2c are supported.

Details

There are two kinds of information that can be retrieved from SNMP OIDs (Object Identifiers). The first one is the native Sipwise C5 cluster overview from Sipwise C5 MIBs (Management Information Bases), which is available from the management nodes. The second is from the stock snmpd implementing the UCD (University of California, Davis) MIBs, which requires querying each individual node.

Sipwise C5 OIDs

The entire Sipwise C5 cluster can be monitored from the management nodes by using the SIPWISE-NGCP-MIB, SIPWISE-NGCP-MONITOR-MIB and SIPWISE-NGCP-ALARMS-MIB. These OIDs are rooted at the Sipwise C5 slot .1.3.6.1.4.1.34274.1.*.
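When an external monitoring system mixes both kinds of OIDs, it can help to tell them apart by their numeric prefix, since Sipwise C5 OIDs are queried on the management nodes while UCD OIDs are queried on each individual node. The sketch below classifies a numeric OID using the Sipwise C5 root given above and the UCD-SNMP-MIB root .1.3.6.1.4.1.2021 quoted in the UCD OIDs section; the function names are illustrative.

```python
# Roots quoted in this chapter: the Sipwise C5 slot and the UCD-SNMP-MIB tree.
SIPWISE_ROOT = (1, 3, 6, 1, 4, 1, 34274, 1)
UCD_ROOT = (1, 3, 6, 1, 4, 1, 2021)


def parse_oid(text):
    """Turn '.1.3.6.1' or '1.3.6.1' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().strip(".").split("."))


def oid_family(oid):
    """Classify a numeric OID as 'sipwise', 'ucd' or 'other'."""
    parts = parse_oid(oid)
    if parts[:len(SIPWISE_ROOT)] == SIPWISE_ROOT:
        return "sipwise"
    if parts[:len(UCD_ROOT)] == UCD_ROOT:
        return "ucd"
    return "other"


if __name__ == "__main__":
    for oid in (".1.3.6.1.4.1.34274.1.2.1", ".1.3.6.1.4.1.2021.4.3.0"):
        print(oid, oid_family(oid))
```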

The MIBs are self-documented, and can be found as part of the ngcp-snmp-mibs package (running dpkg -S 'SIPWISE*MIB' will list their pathnames). The Sipwise C5 SNMP Agent is a part of the ngcp-snmp-agent package, which is installed by default and works out-of-the-box as long as the snmpd has been properly configured.

The SIPWISE-NGCP-MIB acts as the root MIB and provides information about the cluster licensing and layout (which is mostly static data about each node, such as node name, its IP address, its roles, etc.) and information required to access the OIDs from the other MIBs. The clusterTable defines the node layout of the cluster, and its cluster node index (cnIndex) is used by many of the other tables to index entries within that specific cluster node (for example within the procTable in the monitor MIB).

The SIPWISE-NGCP-MONITOR-MIB provides current monitoring information, global health conditions, the number of provisioned and registered subscribers and devices. It also provides per node information (independently of the number of nodes or their names) on their filesystem, processes, databases, system load, memory, HA status, MTA queues, etc. In addition it defines the event-based traps.

The SIPWISE-NGCP-ALARMS-MIB defines the alarm-based traps.

OIDs under the following trees are not yet implemented: ngcpMonitorFraud, ngcpMonitorPerformance.sipStatsTable.sipCallAttemptsPerSecond. Deprecated OIDs are currently implemented but will eventually be obsoleted. Obsolete OIDs are not implemented and won’t be in the future.

The Sipwise C5 SNMP Agent uses Redis and Prometheus as data sources. This data is essential for accurate and complete monitoring data in the SNMP OID tree. In addition, the Redis database must be available on a shared IP address, so that ngcp-witnessd can always write to it.

UCD OIDs

All basic system health variables (such as memory, disk, swap, CPU usage, network statistics, process lists, etc.) for every node can also be found in standard OID slots from standard MIBs from each node. For example, memory statistics can be found through the UCD-SNMP-MIB in OIDs such as memTotalSwap.0, memAvailSwap.0, memTotalReal.0, memAvailReal.0, etc., which translate to numeric OIDs .1.3.6.1.4.1.2021.4.*. In fact, UCD-SNMP-MIB is a useful MIB for overall non-centralized system health checks.

Additionally, there is a list of specially monitored processes, also found through the UCD-SNMP-MIB. UCD-SNMP-MIB::prNames (.1.3.6.1.4.1.2021.2.1.2) gives the list of monitored processes, prCount (.1.3.6.1.4.1.2021.2.1.5) is how many of each process are running and prErrorFlag (.1.3.6.1.4.1.2021.2.1.100) gives a 0/1 error indication (with prErrMessage (.1.3.6.1.4.1.2021.2.1.101) providing an explanation of any error).
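To show how these four columns fit together, the sketch below parses numeric snmpwalk output for the process table (for example as printed with the -On option) and lists the processes whose error flag is raised. The SAMPLE text is made up for illustration and is not captured from a real system.

```python
import re

# Matches lines such as ".1.3.6.1.4.1.2021.2.1.2.1 = STRING: kamailio".
PR_LINE = re.compile(
    r"^\.?1\.3\.6\.1\.4\.1\.2021\.2\.1\.(\d+)\.(\d+) = (?:[\w-]+: )?(.*)$"
)
# Column numbers of the process table as listed in the text above.
COLUMNS = {2: "name", 5: "count", 100: "error_flag", 101: "error_message"}

# Made-up sample output, for illustration only.
SAMPLE = """\
.1.3.6.1.4.1.2021.2.1.2.1 = STRING: kamailio
.1.3.6.1.4.1.2021.2.1.2.2 = STRING: asterisk
.1.3.6.1.4.1.2021.2.1.5.1 = INTEGER: 1
.1.3.6.1.4.1.2021.2.1.5.2 = INTEGER: 0
.1.3.6.1.4.1.2021.2.1.100.1 = INTEGER: 0
.1.3.6.1.4.1.2021.2.1.100.2 = INTEGER: 1
.1.3.6.1.4.1.2021.2.1.101.1 = ""
.1.3.6.1.4.1.2021.2.1.101.2 = STRING: No asterisk process running
"""


def parse_pr_table(text):
    """Group the walked columns by row index into one dict per process."""
    rows = {}
    for line in text.splitlines():
        match = PR_LINE.match(line.strip())
        if not match:
            continue
        column, index = int(match.group(1)), int(match.group(2))
        field = COLUMNS.get(column)
        if field is None:
            continue
        value = match.group(3).strip().strip('"')
        if field in ("count", "error_flag"):
            digits = re.search(r"\d+", value)  # accepts "0" and "noError(0)"
            value = int(digits.group()) if digits else None
        rows.setdefault(index, {})[field] = value
    return [rows[i] for i in sorted(rows)]


def failing(rows):
    """Names of the processes whose error flag is set to 1."""
    return [row["name"] for row in rows if row.get("error_flag") == 1]


if __name__ == "__main__":
    print(failing(parse_pr_table(SAMPLE)))
```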

Some of these processes are not supposed to be running on the standby node, so you will see the error flag raised there. A possible solution is to run these SNMP checks against the shared service IP of the cluster. See architecture:architecture.adoc#high-availability for more information.

Furthermore, Sipwise C5 used to provide platform specific information via the UCD-SNMP-MIB custom external extension OIDs. These have been superseded by the Sipwise MIBs, and any monitoring that still relies on them needs to be migrated to the latter. The names of these OIDs could be found under the UCD-SNMP-MIB::extNames (.1.3.6.1.4.1.2021.8.1.2) tree, with extOutput (.1.3.6.1.4.1.2021.8.1.101) providing the output (one line) from each check and extResult (.1.3.6.1.4.1.2021.8.1.100) the exit code from each check. The following table gives a rough mapping for that migration:

UCD OID name | UCD check name | SIPWISE-NGCP OID name
UCD-SNMP-MIB::extNames.1 | collective_check | SIPWISE-NGCP-MONITOR-MIB::ngcpCollectiveCheckResult and SIPWISE-NGCP-MONITOR-MIB::ngcpCollectiveCheckOutput
UCD-SNMP-MIB::extNames.2 | sip_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::sipResponsiveness.*
UCD-SNMP-MIB::extNames.3 | sip_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::sipResponsiveness.*
UCD-SNMP-MIB::extNames.4 | mysql_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::dbQueryRate.*
UCD-SNMP-MIB::extNames.5 | mysql_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::dbQueryRate.*
UCD-SNMP-MIB::extNames.6 | mysql_replication_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::dbReplDelay.*
UCD-SNMP-MIB::extNames.7 | mysql_replication_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::dbReplDelay.*
UCD-SNMP-MIB::extNames.8 | mpt_check_sp1 | Obsolete
UCD-SNMP-MIB::extNames.9 | mpt_check_sp2 | Obsolete
UCD-SNMP-MIB::extNames.10 | exim_queue_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::mailQueue.*
UCD-SNMP-MIB::extNames.11 | exim_queue_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::mailQueue.*
UCD-SNMP-MIB::extNames.12 | provisioned_subscribers_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::ngcpClusterProvSubs
UCD-SNMP-MIB::extNames.13 | provisioned_subscribers_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::ngcpClusterProvSubs
UCD-SNMP-MIB::extNames.14 | kam_dialog_active_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::sipDialogActive.*
UCD-SNMP-MIB::extNames.15 | kam_dialog_active_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::sipDialogActive.*
UCD-SNMP-MIB::extNames.16 | kam_dialog_early_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::sipEarlyMedia.*
UCD-SNMP-MIB::extNames.17 | kam_dialog_early_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::sipEarlyMedia.*
UCD-SNMP-MIB::extNames.18 | kam_dialog_type_local_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::sipDialogLocal.*
UCD-SNMP-MIB::extNames.19 | kam_dialog_type_local_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::sipDialogLocal.*
UCD-SNMP-MIB::extNames.20 | kam_dialog_type_relay_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::sipDialogRelay.*
UCD-SNMP-MIB::extNames.21 | kam_dialog_type_relay_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::sipDialogRelay.*
UCD-SNMP-MIB::extNames.22 | kam_dialog_type_incoming_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::sipDialogIncoming.*
UCD-SNMP-MIB::extNames.23 | kam_dialog_type_incoming_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::sipDialogIncoming.*
UCD-SNMP-MIB::extNames.24 | kam_dialog_type_outgoing_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::sipDialogOutgoing.*
UCD-SNMP-MIB::extNames.25 | kam_dialog_type_outgoing_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::sipDialogOutgoing.*
UCD-SNMP-MIB::extNames.26 | kam_usrloc_regusers_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegSubs
UCD-SNMP-MIB::extNames.27 | kam_usrloc_regusers_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegSubs
UCD-SNMP-MIB::extNames.28 | kam_usrloc_regdevices_check_sp1 | SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegDevs
UCD-SNMP-MIB::extNames.29 | kam_usrloc_regdevices_check_sp2 | SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegDevs
UCD-SNMP-MIB::extNames.30 | mysql_replication_discrepancies_check_sp1 | Obsolete
UCD-SNMP-MIB::extNames.31 | mysql_replication_discrepancies_check_sp2 | Obsolete
UCD-SNMP-MIB::extNames.32 | sip_check_self | SIPWISE-NGCP-MONITOR-MIB::sipResponsiveness.*
UCD-SNMP-MIB::extNames.33 | mysql_check_self | SIPWISE-NGCP-MONITOR-MIB::dbQueryRate.*
UCD-SNMP-MIB::extNames.34 | mysql_replication_check_self | SIPWISE-NGCP-MONITOR-MIB::dbReplDelay.*
UCD-SNMP-MIB::extNames.35 | mpt_check_self | Obsolete
UCD-SNMP-MIB::extNames.36 | exim_queue_check_self | SIPWISE-NGCP-MONITOR-MIB::mailQueue.*
UCD-SNMP-MIB::extNames.37 | provisioned_subscribers_check_self | SIPWISE-NGCP-MONITOR-MIB::ngcpClusterProvSubs
UCD-SNMP-MIB::extNames.38 | kam_dialog_active_check_self | SIPWISE-NGCP-MONITOR-MIB::sipDialogActive.*
UCD-SNMP-MIB::extNames.39 | kam_dialog_early_check_self | SIPWISE-NGCP-MONITOR-MIB::sipEarlyMedia.*
UCD-SNMP-MIB::extNames.40 | kam_dialog_type_local_check_self | SIPWISE-NGCP-MONITOR-MIB::sipDialogLocal.*
UCD-SNMP-MIB::extNames.41 | kam_dialog_type_relay_check_self | SIPWISE-NGCP-MONITOR-MIB::sipDialogRelay.*
UCD-SNMP-MIB::extNames.42 | kam_dialog_type_incoming_check_self | SIPWISE-NGCP-MONITOR-MIB::sipDialogIncoming.*
UCD-SNMP-MIB::extNames.43 | kam_dialog_type_outgoing_check_self | SIPWISE-NGCP-MONITOR-MIB::sipDialogOutgoing.*
UCD-SNMP-MIB::extNames.44 | kam_usrloc_regusers_check_self | SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegSubs
UCD-SNMP-MIB::extNames.45 | kam_usrloc_regdevices_check_self | SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegDevs
UCD-SNMP-MIB::extNames.46 | mysql_replication_discrepancies_check_self | Obsolete
UCD-SNMP-MIB::extNames.47 | kam_dialog_type_local_check_prx0X | SIPWISE-NGCP-MONITOR-MIB::sipDialogLocal.*
UCD-SNMP-MIB::extNames.48 | kam_dialog_type_relay_check_prx0X | SIPWISE-NGCP-MONITOR-MIB::sipDialogRelay.*
UCD-SNMP-MIB::extNames.49 | kam_dialog_type_incoming_check_prx0X | SIPWISE-NGCP-MONITOR-MIB::sipDialogIncoming.*
UCD-SNMP-MIB::extNames.50 | kam_dialog_type_outgoing_check_prx0X | SIPWISE-NGCP-MONITOR-MIB::sipDialogOutgoing.*
UCD-SNMP-MIB::extNames.51 | kam_dialog_active_check_prx0X | SIPWISE-NGCP-MONITOR-MIB::sipDialogActive.*
UCD-SNMP-MIB::extNames.52 | kam_dialog_early_check_prx0X | SIPWISE-NGCP-MONITOR-MIB::sipEarlyMedia.*
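
Because the mapping repeats for the _sp1, _sp2, _self and _prx0X variants of each check, it can be condensed into a lookup keyed by the check name without its suffix. The sketch below is one way to script a migration inventory from the table above. The MIGRATION dictionary and the function name are illustrative, an empty tuple stands for the checks listed as Obsolete, and the dialog OID names are assumed to follow the sipDialog... spelling throughout.

```python
import re

MIB = "SIPWISE-NGCP-MONITOR-MIB::"

# Legacy check name (node suffix removed) -> replacement OID names.
# An empty tuple marks checks listed as Obsolete in the table.
MIGRATION = {
    "collective_check": ("ngcpCollectiveCheckResult", "ngcpCollectiveCheckOutput"),
    "sip_check": ("sipResponsiveness",),
    "mysql_check": ("dbQueryRate",),
    "mysql_replication_check": ("dbReplDelay",),
    "mpt_check": (),
    "exim_queue_check": ("mailQueue",),
    "provisioned_subscribers_check": ("ngcpClusterProvSubs",),
    "kam_dialog_active_check": ("sipDialogActive",),
    "kam_dialog_early_check": ("sipEarlyMedia",),
    "kam_dialog_type_local_check": ("sipDialogLocal",),
    "kam_dialog_type_relay_check": ("sipDialogRelay",),
    "kam_dialog_type_incoming_check": ("sipDialogIncoming",),
    "kam_dialog_type_outgoing_check": ("sipDialogOutgoing",),
    "kam_usrloc_regusers_check": ("ngcpClusterRegSubs",),
    "kam_usrloc_regdevices_check": ("ngcpClusterRegDevs",),
    "mysql_replication_discrepancies_check": (),
}

# Node suffixes used by the legacy check names: _sp1, _sp2, _self, _prx0X.
SUFFIX = re.compile(r"_(sp\d+|self|prx\w+)$")


def replacement_for(check_name):
    """Return the replacement OID names for a legacy UCD check name.

    Raises KeyError for check names that are not in the table.
    """
    base = SUFFIX.sub("", check_name)
    return [MIB + name for name in MIGRATION[base]]


if __name__ == "__main__":
    for legacy in ("sip_check_sp2", "mpt_check_self", "kam_dialog_early_check_prx0X"):
        print(legacy, "->", replacement_for(legacy) or "Obsolete")
```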