17. Monitoring and Alerting

17.1. Internal Monitoring

17.1.1. Service monitoring

The platform uses both the systemd and monit daemons to monitor all essential services. Since Sipwise C5 runs in active/standby mode, not all services run on both nodes at all times: some run only on the active node and are stopped on the standby node. The following commands show the most critical services on the platform:

ngcp-service summary - to get the list of services and their current status,
systemctl status - to get a tree of the services running,
systemctl list-units - to get a list of the service states,
monit summary - to get the list of services known to monit and their current status,
monit status - to get the list of services known to monit with detailed status.

important	When you perform a stop/start/monitor/unmonitor operation on a service, monit also applies it to the services that depend on that service. Hence, if you stop or unmonitor a service, all services that depend on it will be stopped or unmonitored as well.

For example, the monit stop mysql operation will stop kamailio, sbc, asterisk, prosody and some other services. The recommended way to operate on services, however, is via the ngcp-service wrapper, which takes care of abstracting the underlying process monitoring implementation.

If any service fails for whatever reason, either the systemd or the monit daemon will quickly restart it. When that happens, the daemon will send a notification email to the address specified in the config.yml file under the general.adminmail key. It will also send warning emails to this address under certain abnormal conditions, such as high memory consumption (more than 75% used) or high CPU load.

important	In order for monit to be able to send emails to the specified address, the local MTA (exim4) must be configured correctly. The CE edition’s handbook contains more information about this in the Installation chapter.

17.1.2. System monitoring backend

The platform uses the Prometheus monitoring backend on new installations and on upgraded systems that have been migrated. On older systems the monitoring backend was InfluxDB, which is now deprecated.

The platform uses various monitoring backend services to monitor many aspects of the system, including CPU, memory, swap, disk, filesystem, network, processes, NTP, Nginx, Redis and MySQL.

The gathered information is stored in VictoriaMetrics, which is a long-term storage backend for Prometheus. Note that both VictoriaMetrics and Prometheus can act as the Prometheus server implementation, and that they are mutually exclusive in their execution. On systems still using InfluxDB, the information is stored in the telegraf database.

17.1.3. Sipwise C5 specific monitoring via ngcp-witnessd

The platform uses the internal ngcp-witnessd service to monitor Sipwise C5 specific metrics or system metrics currently not tracked by the monitoring backend (either Prometheus exporters or the telegraf service when using the deprecated InfluxDB), including HA status, MTA, Kamailio, SIP and MySQL.

The gathered information is stored in VictoriaMetrics in the ngcp namespace, or in InfluxDB in the ngcp database.

tip	Some of the data gathering can be disabled through the config.yml file (most of it is enabled by default). The corresponding data points will then either be missing from the database or be initialized with a stub value. This cascades into other subsystems that use this monitoring information, such as Grafana dashboards or SNMP OIDs. The enable/disable flags can be found in the witnessd.gather section.

17.1.4. Monitoring data in the monitoring backend

The platform uses VictoriaMetrics as a long-term Prometheus time series database to store most of the metrics collected in the system. On systems still using InfluxDB, the time series database role is filled by InfluxDB itself.

On a Sipwise C5, each node stores its own metrics and those of its peer node. In addition, on CARRIER systems the management nodes store the metrics for all the nodes in the cluster. On new installations and migrated ones this is done with Prometheus instances on each peer, and a VictoriaMetrics instance on the management node, which uses its Prometheus federation and scraping support. On older installations this is done with influxdb-relay, which listens for InfluxDB writes and multiplexes them to the local node and any other node necessary.

The monitoring data is used by various components of the platform, including ngcp-collective-check, ngcp-snmp-agent and by the statistics dashboard powered by Grafana.

The monitoring data can also be accessed directly by various means.

On new installations: by using the promtool command-line tool; or by using the HTTP API with curl (or other HTTP fetchers), or with the NGCP::Prometheus::HTTP Perl module.

On old installations: by using the influx command-line tool in CLI or TUI modes; by using the ngcp-influxdb-extract wrapper, which provides two convenience commands to run arbitrary queries or to fetch the last value for a measurement’s field; or by using the HTTP API with curl (or other HTTP fetchers), or with the NGCP::InfluxDB::HTTP Perl module.

17.1.4.1. Monitoring metrics

See Section 4, “Prometheus monitoring metrics” for detailed information about the list of ngcp namespaced metrics stored in the Prometheus monitoring database.

See Section 5, “InfluxDB monitoring keys” for detailed information about the list of data stored in the InfluxDB ngcp monitoring database.

17.1.4.2. PromQL

See https://prometheus.io/docs/prometheus/latest/querying/basics/ for information about PromQL, the query language used by Prometheus.

tip	To get the list of all metrics for a specific namespace, the following query can be used: `{__name__=~"^namespace_.+"}`.
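As a minimal sketch of how such a query could be sent through the HTTP API mentioned earlier, the following Python snippet builds a request URL for the standard Prometheus instant-query endpoint (`/api/v1/query`) and extracts the metric names from a response. The base URL, the port and the sample metric names are assumptions made for illustration; the address actually used on a given system has to be taken from its configuration.

```python
import json
from typing import List
from urllib.parse import urlencode


def namespace_selector(namespace: str) -> str:
    """Build the PromQL selector matching every metric in a namespace."""
    return '{__name__=~"^%s_.+"}' % namespace


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a URL for the Prometheus instant-query endpoint."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def metric_names(response_body: str) -> List[str]:
    """Return the sorted, de-duplicated metric names found in a query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error"))
    return sorted({item["metric"]["__name__"] for item in payload["data"]["result"]})


if __name__ == "__main__":
    # Hypothetical address; adjust host and port to the local monitoring backend.
    print(instant_query_url("http://localhost:9090", namespace_selector("ngcp")))
```

The printed URL can be fetched with curl or any other HTTP client, and the JSON body of the reply can be passed to `metric_names`.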

17.1.4.3. InfluxQL

See https://docs.influxdata.com/influxdb/v1.1/query_language/spec/ for information about InfluxQL, the query language used by InfluxDB.

tip	To get the list of all measurements for a specific database the following query can be used `SHOW MEASUREMENTS`.
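For systems still on InfluxDB, a comparable sketch can be written against the InfluxDB 1.x HTTP API, which accepts InfluxQL through the `/query` endpoint with the `db` and `q` parameters. The base URL and the measurement names in the usage note are assumptions for illustration only.

```python
import json
from typing import List
from urllib.parse import urlencode


def influx_query_url(base_url: str, database: str, influxql: str) -> str:
    """Build a URL for the InfluxDB 1.x /query endpoint."""
    params = urlencode({"db": database, "q": influxql})
    return base_url.rstrip("/") + "/query?" + params


def measurement_names(response_body: str) -> List[str]:
    """Extract the measurement names from a SHOW MEASUREMENTS response."""
    payload = json.loads(response_body)
    names = []
    for result in payload.get("results", []):
        if "error" in result:
            raise ValueError(result["error"])
        for series in result.get("series", []):
            # Each row of the "measurements" series holds one name.
            names.extend(row[0] for row in series["values"])
    return names


if __name__ == "__main__":
    # Hypothetical address; adjust host and port to the local InfluxDB instance.
    print(influx_query_url("http://localhost:8086", "ngcp", "SHOW MEASUREMENTS"))
```

The same helper works for the telegraf database by passing `"telegraf"` as the database name.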

17.2. Statistics Dashboard

The platform’s administration interface (described in Section 5, “VoIP Service Configuration Scenario”) provides a graphical overview, based on Grafana, of the most important system health indicators, such as memory usage, load averages and disk usage. VoIP statistics, such as the number of concurrent active calls and the number of provisioned and registered subscribers, are also shown.

17.3. External Monitoring Using SNMP

17.3.1. Overview and Initial Setup

The Sipwise C5 exports a variety of cluster health data and statistics over the standard SNMP interface. By default, the SNMP interface can only be accessed locally. To provide the SNMP data to an external system, edit the config.yml file and populate the list of allowed community names and allowed hosts/IP ranges. This list can be found under the snmpd.communities key and consists of one or more hashes with name and sources key/values. The name is the allowed community name, while sources is a list of IP addresses or IP blocks from which requests are allowed.

The SNMP notifications (or traps) can be configured in a similar way to be sent to an external system, by populating the snmpd.trap_communities key with name and targets key/values. The name is the community value that will be used when sending the trap, while targets is a list of IP addresses to which the trap is sent.

The public communities with the localhost source and target are used for local testing of SNMP functionality. It is recommended that you leave these entries in place. Other legal sources can be formed as single IP addresses or IP blocks in IP/prefix notation, for example 192.168.115.0/24. Other targets can be formed as single IP addresses.
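The source matching described above can be sketched in Python as follows. This is an illustration of the name/sources structure only, not the way snmpd evaluates its access rules; the `monitoring` community name is hypothetical, and treating the `localhost` source as "any loopback address" is an assumption.

```python
import ipaddress
from typing import Dict, List


def source_matches(source: str, client_ip: str) -> bool:
    """Check one sources entry (single IP, IP/prefix block or 'localhost')."""
    addr = ipaddress.ip_address(client_ip)
    if source == "localhost":
        # Assumption: 'localhost' stands for any loopback address.
        return addr.is_loopback
    # strict=False lets a single address be read as a one-host network.
    return addr in ipaddress.ip_network(source, strict=False)


def allowed_communities(communities: List[Dict], client_ip: str) -> List[str]:
    """Return the community names whose sources cover the client address."""
    return [
        community["name"]
        for community in communities
        if any(source_matches(src, client_ip) for src in community["sources"])
    ]


if __name__ == "__main__":
    # Hypothetical data mirroring the name/sources hashes described above.
    communities = [
        {"name": "public", "sources": ["localhost"]},
        {"name": "monitoring", "sources": ["192.168.115.0/24", "10.0.0.5"]},
    ]
    print(allowed_communities(communities, "192.168.115.20"))
```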

The origin of the SNMP notifications for the SIPWISE MIB can also be configured with the snmpagent.traps_origin setting. The supported modes are:

legacy: The node triggering the condition and its peer (if available) will emit the trap, in addition the management node pair (if distinct) will also emit the trap. This is the original behavior and the current default.
mgmt: Only the active management node will emit the trap.
distributed: Only the node triggering the condition will emit the trap. For cluster-wide conditions, this mode is equivalent to the mgmt mode.
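The three modes can be summarized with a small decision function. This is only a model of the descriptions above, not the agent's implementation; the function signature and the node names in the usage example are hypothetical, and the behavior of the legacy mode for cluster-wide conditions is not modeled because the text does not describe it.

```python
from typing import List, Optional, Sequence


def trap_emitters(
    mode: str,
    trigger: str,
    peer: Optional[str],
    mgmt_pair: Sequence[str],
    active_mgmt: str,
    cluster_wide: bool = False,
) -> List[str]:
    """Return the nodes expected to emit a trap for a given traps_origin mode."""
    if mode == "legacy":
        # Triggering node, its peer if there is one, then the management
        # pair unless those nodes are already in the list.
        emitters = [trigger] + ([peer] if peer else [])
        emitters += [node for node in mgmt_pair if node not in emitters]
        return emitters
    if mode == "mgmt" or (mode == "distributed" and cluster_wide):
        # Cluster-wide conditions in distributed mode behave like mgmt mode.
        return [active_mgmt]
    if mode == "distributed":
        return [trigger]
    raise ValueError("unknown traps_origin mode: %r" % mode)


if __name__ == "__main__":
    # Hypothetical node names for a cluster with a separate management pair.
    print(trap_emitters("legacy", "prx01a", "prx01b", ("web01a", "web01b"), "web01a"))
```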

tip	To locally check if SNMP is working correctly, execute the command `snmpwalk -v2c -cpublic localhost .` (note the trailing dot). This will generate a long list of raw SNMP OIDs and their values, provided that the `default` SNMP community key has been left in place.

tip	To locally check if SNMP notifications (or traps) are working correctly, install the snmptrapd package, which will be configured by default to catch the traps sent by the localhost SNMP agent. The traps will show up in `/var/log/daemon.log`, and a couple of traps can be generated by running `ngcp-service restart snmpd`.

info	SNMP version 1 and version 2c are supported.

17.3.2. Details

There are two kinds of information that can be retrieved from SNMP OIDs (Object Identifiers). The first one is the native Sipwise C5 cluster overview from Sipwise C5 MIBs (Management Information Bases), which is available from the management nodes. The second is from the stock snmpd implementing the UCD (University of California, Davis) MIBs, which requires querying each individual node.

17.3.2.1. Sipwise C5 OIDs

The entire Sipwise C5 cluster can be monitored from the management nodes by using the SIPWISE-NGCP-MIB and SIPWISE-NGCP-MONITOR-MIB (SIPWISE-NGCP-STATS-MIB is deprecated and should not be used anymore). These OIDs are rooted at Sipwise C5 slot .1.3.6.1.4.1.34274.1.*.

The MIBs are self-documented, and can be found as part of the ngcp-snmp-mibs package (running dpkg -S SIPWISE*MIB will list their pathnames). The Sipwise C5 SNMP Agent is a part of the ngcp-snmp-agent package, which is installed by default and works out-of-the-box as long as the snmpd has been properly configured.

The SIPWISE-NGCP-MIB acts as the root MIB and provides information about the cluster licensing and layout (which is mostly static data about each node, such as node name, its IP address, its roles, etc.) and information required to access the OIDs from the other MIBs.

The SIPWISE-NGCP-MONITOR-MIB provides current monitoring information, global health conditions, the number of provisioned and registered subscribers and devices. It also provides per node information (independently of the number of nodes or their names) on their filesystem, processes, databases, system load, memory, HA status, MTA queues, etc.

The SIPWISE-NGCP-STATS-MIB is deprecated and has been superseded by the SIPWISE-NGCP-MONITOR-MIB.

info	OIDs under the following trees are not yet implemented: ngcpMonitorFraud, ngcpMonitorPerformance.sipStatsTable.sipCallAttemptsPerSecond. Deprecated OIDs are currently implemented but will eventually be obsoleted. Obsolete OIDs are not implemented and won’t be in the future.

info	The Sipwise C5 SNMP Agent uses Redis and Prometheus or InfluxDB as data sources. This data is essential for accurate and complete monitoring data in the SNMP OID tree. In addition, the Redis database must be available on a shared IP address, so that ngcp-witnessd can always write to it.

17.3.2.2. UCD OIDs

All basic system health variables (such as memory, disk, swap, CPU usage, network statistics, process lists, etc.) for every node can also be found in standard OID slots from standard MIBs from each node. For example, memory statistics can be found through the UCD-SNMP-MIB in OIDs such as memTotalSwap.0, memAvailSwap.0, memTotalReal.0, memAvailReal.0, etc., which translate to numeric OIDs .1.3.6.1.4.1.2021.4.*. In fact, UCD-SNMP-MIB is a useful MIB for overall non-centralized system health checks.
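As an illustration of how these values might be used by an external check, the sketch below derives a real-memory usage percentage from memTotalReal and memAvailReal. The numeric OIDs are the standard UCD-SNMP-MIB ones; the input dictionary stands in for whatever an SNMP client returns, and the sample values are made up.

```python
from typing import Mapping

# Standard UCD-SNMP-MIB scalar OIDs under .1.3.6.1.4.1.2021.4
MEM_TOTAL_REAL = ".1.3.6.1.4.1.2021.4.5.0"
MEM_AVAIL_REAL = ".1.3.6.1.4.1.2021.4.6.0"


def real_memory_used_percent(values: Mapping[str, int]) -> float:
    """Percentage of real memory that is not reported as available."""
    total = int(values[MEM_TOTAL_REAL])
    available = int(values[MEM_AVAIL_REAL])
    if total <= 0:
        raise ValueError("memTotalReal must be positive")
    return 100.0 * (total - available) / total


if __name__ == "__main__":
    # Made-up sample values, as an SNMP GET on one node might return them.
    sample = {MEM_TOTAL_REAL: 8000000, MEM_AVAIL_REAL: 2000000}
    print(real_memory_used_percent(sample))
```

Because these OIDs come from the stock snmpd, such a check has to be run against each node individually.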

Additionally, there is a list of specially monitored processes, also found through the UCD-SNMP-MIB. UCD-SNMP-MIB::prNames (.1.3.6.1.4.1.2021.2.1.2) gives the list of monitored processes, prCount (.1.3.6.1.4.1.2021.2.1.5) is how many of each process are running and prErrorFlag (.1.3.6.1.4.1.2021.2.1.100) gives a 0/1 error indication (with prErrMessage (.1.3.6.1.4.1.2021.2.1.101) providing an explanation of any error).
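The four columns above belong to the same table and share a row index, so they can be joined per process. The following sketch does this for a dictionary of numeric OIDs and values, such as one built from the output of an SNMP walk. The helper names and the sample process entries are hypothetical.

```python
from typing import Dict, List, Mapping, Tuple

PR_TABLE = ".1.3.6.1.4.1.2021.2.1"
PR_COLUMNS = {2: "prNames", 5: "prCount", 100: "prErrorFlag", 101: "prErrMessage"}


def process_rows(varbinds: Mapping[str, object]) -> List[Dict[str, object]]:
    """Group process-table values by row index, ordered by index."""
    rows: Dict[int, Dict[str, object]] = {}
    prefix = PR_TABLE + "."
    for oid, value in varbinds.items():
        if not oid.startswith(prefix):
            continue  # not part of the process table
        column, _, index = oid[len(prefix):].partition(".")
        name = PR_COLUMNS.get(int(column))
        if name and index:
            rows.setdefault(int(index), {})[name] = value
    return [rows[i] for i in sorted(rows)]


def failing_processes(varbinds: Mapping[str, object]) -> List[Tuple[object, object]]:
    """Return (name, message) for every row whose error flag is raised."""
    return [
        (row.get("prNames"), row.get("prErrMessage", ""))
        for row in process_rows(varbinds)
        if int(row.get("prErrorFlag", 0)) != 0
    ]


if __name__ == "__main__":
    # Made-up sample of two monitored processes.
    sample = {
        PR_TABLE + ".2.1": "kamailio", PR_TABLE + ".5.1": 8,
        PR_TABLE + ".100.1": 0, PR_TABLE + ".101.1": "",
        PR_TABLE + ".2.2": "asterisk", PR_TABLE + ".5.2": 0,
        PR_TABLE + ".100.2": 1, PR_TABLE + ".101.2": "No asterisk process running",
    }
    print(failing_processes(sample))
```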

tip	Some of these processes are not supposed to be running on the standby node, so you will see the error flag raised there. A possible solution is to run these SNMP checks against the shared service IP of the cluster. See Section 2.7, “High Availability and Fail-Over” for more information.

important	Furthermore, Sipwise C5 used to provide platform-specific information via the UCD-SNMP-MIB custom external extension OIDs. These have been superseded by the Sipwise MIBs, and any monitoring that still relies on them needs to be migrated to the latter. The names of these OIDs could be found under the UCD-SNMP-MIB::extNames (.1.3.6.1.4.1.2021.8.1.2) tree, with extOutput (.1.3.6.1.4.1.2021.8.1.101) providing the output (one line) from each check and extResult (.1.3.6.1.4.1.2021.8.1.100) the exit code from each check. The following table gives a rough mapping for that migration:

UCD OID name	UCD check name	SIPWISE-NGCP OID name
UCD-SNMP-MIB::extNames.1	collective_check	SIPWISE-NGCP-MONITOR-MIB::ngcpCollectiveCheckResult and SIPWISE-NGCP-MONITOR-MIB::ngcpCollectiveCheckOutput
UCD-SNMP-MIB::extNames.2	sip_check_sp1	SIPWISE-NGCP-MONITOR-MIB::sipResponsiveness.*
UCD-SNMP-MIB::extNames.3	sip_check_sp2	SIPWISE-NGCP-MONITOR-MIB::sipResponsiveness.*
UCD-SNMP-MIB::extNames.4	mysql_check_sp1	SIPWISE-NGCP-MONITOR-MIB::dbQueryRate.*
UCD-SNMP-MIB::extNames.5	mysql_check_sp2	SIPWISE-NGCP-MONITOR-MIB::dbQueryRate.*
UCD-SNMP-MIB::extNames.6	mysql_replication_check_sp1	SIPWISE-NGCP-MONITOR-MIB::dbReplDelay.*
UCD-SNMP-MIB::extNames.7	mysql_replication_check_sp2	SIPWISE-NGCP-MONITOR-MIB::dbReplDelay.*
UCD-SNMP-MIB::extNames.8	mpt_check_sp1	Obsolete
UCD-SNMP-MIB::extNames.9	mpt_check_sp2	Obsolete
UCD-SNMP-MIB::extNames.10	exim_queue_check_sp1	SIPWISE-NGCP-MONITOR-MIB::mailQueue.*
UCD-SNMP-MIB::extNames.11	exim_queue_check_sp2	SIPWISE-NGCP-MONITOR-MIB::mailQueue.*
UCD-SNMP-MIB::extNames.12	provisioned_subscribers_check_sp1	SIPWISE-NGCP-MONITOR-MIB::ngcpClusterProvSubs
UCD-SNMP-MIB::extNames.13	provisioned_subscribers_check_sp2	SIPWISE-NGCP-MONITOR-MIB::ngcpClusterProvSubs
UCD-SNMP-MIB::extNames.14	kam_dialog_active_check_sp1	SIPWISE-NGCP-MONITOR-MIB::sipDialogActive.*
UCD-SNMP-MIB::extNames.15	kam_dialog_active_check_sp2	SIPWISE-NGCP-MONITOR-MIB::sipDialogActive.*
UCD-SNMP-MIB::extNames.16	kam_dialog_early_check_sp1	SIPWISE-NGCP-MONITOR-MIB::sipEarlyMedia.*
UCD-SNMP-MIB::extNames.17	kam_dialog_early_check_sp2	SIPWISE-NGCP-MONITOR-MIB::sipEarlyMedia.*
UCD-SNMP-MIB::extNames.18	kam_dialog_type_local_check_sp1	SIPWISE-NGCP-MONITOR-MIB::sipDialogLocal.*
UCD-SNMP-MIB::extNames.19	kam_dialog_type_local_check_sp2	SIPWISE-NGCP-MONITOR-MIB::sipDialogLocal.*
UCD-SNMP-MIB::extNames.20	kam_dialog_type_relay_check_sp1	SIPWISE-NGCP-MONITOR-MIB::sipDialogRelay.*
UCD-SNMP-MIB::extNames.21	kam_dialog_type_relay_check_sp2	SIPWISE-NGCP-MONITOR-MIB::sipDialogRelay.*
UCD-SNMP-MIB::extNames.22	kam_dialog_type_incoming_check_sp1	SIPWISE-NGCP-MONITOR-MIB::sipDialogIncoming.*
UCD-SNMP-MIB::extNames.23	kam_dialog_type_incoming_check_sp2	SIPWISE-NGCP-MONITOR-MIB::sipDialogIncoming.*
UCD-SNMP-MIB::extNames.24	kam_dialog_type_outgoing_check_sp1	SIPWISE-NGCP-MONITOR-MIB::sipDialogOutgoing.*
UCD-SNMP-MIB::extNames.25	kam_dialog_type_outgoing_check_sp2	SIPWISE-NGCP-MONITOR-MIB::sipDialogOutgoing.*
UCD-SNMP-MIB::extNames.26	kam_usrloc_regusers_check_sp1	SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegSubs
UCD-SNMP-MIB::extNames.27	kam_usrloc_regusers_check_sp2	SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegSubs
UCD-SNMP-MIB::extNames.28	kam_usrloc_regdevices_check_sp1	SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegDevs
UCD-SNMP-MIB::extNames.29	kam_usrloc_regdevices_check_sp2	SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegDevs
UCD-SNMP-MIB::extNames.30	mysql_replication_discrepancies_check_sp1	SIPWISE-NGCP-MONITOR-MIB::dbReplDiff.*
UCD-SNMP-MIB::extNames.31	mysql_replication_discrepancies_check_sp2	SIPWISE-NGCP-MONITOR-MIB::dbReplDiff.*
UCD-SNMP-MIB::extNames.32	sip_check_self	SIPWISE-NGCP-MONITOR-MIB::sipResponsiveness.*
UCD-SNMP-MIB::extNames.33	mysql_check_self	SIPWISE-NGCP-MONITOR-MIB::dbQueryRate.*
UCD-SNMP-MIB::extNames.34	mysql_replication_check_self	SIPWISE-NGCP-MONITOR-MIB::dbReplDelay.*
UCD-SNMP-MIB::extNames.35	mpt_check_self	Obsolete
UCD-SNMP-MIB::extNames.36	exim_queue_check_self	SIPWISE-NGCP-MONITOR-MIB::mailQueue.*
UCD-SNMP-MIB::extNames.37	provisioned_subscribers_check_self	SIPWISE-NGCP-MONITOR-MIB::ngcpClusterProvSubs
UCD-SNMP-MIB::extNames.38	kam_dialog_active_check_self	SIPWISE-NGCP-MONITOR-MIB::sipDialogActive.*
UCD-SNMP-MIB::extNames.39	kam_dialog_early_check_self	SIPWISE-NGCP-MONITOR-MIB::sipEarlyMedia.*
UCD-SNMP-MIB::extNames.40	kam_dialog_type_local_check_self	SIPWISE-NGCP-MONITOR-MIB::sipDialogLocal.*
UCD-SNMP-MIB::extNames.41	kam_dialog_type_relay_check_self	SIPWISE-NGCP-MONITOR-MIB::sipDialogRelay.*
UCD-SNMP-MIB::extNames.42	kam_dialog_type_incoming_check_self	SIPWISE-NGCP-MONITOR-MIB::sipDialogIncoming.*
UCD-SNMP-MIB::extNames.43	kam_dialog_type_outgoing_check_self	SIPWISE-NGCP-MONITOR-MIB::sipDialogOutgoing.*
UCD-SNMP-MIB::extNames.44	kam_usrloc_regusers_check_self	SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegSubs
UCD-SNMP-MIB::extNames.45	kam_usrloc_regdevices_check_self	SIPWISE-NGCP-MONITOR-MIB::ngcpClusterRegDevs
UCD-SNMP-MIB::extNames.46	mysql_replication_discrepancies_check_self	SIPWISE-NGCP-MONITOR-MIB::dbReplDiff.*
UCD-SNMP-MIB::extNames.47	kam_dialog_type_local_check_prx0X	SIPWISE-NGCP-MONITOR-MIB::sipDialogLocal.*
UCD-SNMP-MIB::extNames.48	kam_dialog_type_relay_check_prx0X	SIPWISE-NGCP-MONITOR-MIB::sipDialogRelay.*
UCD-SNMP-MIB::extNames.49	kam_dialog_type_incoming_check_prx0X	SIPWISE-NGCP-MONITOR-MIB::sipDialogIncoming.*
UCD-SNMP-MIB::extNames.50	kam_dialog_type_outgoing_check_prx0X	SIPWISE-NGCP-MONITOR-MIB::sipDialogOutgoing.*
UCD-SNMP-MIB::extNames.51	kam_dialog_active_check_prx0X	SIPWISE-NGCP-MONITOR-MIB::sipDialogActive.*
UCD-SNMP-MIB::extNames.52	kam_dialog_early_check_prx0X	SIPWISE-NGCP-MONITOR-MIB::sipEarlyMedia.*
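Since the per-node variants of each legacy check (the _sp1, _sp2, _self and _prx0X suffixes) map to the same Sipwise OID, the table can be condensed into a lookup keyed on the check name without its suffix. The sketch below is one hypothetical way to script the migration; it uses the sipDialog* spelling found in the _self and _prx0X rows and returns an empty list for the checks marked as obsolete.

```python
import re
from typing import List

MIB = "SIPWISE-NGCP-MONITOR-MIB::"

# Legacy check name (node suffix removed) -> Sipwise OID names from the table.
LEGACY_TO_SIPWISE = {
    "collective_check": ["ngcpCollectiveCheckResult", "ngcpCollectiveCheckOutput"],
    "sip_check": ["sipResponsiveness.*"],
    "mysql_check": ["dbQueryRate.*"],
    "mysql_replication_check": ["dbReplDelay.*"],
    "mpt_check": [],  # marked as obsolete in the table
    "exim_queue_check": ["mailQueue.*"],
    "provisioned_subscribers_check": ["ngcpClusterProvSubs"],
    "kam_dialog_active_check": ["sipDialogActive.*"],
    "kam_dialog_early_check": ["sipEarlyMedia.*"],
    "kam_dialog_type_local_check": ["sipDialogLocal.*"],
    "kam_dialog_type_relay_check": ["sipDialogRelay.*"],
    "kam_dialog_type_incoming_check": ["sipDialogIncoming.*"],
    "kam_dialog_type_outgoing_check": ["sipDialogOutgoing.*"],
    "kam_usrloc_regusers_check": ["ngcpClusterRegSubs"],
    "kam_usrloc_regdevices_check": ["ngcpClusterRegDevs"],
    "mysql_replication_discrepancies_check": ["dbReplDiff.*"],
}

# Node suffixes used by the legacy check names.
_NODE_SUFFIX = re.compile(r"_(sp[12]|self|prx\w+)$")


def sipwise_oids(check_name: str) -> List[str]:
    """Map a legacy UCD check name to its Sipwise OID names."""
    base = _NODE_SUFFIX.sub("", check_name)
    return [MIB + name for name in LEGACY_TO_SIPWISE[base]]


if __name__ == "__main__":
    print(sipwise_oids("mysql_replication_check_sp2"))
```

An unknown check name raises a KeyError, which makes unmapped legacy checks easy to spot during a migration.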