15. Monitoring and Alerting

15.1. Internal Monitoring

15.1.1. Service monitoring

The platform uses both the systemd and monit daemons to monitor all essential services. Since Sipwise C5 runs in an active/standby mode, not all services are always running on both nodes: some of them run only on the active node and are stopped on the standby node. The following commands show the most critical services on the platform:

* ngcp-service summary - to get the list of services and their current status,
* systemctl status - to get a tree of the services running,
* systemctl list-units - to get a list of the service states,
* monit summary - to get the list of services known to monit and their current status,
* monit status - to get the list of services known to monit with detailed status.

important	When you perform a stop/start/monitor/unmonitor operation on a service, monit also affects the services that depend on it. Hence, if you stop or unmonitor a service, all services that depend on it will be stopped or unmonitored as well.

For example, the monit stop mysql operation will stop kamailio, sbc, asterisk, prosody and some other services. The recommended way to operate on services, however, is via the ngcp-service wrapper, which abstracts the underlying process monitoring implementation.
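The cascading effect described above can be pictured as a walk over a dependency graph. The following Python sketch is purely illustrative and does not reflect how monit is implemented: the `DEPENDENTS` map is hypothetical and lists only the services named in the example above, and `affected_services` is a made-up helper name.

```python
from collections import deque

# Hypothetical dependency map: service -> services that depend on it.
# Only the services named in the text are listed; a real system has more.
DEPENDENTS = {
    "mysql": ["kamailio", "sbc", "asterisk", "prosody"],
}


def affected_services(service, dependents=DEPENDENTS):
    """Return the service plus everything that transitively depends on it."""
    seen = [service]
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    # Services that would be touched by stopping mysql in this toy model.
    print(affected_services("mysql"))
```

A service with no dependents, such as prosody in this toy map, only affects itself.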

If any service ever fails for whatever reason, either the systemd or the monit daemon will quickly restart it. When that happens, the daemon will send a notification email to the address specified in the config.yml file under the general.adminmail key. It will also send warning emails to this address under certain abnormal conditions, such as high memory consumption (more than 75% in use) or high CPU load.

important	In order for monit to be able to send emails to the specified address, the local MTA (exim4) must be configured correctly. If you have not done so already, run `dpkg-reconfigure exim4-config` to do this. The CE edition’s handbook contains more information about this in the Installation chapter.

15.1.2. System monitoring via Telegraf

The platform uses the internal telegraf service to monitor many aspects of the system, including CPU, memory, swap, disk, filesystem, network, processes, NTP, Nginx, Redis and MySQL.

The gathered information is stored in InfluxDB, in the telegraf database.

15.1.3. Sipwise C5 specific monitoring via ngcp-witnessd

The platform uses the internal ngcp-witnessd service to monitor Sipwise C5 specific metrics or system metrics currently not tracked by telegraf, including memory, process count, Heartbeat, MTA, Kamailio, SIP and MySQL.

The gathered information is stored in InfluxDB, in the ngcp database.

15.1.4. Monitoring data in InfluxDB

The platform uses InfluxDB as a time series database, to store most of the metrics collected in the system.

On Sipwise C5, each node stores its own metrics and those of its peer node, and the management nodes store the metrics for all the nodes in the cluster. This is done via influxdb-relay, which listens for InfluxDB writes and multiplexes them to the local node and to any other node that needs them.

The monitoring data is used by various components of the platform, including ngcp-collective-check, ngcp-snmp-agent and by the statistics dashboard powered by Grafana.

The monitoring data can also be accessed directly by various means: by using the influx command-line tool in CLI or TUI modes; by using the ngcp-influxdb-extract wrapper, which provides two convenience commands to run arbitrary queries or to fetch the last value for a measurement’s field; or by using the HTTP API with curl (or other HTTP fetchers) or with the Sipwise::InfluxDB::HTTP Perl module.

See https://docs.influxdata.com/influxdb/v1.1/query_language/spec/ for information about InfluxQL, the query language used by InfluxDB.

tip	To get the list of all measurements for a specific database, use the query `SHOW MEASUREMENTS`.

tip	To get the list of fields for a specific measurement, use the query `SELECT LAST(*) FROM "measurement"`.

tip	To get the list of tags for a specific measurement, use the query `SHOW TAG KEYS FROM "measurement"`, and to get all the current values for a tag, use `SHOW TAG VALUES FROM "measurement" WITH KEY = "tag"`.
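As a rough illustration of the HTTP API route, the Python sketch below builds the InfluxQL statements from the tips and wraps them in a URL for the InfluxDB 1.x `/query` endpoint. The base address `http://localhost:8086`, the helper names, and the `cpu` and `host` placeholders are assumptions made for this example; the address actually used on the platform, and any authentication it requires, may differ.

```python
from urllib.parse import urlencode


def quote_ident(name):
    """Double-quote an InfluxQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '\\"') + '"'


def show_tag_keys(measurement):
    """Statement listing the tag keys of one measurement."""
    return f"SHOW TAG KEYS FROM {quote_ident(measurement)}"


def show_tag_values(measurement, tag):
    """Statement listing the current values of one tag."""
    return (
        f"SHOW TAG VALUES FROM {quote_ident(measurement)} "
        f"WITH KEY = {quote_ident(tag)}"
    )


def query_url(database, query, base="http://localhost:8086"):
    """Build a GET URL for the /query endpoint (base address is assumed)."""
    return f"{base}/query?" + urlencode({"db": database, "q": query})


if __name__ == "__main__":
    # "cpu" and "host" are placeholder names, not taken from the handbook.
    print(query_url("telegraf", show_tag_values("cpu", "host")))
    print(query_url("ngcp", "SHOW MEASUREMENTS"))
```

The resulting URL can be passed to curl or any other HTTP client; the response is a JSON document.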

See Section 2.1, “InfluxDB monitoring keys” for detailed information about the list of data currently stored in the InfluxDB ngcp monitoring database.

15.2. Statistics Dashboard

The platform’s administration interface (described in Section 4, “VoIP Service Configuration Scenario”) provides a Grafana-based graphical overview of the most important system health indicators, such as memory usage, load averages and disk usage. VoIP statistics, such as the number of concurrent active calls and the number of provisioned and registered subscribers, are also present.

15.3. External Monitoring Using SNMP

15.3.1. Overview and Initial Setup

The Sipwise C5 exports a variety of cluster health data and statistics over the standard SNMP interface. By default, the SNMP interface can only be accessed locally. To make the SNMP data available to an external system, edit the config.yml file and populate the list of allowed community names and allowed hosts/IP ranges. This list can be found under the checktools.snmpd.communities key and consists of one or more community/source value pairs. The community is the allowed community name, while the source is an IP address or an IP block from which requests are allowed.

SNMP notifications can be configured in a similar way, to send them to an external system, by populating the checktools.snmpd.trap_communities key with community/target value pairs. The community is the value that will be used when sending the trap, while the target is the IP address to which the trap is sent.

The public entries with the localhost source and target are used for local testing of SNMP functionality. It is recommended that you leave these entries in place. Other legal sources can be formed as single IP addresses or IP blocks in IP/prefix notation, for example 192.168.115.0/24. Other targets can be formed as single IP addresses.
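To make the source notation concrete, the following Python sketch checks whether a client address falls inside a configured source. It only illustrates the single-address and IP/prefix forms described above and is not how snmpd itself evaluates access; the function name and the special handling of `localhost` are assumptions made for this example.

```python
import ipaddress


def source_allows(source, client_ip):
    """Return True if client_ip falls within the configured source."""
    client = ipaddress.ip_address(client_ip)
    if source == "localhost":
        # Assumption: treat "localhost" as any loopback address.
        return client.is_loopback
    # strict=False accepts both single addresses and IP/prefix blocks.
    network = ipaddress.ip_network(source, strict=False)
    return client in network


if __name__ == "__main__":
    print(source_allows("192.168.115.0/24", "192.168.115.7"))
    print(source_allows("192.168.115.0/24", "192.168.116.7"))
```

A check like this can help validate a planned `communities` list against the address of the monitoring system before the configuration is applied.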

tip	To locally check if SNMP is working correctly, execute the command `snmpwalk -v2c -cpublic localhost .` (note the trailing dot). This will generate a long list of raw SNMP OIDs and their values, provided that the `default` SNMP community key has been left in place.

tip	To locally check if SNMP notifications (or traps) are working correctly, install the snmptrapd package, which will be configured by default to catch the traps sent by the localhost SNMP agent. The traps will show up in `/var/log/daemon.log`, and a couple of traps can be generated simply by running `service snmpd restart`.

INFO: SNMP version 1 and version 2c are supported.

15.3.2. Details

There are two types of information that can be retrieved from SNMP. The first one is the native Sipwise C5 cluster overview from Sipwise C5 MIBs (Management Information Bases). The second is the legacy ad-hoc information using the Net-SNMP extension OIDs, and detailed information for the node running the SNMP daemon using standard OIDs (Object Identifiers).

15.3.2.1. Sipwise C5 OIDs

The entire Sipwise C5 cluster can be monitored by using the SIPWISE-NGCP-MIB, SIPWISE-NGCP-MONITOR-MIB and SIPWISE-NGCP-STATS-MIB. These OIDs are rooted at Sipwise C5 slot .1.3.6.1.4.1.34274.1.*.

The MIBs are self-documented, and can be found as part of the ngcp-snmp-mibs package (running dpkg -S SIPWISE*MIB will list their pathnames). The Sipwise C5 SNMP Agent is a part of the ngcp-snmp-agent package, which is installed by default and works out-of-the-box as long as the snmpd has been properly configured.

The SIPWISE-NGCP-MIB acts as the root MIB and provides information about the cluster licensing and layout (which is mostly static data about each node, such as node name, its IP address, its roles, etc.) and information required to access the OIDs from the other MIBs.

The SIPWISE-NGCP-MONITOR-MIB provides current monitoring information, global health conditions, the number of provisioned and registered subscribers and devices. It also provides per node information (independently of the number of nodes or their names) on their filesystem, processes, databases, system load, memory, heartbeat status, MTA queues, etc.

The SIPWISE-NGCP-STATS-MIB provides accumulated statistics on billing, performance and processed SIP messages.

NOTICE: OIDs under the following trees are not yet implemented: ngcpMonitorFraud, ngcpMonitorPerformance.perfCAPSCurTable and ngcpStats.

INFO: The Sipwise C5 SNMP Agent uses Redis and InfluxDB as data sources. This data is essential for accurate and complete monitoring data in the SNMP OID tree. In addition, the Redis database must be available on a shared IP address, so that ngcp-witnessd can always write to it.

15.3.2.2. Legacy OIDs

info	The following OIDs have been superseded by Sipwise C5 OIDs, but they are still provided for backwards compatibility.

All basic system health variables (such as memory, disk, swap, CPU usage, network statistics, process lists, etc.) for the mgmt node can be found in standard OID slots from standard MIBs. For example, memory statistics can be found through the UCD-SNMP-MIB in OIDs such as memTotalSwap.0, memAvailSwap.0, memTotalReal.0, memAvailReal.0, etc., which translate to numeric OIDs .1.3.6.1.4.1.2021.4.*. In fact, UCD-SNMP-MIB is the most useful MIB for overall system health checks.
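As a small worked example, the Python sketch below derives a memory usage percentage from two of the UCD-SNMP-MIB memory OIDs mentioned above. It assumes the values have already been fetched with an SNMP client and are passed in as a mapping from numeric OID to value; the function name and the sample figures are made up. The simple ratio ignores buffers and cache, so it will not necessarily match the figures reported by other tools.

```python
# Numeric OIDs from the UCD-SNMP-MIB memory group (.1.3.6.1.4.1.2021.4).
MEM_TOTAL_REAL = ".1.3.6.1.4.1.2021.4.5.0"  # memTotalReal.0
MEM_AVAIL_REAL = ".1.3.6.1.4.1.2021.4.6.0"  # memAvailReal.0


def memory_used_percent(values):
    """Compute used real memory in percent from previously fetched values."""
    total = int(values[MEM_TOTAL_REAL])
    avail = int(values[MEM_AVAIL_REAL])
    if total <= 0:
        raise ValueError("memTotalReal must be positive")
    return 100.0 * (total - avail) / total


if __name__ == "__main__":
    # Sample figures invented for illustration only.
    sample = {MEM_TOTAL_REAL: 8000000, MEM_AVAIL_REAL: 2000000}
    print(memory_used_percent(sample))
```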

Additionally, there’s a list of specially monitored processes, also found through the UCD-SNMP-MIB. UCD-SNMP-MIB::prNames (.1.3.6.1.4.1.2021.2.1.2) gives the list of monitored processes, prCount (.1.3.6.1.4.1.2021.2.1.5) is how many of each process are running and prErrorFlag (.1.3.6.1.4.1.2021.2.1.100) gives a 0/1 error indication (with prErrMessage (.1.3.6.1.4.1.2021.2.1.101) providing an explanation of any error).

tip	Some of these processes are not supposed to be running on the standby node, so you’ll see the error flag raised there. A possible solution is to run these SNMP checks against the shared service IP of the cluster.
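One way to evaluate the process table is to join its columns by table index and report only the rows whose error flag is raised. The Python sketch below assumes the four columns (prNames, prCount, prErrorFlag, prErrMessage) have already been retrieved and are given as mappings from table index to value; the helper name and the sample data are hypothetical.

```python
def failing_processes(names, counts, error_flags, messages):
    """Return (name, count, message) for every row with a raised error flag.

    Each argument maps a table index to the value of the matching column:
    prNames, prCount, prErrorFlag and prErrMessage respectively.
    """
    failures = []
    for index in sorted(names):
        if int(error_flags.get(index, 0)) != 0:
            failures.append(
                (names[index], int(counts.get(index, 0)), messages.get(index, ""))
            )
    return failures


if __name__ == "__main__":
    # Sample rows invented for illustration only.
    print(
        failing_processes(
            names={1: "kamailio", 2: "asterisk"},
            counts={1: 1, 2: 0},
            error_flags={1: 0, 2: 1},
            messages={2: "No asterisk process running"},
        )
    )
```

When run against the standby node, such a report is expected to list the processes that only run on the active node.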

Furthermore, UCD-SNMP-MIB provides a list of custom external checks. The names of these can be found under the UCD-SNMP-MIB::extNames (.2) tree, with extOutput (.101) providing the output (one line) from each check and extResult (.100) the exit code from each check.

The first of these external checks, called collective_check, provides a combined overall system health status indicator. It gathers information from both nodes and returns 0 in extResult.1 (.100.1) if everything is OK and running as it should. If it finds a problem somewhere, but with the system still operational (e.g. a service is stopped on the inactive node), extResult.1 will return 1 and extOutput.1 will be set to a string that can be used to diagnose the problem. If the system is found in a critical and non-operational state, extResult.1 will return 2, again with an error message set. If you want to keep it really simple, you can monitor just this one OID and raise an alarm whenever it becomes non-zero.

INFO: The 0/1/2 status codes allow for easy integration with Nagios.
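The mapping onto Nagios plugin states can be sketched as follows. The Python function below takes the extResult.1 and extOutput.1 values of collective_check, assumed to have been fetched already, and returns a Nagios exit code with a status line. The function name is hypothetical, and mapping any unexpected value to UNKNOWN (3) is a choice made for this example rather than something the platform defines.

```python
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def nagios_result(ext_result, ext_output=""):
    """Translate collective_check values into (exit_code, status_line)."""
    # Anything outside 0/1/2 is reported as UNKNOWN (Nagios exit code 3).
    code = ext_result if ext_result in NAGIOS_STATES else 3
    state = NAGIOS_STATES.get(code, "UNKNOWN")
    message = f"{state}: {ext_output}" if ext_output else state
    return code, message


if __name__ == "__main__":
    code, line = nagios_result(1, "service stopped on inactive node")
    print(code, line)
```

A wrapper script would print the status line and exit with the returned code.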

The remaining external checks simply return statistics on the system. They all return a number in extOutput and always have extResult set to zero.

The full list of such checks is below. All of these checks have three modes: the first returns the statistics from sp1 (the first node in the Sipwise C5 pair), the second from sp2, and the third from whichever node is being queried (which is useful when querying the shared service IP). For example, the local SIP response time from sp1 is in sip_check_sp1, from sp2 in sip_check_sp2, and from the host itself in sip_check_self.

The base OID of the Result and Output OIDs is always .1.3.6.1.4.1.2021.8.1, so if you read .100.1, the full OID is .1.3.6.1.4.1.2021.8.1.100.1.
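The OID arithmetic above can be expressed as a short helper. This Python sketch joins the base OID with a relative suffix and derives the name, result and output OIDs for a given check index; the function names are hypothetical, while the `.2`, `.100` and `.101` column numbers are the ones given in this section.

```python
EXT_BASE = ".1.3.6.1.4.1.2021.8.1"


def full_oid(suffix, base=EXT_BASE):
    """Join the base OID and a relative suffix such as '.100.1'."""
    return base + "." + suffix.strip(".")


def check_oids(index):
    """Return the extNames, extResult and extOutput OIDs for one check."""
    return {
        "name": full_oid(f"2.{index}"),
        "result": full_oid(f"100.{index}"),
        "output": full_oid(f"101.{index}"),
    }


if __name__ == "__main__":
    print(full_oid(".100.1"))
    print(check_oids(32)["output"])
```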

Name in MIB	Result OID	Output OID	Name	Description
UCD-SNMP-MIB::extNames.1	.100.1	.101.1	collective_check	Summarized platform check
UCD-SNMP-MIB::extNames.2	.100.2	.101.2	sip_check_sp1	SIP response time in seconds on sp1
UCD-SNMP-MIB::extNames.3	.100.3	.101.3	sip_check_sp2	SIP response time in seconds on sp2
UCD-SNMP-MIB::extNames.4	.100.4	.101.4	mysql_check_sp1	Average number of MySQL queries per second on sp1
UCD-SNMP-MIB::extNames.5	.100.5	.101.5	mysql_check_sp2	Average number of MySQL queries per second on sp2
UCD-SNMP-MIB::extNames.6	.100.6	.101.6	mysql_replication_check_sp1	MySQL replication delay in seconds on sp1
UCD-SNMP-MIB::extNames.7	.100.7	.101.7	mysql_replication_check_sp2	MySQL replication delay in seconds on sp2
UCD-SNMP-MIB::extNames.8	.100.8	.101.8	mpt_check_sp1	RAID status on sp1
UCD-SNMP-MIB::extNames.9	.100.9	.101.9	mpt_check_sp2	RAID status on sp2
UCD-SNMP-MIB::extNames.10	.100.10	.101.10	exim_queue_check_sp1	Number of mails undelivered in MTA queue on sp1
UCD-SNMP-MIB::extNames.11	.100.11	.101.11	exim_queue_check_sp2	Number of mails undelivered in MTA queue on sp2
UCD-SNMP-MIB::extNames.12	.100.12	.101.12	provisioned_subscribers_check_sp1	Number of subscribers provisioned on sp1
UCD-SNMP-MIB::extNames.13	.100.13	.101.13	provisioned_subscribers_check_sp2	Number of subscribers provisioned on sp2
UCD-SNMP-MIB::extNames.14	.100.14	.101.14	kam_dialog_active_check_sp1	Number of active calls on sp1
UCD-SNMP-MIB::extNames.15	.100.15	.101.15	kam_dialog_active_check_sp2	Number of active calls on sp2
UCD-SNMP-MIB::extNames.16	.100.16	.101.16	kam_dialog_early_check_sp1	Number of calls in Early Media state on sp1
UCD-SNMP-MIB::extNames.17	.100.17	.101.17	kam_dialog_early_check_sp2	Number of calls in Early Media state on sp2
UCD-SNMP-MIB::extNames.18	.100.18	.101.18	kam_dialog_type_local_check_sp1	Number of active local calls on sp1
UCD-SNMP-MIB::extNames.19	.100.19	.101.19	kam_dialog_type_local_check_sp2	Number of active local calls on sp2
UCD-SNMP-MIB::extNames.20	.100.20	.101.20	kam_dialog_type_relay_check_sp1	Number of active calls routed via peers on sp1
UCD-SNMP-MIB::extNames.21	.100.21	.101.21	kam_dialog_type_relay_check_sp2	Number of active calls routed via peers on sp2
UCD-SNMP-MIB::extNames.22	.100.22	.101.22	kam_dialog_type_incoming_check_sp1	Number of incoming calls on sp1
UCD-SNMP-MIB::extNames.23	.100.23	.101.23	kam_dialog_type_incoming_check_sp2	Number of incoming calls on sp2
UCD-SNMP-MIB::extNames.24	.100.24	.101.24	kam_dialog_type_outgoing_check_sp1	Number of outgoing calls on sp1
UCD-SNMP-MIB::extNames.25	.100.25	.101.25	kam_dialog_type_outgoing_check_sp2	Number of outgoing calls on sp2
UCD-SNMP-MIB::extNames.26	.100.26	.101.26	kam_usrloc_regusers_check_sp1	Number of subscribers with at least one active registration on sp1
UCD-SNMP-MIB::extNames.27	.100.27	.101.27	kam_usrloc_regusers_check_sp2	Number of subscribers with at least one active registration on sp2
UCD-SNMP-MIB::extNames.28	.100.28	.101.28	kam_usrloc_regdevices_check_sp1	Total number of registered end devices on sp1
UCD-SNMP-MIB::extNames.29	.100.29	.101.29	kam_usrloc_regdevices_check_sp2	Total number of registered end devices on sp2
UCD-SNMP-MIB::extNames.30	.100.30	.101.30	mysql_replication_discrepancies_check_sp1	Number of MySQL tables not in sync between sp1 and sp2
UCD-SNMP-MIB::extNames.31	.100.31	.101.31	mysql_replication_discrepancies_check_sp2	Number of MySQL tables not in sync between sp1 and sp2
UCD-SNMP-MIB::extNames.32	.100.32	.101.32	sip_check_self	SIP response time in seconds on active node
UCD-SNMP-MIB::extNames.33	.100.33	.101.33	mysql_check_self	Average number of MySQL queries per second on active node
UCD-SNMP-MIB::extNames.34	.100.34	.101.34	mysql_replication_check_self	MySQL replication delay in seconds on active node
UCD-SNMP-MIB::extNames.35	.100.35	.101.35	mpt_check_self	RAID status on active node
UCD-SNMP-MIB::extNames.36	.100.36	.101.36	exim_queue_check_self	Number of mails undelivered in MTA queue on active node
UCD-SNMP-MIB::extNames.37	.100.37	.101.37	provisioned_subscribers_check_self	Number of subscribers provisioned on active node
UCD-SNMP-MIB::extNames.44	.100.44	.101.44	kam_usrloc_regusers_check_self	Number of subscribers with at least one active registration on active node
UCD-SNMP-MIB::extNames.45	.100.45	.101.45	kam_usrloc_regdevices_check_self	Total number of registered end devices on active node
UCD-SNMP-MIB::extNames.46	.100.46	.101.46	mysql_replication_discrepancies_check_self	Number of MySQL tables not in sync between sp1 and sp2
UCD-SNMP-MIB::extNames.47	.100.47	.101.47	kam_dialog_type_local_check_prx0X	Number of active local calls on active proxy X
UCD-SNMP-MIB::extNames.48	.100.48	.101.48	kam_dialog_type_relay_check_prx0X	Number of active calls routed via peers on active proxy X
UCD-SNMP-MIB::extNames.49	.100.49	.101.49	kam_dialog_type_incoming_check_prx0X	Number of incoming calls on active proxy X
UCD-SNMP-MIB::extNames.50	.100.50	.101.50	kam_dialog_type_outgoing_check_prx0X	Number of outgoing calls on active proxy X
UCD-SNMP-MIB::extNames.51	.100.51	.101.51	kam_dialog_active_check_prx0X	Number of active calls on active proxy X
UCD-SNMP-MIB::extNames.52	.100.52	.101.52	kam_dialog_early_check_prx0X	Number of calls in Early Media state on active proxy X

tip	Some of the checks can be disabled (most are enabled by default) through the `config.yml` file, and those checks will then return an error message or an empty string in their `extOutput`. Enable those checks in the config file to get their output in the SNMP OID tree. The enable/disable flags can be found in the `checktools` section.