16. Monitoring and Alerting

16.1. Internal Monitoring
16.2. Statistics Dashboard
16.3. External Monitoring Using SNMP
16.3.1. Overview and Initial Setup
16.3.2. Details

16.1. Internal Monitoring

The platform uses the monit daemon internally to monitor all essential services. Since the sip:provider PRO runs in active/standby mode, not all services run on both nodes at all times: some run only on the active node and are stopped on the standby node. At any time, you can use the command monit summary to get a list of all services and their current status, or monit status for the same list with more detail.

[Important]

The sip:provider PRO has used monit service dependencies since mr3.5.1. Services specified in a depend statement are checked during stop/start/monitor/unmonitor operations. If a service is stopped or unmonitored, monit also stops or unmonitors every service that depends on it. This means, for example, that kamailio/sbc/asterisk/prosody/… will be stopped by a monit stop mysql operation.

The monit daemon takes care of quickly restarting a service should it ever fail for whatever reason. When that happens, the daemon sends a notification email to the address specified in the config.yml file under the key general.adminmail. It also sends warning emails to this address under certain abnormal conditions, such as when the system is low on memory (> 75% used) or under high load.

[Important]

In order for monit to be able to send email to the specified address, the local MTA (exim4) must be configured correctly. If you haven’t done so already, run dpkg-reconfigure exim4-config to do this. The CE edition’s handbook contains more information about this in the Installation chapter.

16.2. Statistics Dashboard

The platform’s administration interface (described in Section 5, “Administrative Configuration”) provides a simple graphical overview of the most important system health data points, such as memory usage, load averages and disk usage, as well as statistics about the VoIP system itself, such as the number of concurrent active calls, number of provisioned and registered subscribers, etc.

16.3. External Monitoring Using SNMP

16.3.1. Overview and Initial Setup

The sip:provider PRO exports a variety of system health data and statistics over standard SNMP. By default, the SNMP interface can only be accessed locally. To make it possible to poll the SNMP data from an external system, the config.yml file needs to be edited and the list of allowed community names and allowed hosts/IP ranges must be populated. This list can be found under the checktools.snmpd.communities key and consists of one or more community/source value pairs. The community is the SNMP community string to be allowed, while source is the IP address or IP block to allow this community from. A source of default equals the IP address 127.0.0.1. Other legal values are single IP addresses or IP blocks in IP/prefix notation, for example 192.168.115.0/24. It is recommended that you leave the default entry (public and default) in place for local testing of SNMP functionality.
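The community/source matching rules described above can be illustrated with a short Python sketch. It is not how the SNMP agent enforces access; it only mirrors the documented semantics (a source of default equals 127.0.0.1, other values are single addresses or IP/prefix blocks). The function names and the monitoring community entry are hypothetical.

```python
import ipaddress


def source_allows(source, client_ip):
    """Return True if client_ip falls within a configured source value."""
    # Per the text above, the keyword "default" equals 127.0.0.1.
    if source == "default":
        source = "127.0.0.1"
    # ip_network() accepts single addresses as well as IP/prefix blocks.
    network = ipaddress.ip_network(source, strict=False)
    return ipaddress.ip_address(client_ip) in network


def community_allowed(entries, community, client_ip):
    """Check a (community, client IP) pair against community/source pairs."""
    return any(
        entry["community"] == community
        and source_allows(entry["source"], client_ip)
        for entry in entries
    )


# Hypothetical entries: the default pair plus one external poller network.
entries = [
    {"community": "public", "source": "default"},
    {"community": "monitoring", "source": "192.168.115.0/24"},
]

print(community_allowed(entries, "monitoring", "192.168.115.7"))
```

With these example entries, the public community is accepted only from the local host, while monitoring is accepted from any address in 192.168.115.0/24.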

[Tip]

To locally check if SNMP is working correctly, execute the command snmpwalk -v2c -cpublic localhost . (note the trailing dot), assuming the default SNMP community entry has been left in place. This will generate a long list of raw SNMP OIDs and their values.

[Tip]

SNMP version 1 and version 2c are supported.

16.3.2. Details

All basic system health variables (such as memory, disk, swap, CPU usage, network statistics, process lists, etc.) can be found in standard OID slots from standard MIBs. For example, memory statistics can be found through the UCD-SNMP-MIB in OIDs such as memTotalSwap.0, memAvailSwap.0, memTotalReal.0, memAvailReal.0, etc., which translate to numeric OIDs .1.3.6.1.4.1.2021.4.*. In fact, UCD-SNMP-MIB is the most useful MIB for overall system health checks.
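As a rough illustration of how these memory OIDs might be used by an external poller, the following Python sketch derives a usage percentage from memTotalReal.0 and memAvailReal.0 (both reported in kB by the UCD-SNMP-MIB). The function name is hypothetical, and what the agent counts as "available" depends on the agent version, so treat the result as an approximation rather than the figure monit uses internally.

```python
def real_memory_used_percent(mem_total_real_kb, mem_avail_real_kb):
    """Approximate percentage of real memory in use.

    Inputs are the integer values of memTotalReal.0 and memAvailReal.0.
    """
    if mem_total_real_kb <= 0:
        raise ValueError("memTotalReal must be positive")
    used_kb = mem_total_real_kb - mem_avail_real_kb
    return 100.0 * used_kb / mem_total_real_kb


# Illustrative values only: 8 GB total, 2 GB available.
print(real_memory_used_percent(8000000, 2000000))
```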

Additionally, there’s a list of specially monitored processes, also found through the UCD-SNMP-MIB. UCD-SNMP-MIB::prNames (.1.3.6.1.4.1.2021.2.1.2) gives the list of monitored processes, prCount (.1.3.6.1.4.1.2021.2.1.5) is how many of each process are running and prErrorFlag (.1.3.6.1.4.1.2021.2.1.100) gives a 0/1 error indication (with prErrMessage (.1.3.6.1.4.1.2021.2.1.101) providing an explanation of any error).
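The process table can be evaluated by a small script once it has been walked. The Python sketch below parses text in the usual snmpwalk output style and lists the processes whose prErrorFlag is raised. The sample text is illustrative and not captured from a real system, and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as "UCD-SNMP-MIB::prNames.1 = STRING: kamailio".
LINE_RE = re.compile(r"^UCD-SNMP-MIB::(pr\w+)\.(\d+) = [\w-]+: ?(.*)$")


def parse_process_table(walk_output):
    """Group prNames/prCount/prErrorFlag/prErrMessage values by row index."""
    rows = defaultdict(dict)
    for line in walk_output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            column, index, value = match.groups()
            rows[int(index)][column] = value.strip().strip('"')
    return dict(rows)


def error_flag(value):
    """Accept both a bare "1" and an enum-style "error(1)"."""
    match = re.search(r"(\d+)\)?$", value)
    return int(match.group(1)) if match else 0


def failed_processes(rows):
    """Return (name, message) for every row whose prErrorFlag is non-zero."""
    return [
        (row.get("prNames", "?"), row.get("prErrMessage", ""))
        for _, row in sorted(rows.items())
        if error_flag(row.get("prErrorFlag", "0")) != 0
    ]


# Illustrative sample only, not taken from a real installation.
SAMPLE_WALK = """\
UCD-SNMP-MIB::prNames.1 = STRING: kamailio
UCD-SNMP-MIB::prNames.2 = STRING: asterisk
UCD-SNMP-MIB::prCount.1 = INTEGER: 1
UCD-SNMP-MIB::prCount.2 = INTEGER: 0
UCD-SNMP-MIB::prErrorFlag.1 = INTEGER: noError(0)
UCD-SNMP-MIB::prErrorFlag.2 = INTEGER: error(1)
UCD-SNMP-MIB::prErrMessage.1 = STRING:
UCD-SNMP-MIB::prErrMessage.2 = STRING: No asterisk process running
"""

print(failed_processes(parse_process_table(SAMPLE_WALK)))
```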

[Tip]

Some of these processes are not supposed to be running on the standby node, so you’ll see the error flag raised there. A possible solution is to run these SNMP checks against the shared service IP of the cluster. See Section 2.2, “High Availability and Fail-Over” for more information.

Furthermore, UCD-SNMP-MIB provides a list of custom, external checks. The names of these can be found under the UCD-SNMP-MIB::extNames (.2) tree, with extOutput (.101) providing the output (one line) from each check and extResult (.100) the exit code from each check.

The first of these external checks, collective_check, provides a combined overall system health status indicator. It gathers information from both nodes and returns 0 in extResult.1 (.100.1) if everything is OK and running as it should. If it finds a problem somewhere, but the system is still operational (e.g. a service is stopped on the inactive node), extResult.1 will return 1 and extOutput.1 will be set to a string that can be used to diagnose the problem. If the system is found in a critical and non-operational state, extResult.1 will return 2, again with an error message set. If you want to keep it really simple, you can just monitor this one OID and raise an alarm whenever it becomes non-zero.

[Tip]

The 0/1/2 status codes allow for easy integration with Nagios.
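One way to wire the 0/1/2 codes of collective_check into a Nagios-style plugin is sketched below in Python. Fetching extResult.1 and extOutput.1 over SNMP is left out. The function name nagios_result and the COLLECTIVE_CHECK label are hypothetical, and mapping any other value to the Nagios UNKNOWN state (3) is an assumption of this sketch.

```python
# Standard Nagios plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}
NAGIOS_UNKNOWN = 3


def nagios_result(ext_result, ext_output=""):
    """Translate extResult.1/extOutput.1 into (exit code, status line)."""
    if ext_result in NAGIOS_STATES:
        code, label = ext_result, NAGIOS_STATES[ext_result]
    else:
        # Unexpected value: report UNKNOWN instead of guessing a severity.
        code, label = NAGIOS_UNKNOWN, "UNKNOWN"
    message = f"COLLECTIVE_CHECK {label}"
    if ext_output:
        message += f" - {ext_output}"
    return code, message


# Illustrative values; a real plugin would print the message and
# exit with the returned code.
code, message = nagios_result(1, "service stopped on inactive node")
print(code, message)
```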

The remaining external checks simply return statistics about the system: they all return a number in extOutput and always have extResult set to zero.

The full list of such checks is below. All of these checks exist in three flavors: the first returns the statistics from sp1 (the first node in the sip:provider PRO pair), the second from sp2, and the third from whichever node is being queried (which is useful when querying the shared service IP). For example, the local SIP response time from sp1 is in sip_check_sp1, from sp2 is in sip_check_sp2 and from the host itself in sip_check_self.

The base OID of the Result and Output OIDs is always .1.3.6.1.4.1.2021.8.1, so a table entry of .100.1 corresponds to the full OID .1.3.6.1.4.1.2021.8.1.100.1.
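The OID arithmetic is simple enough to script. The following Python sketch builds the full name, result and output OIDs for a given check index from the base OID and the .2/.100/.101 sub-trees mentioned above. The helper name ext_oids is hypothetical.

```python
# Base OID of the external check table, as stated above.
EXT_TABLE_BASE = ".1.3.6.1.4.1.2021.8.1"

# Column sub-identifiers: extNames (.2), extResult (.100), extOutput (.101).
EXT_NAMES, EXT_RESULT, EXT_OUTPUT = 2, 100, 101


def ext_oids(index):
    """Return the full numeric OIDs for the external check at this index."""
    if index < 1:
        raise ValueError("check index starts at 1")
    return {
        "name": f"{EXT_TABLE_BASE}.{EXT_NAMES}.{index}",
        "result": f"{EXT_TABLE_BASE}.{EXT_RESULT}.{index}",
        "output": f"{EXT_TABLE_BASE}.{EXT_OUTPUT}.{index}",
    }


# Index 1 is collective_check.
print(ext_oids(1)["result"])
```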

Name in MIB | Result OID | Output OID | Check name | Description
UCD-SNMP-MIB::extNames.1 | .100.1 | .101.1 | collective_check | Summarized platform check
UCD-SNMP-MIB::extNames.2 | .100.2 | .101.2 | sip_check_sp1 | SIP response time in seconds on sp1
UCD-SNMP-MIB::extNames.3 | .100.3 | .101.3 | sip_check_sp2 | SIP response time in seconds on sp2
UCD-SNMP-MIB::extNames.4 | .100.4 | .101.4 | mysql_check_sp1 | Average number of MySQL queries per second on sp1
UCD-SNMP-MIB::extNames.5 | .100.5 | .101.5 | mysql_check_sp2 | Average number of MySQL queries per second on sp2
UCD-SNMP-MIB::extNames.6 | .100.6 | .101.6 | mysql_replication_check_sp1 | MySQL replication delay in seconds on sp1
UCD-SNMP-MIB::extNames.7 | .100.7 | .101.7 | mysql_replication_check_sp2 | MySQL replication delay in seconds on sp2
UCD-SNMP-MIB::extNames.8 | .100.8 | .101.8 | mpt_check_sp1 | RAID status on sp1
UCD-SNMP-MIB::extNames.9 | .100.9 | .101.9 | mpt_check_sp2 | RAID status on sp2
UCD-SNMP-MIB::extNames.10 | .100.10 | .101.10 | exim_queue_check_sp1 | Number of undelivered mails in MTA queue on sp1
UCD-SNMP-MIB::extNames.11 | .100.11 | .101.11 | exim_queue_check_sp2 | Number of undelivered mails in MTA queue on sp2
UCD-SNMP-MIB::extNames.12 | .100.12 | .101.12 | provisioned_subscribers_check_sp1 | Number of subscribers provisioned on sp1
UCD-SNMP-MIB::extNames.13 | .100.13 | .101.13 | provisioned_subscribers_check_sp2 | Number of subscribers provisioned on sp2
UCD-SNMP-MIB::extNames.14 | .100.14 | .101.14 | kam_dialog_active_check_sp1 | Number of active calls on sp1
UCD-SNMP-MIB::extNames.15 | .100.15 | .101.15 | kam_dialog_active_check_sp2 | Number of active calls on sp2
UCD-SNMP-MIB::extNames.16 | .100.16 | .101.16 | kam_dialog_early_check_sp1 | Number of calls in Early Media state on sp1
UCD-SNMP-MIB::extNames.17 | .100.17 | .101.17 | kam_dialog_early_check_sp2 | Number of calls in Early Media state on sp2
UCD-SNMP-MIB::extNames.18 | .100.18 | .101.18 | kam_dialog_type_local_check_sp1 | Number of active local calls on sp1
UCD-SNMP-MIB::extNames.19 | .100.19 | .101.19 | kam_dialog_type_local_check_sp2 | Number of active local calls on sp2
UCD-SNMP-MIB::extNames.20 | .100.20 | .101.20 | kam_dialog_type_relay_check_sp1 | Number of active calls routed via peers on sp1
UCD-SNMP-MIB::extNames.21 | .100.21 | .101.21 | kam_dialog_type_relay_check_sp2 | Number of active calls routed via peers on sp2
UCD-SNMP-MIB::extNames.22 | .100.22 | .101.22 | kam_dialog_type_incoming_check_sp1 | Number of incoming calls on sp1
UCD-SNMP-MIB::extNames.23 | .100.23 | .101.23 | kam_dialog_type_incoming_check_sp2 | Number of incoming calls on sp2
UCD-SNMP-MIB::extNames.24 | .100.24 | .101.24 | kam_dialog_type_outgoing_check_sp1 | Number of outgoing calls on sp1
UCD-SNMP-MIB::extNames.25 | .100.25 | .101.25 | kam_dialog_type_outgoing_check_sp2 | Number of outgoing calls on sp2
UCD-SNMP-MIB::extNames.26 | .100.26 | .101.26 | kam_usrloc_regusers_check_sp1 | Number of subscribers with at least one active registration on sp1
UCD-SNMP-MIB::extNames.27 | .100.27 | .101.27 | kam_usrloc_regusers_check_sp2 | Number of subscribers with at least one active registration on sp2
UCD-SNMP-MIB::extNames.28 | .100.28 | .101.28 | kam_usrloc_regdevices_check_sp1 | Total number of registered end devices on sp1
UCD-SNMP-MIB::extNames.29 | .100.29 | .101.29 | kam_usrloc_regdevices_check_sp2 | Total number of registered end devices on sp2
UCD-SNMP-MIB::extNames.30 | .100.30 | .101.30 | mysql_replication_discrepancies_check_sp1 | Number of MySQL tables not in sync between sp1 and sp2
UCD-SNMP-MIB::extNames.31 | .100.31 | .101.31 | mysql_replication_discrepancies_check_sp2 | Number of MySQL tables not in sync between sp1 and sp2
UCD-SNMP-MIB::extNames.32 | .100.32 | .101.32 | sip_check_self | SIP response time in seconds on active node
UCD-SNMP-MIB::extNames.33 | .100.33 | .101.33 | mysql_check_self | Average number of MySQL queries per second on active node
UCD-SNMP-MIB::extNames.34 | .100.34 | .101.34 | mysql_replication_check_self | MySQL replication delay in seconds on active node
UCD-SNMP-MIB::extNames.35 | .100.35 | .101.35 | mpt_check_self | RAID status on active node
UCD-SNMP-MIB::extNames.36 | .100.36 | .101.36 | exim_queue_check_self | Number of undelivered mails in MTA queue on active node
UCD-SNMP-MIB::extNames.37 | .100.37 | .101.37 | provisioned_subscribers_check_self | Number of subscribers provisioned on active node
UCD-SNMP-MIB::extNames.38 | .100.38 | .101.38 | kam_dialog_active_check_self | Number of active calls on active node
UCD-SNMP-MIB::extNames.39 | .100.39 | .101.39 | kam_dialog_early_check_self | Number of calls in Early Media state on active node
UCD-SNMP-MIB::extNames.40 | .100.40 | .101.40 | kam_dialog_type_local_check_self | Number of active local calls on active node
UCD-SNMP-MIB::extNames.41 | .100.41 | .101.41 | kam_dialog_type_relay_check_self | Number of active calls routed via peers on active node
UCD-SNMP-MIB::extNames.42 | .100.42 | .101.42 | kam_dialog_type_incoming_check_self | Number of incoming calls on active node
UCD-SNMP-MIB::extNames.43 | .100.43 | .101.43 | kam_dialog_type_outgoing_check_self | Number of outgoing calls on active node
UCD-SNMP-MIB::extNames.44 | .100.44 | .101.44 | kam_usrloc_regusers_check_self | Number of subscribers with at least one active registration on active node
UCD-SNMP-MIB::extNames.45 | .100.45 | .101.45 | kam_usrloc_regdevices_check_self | Total number of registered end devices on active node
UCD-SNMP-MIB::extNames.46 | .100.46 | .101.46 | mysql_replication_discrepancies_check_self | Number of MySQL tables not in sync between sp1 and sp2

[Tip]

Some of the checks can be disabled (and some are disabled by default) through the config.yml file, and those checks will then return an error message or an empty string in their extOutput. Enable those checks in the config file to get their output in the SNMP OID tree. The enable/disable flags can be found in the checktools section.