The platform uses the internal monit service to monitor all essential
services. Since the sip:provider PRO runs in an active/standby mode, not
all services are always running on both nodes: some of them only run
on the active node and are stopped on the standby node. The following commands
show the most critical services on the platform:
* monit summary - to get the list of services and their current status,
* monit status - to get the list of services with detailed status.
important | |
When you perform a stop/start/monitor/unmonitor operation on a service, monit also affects the services that depend on it. Hence, if you stop or unmonitor a service, all services that depend on it will be stopped or unmonitored as well. |
For example, the monit stop mysql operation will stop kamailio, sbc, asterisk,
prosody and some other services. The recommended way to operate on
services, however, is via the ngcp-services wrapper, which takes care of
abstracting the underlying process monitoring implementation.
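The cascading behaviour described above can be illustrated with a small sketch. The dependency map below is hypothetical and only contains the mysql example from the text; the real dependency graph is defined in the monit configuration and is larger.

```python
from collections import deque

# Hypothetical dependency map: service -> services that directly depend on it.
# Only the mysql entry is taken from the text; it is deliberately incomplete.
DEPENDENTS = {
    "mysql": ["kamailio", "sbc", "asterisk", "prosody"],
}


def affected_services(service, dependents=DEPENDENTS):
    """Return the service plus everything that transitively depends on it."""
    seen = [service]
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


if __name__ == "__main__":
    # Services that a stop or unmonitor of mysql would also touch.
    print(affected_services("mysql"))
```

A service without dependents, such as prosody in this map, only affects itself.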
If any service ever fails for whatever reason, the monit daemon quickly
restarts it. When that happens, the daemon sends a notification email to
the address specified in the config.yml file under the general.adminmail
key. It also sends warning emails to this address under certain abnormal
conditions, such as high memory consumption (more than 75% used) or high CPU load.
important | |
In order for monit to be able to send emails to the specified
address, the local MTA (exim4) must be configured correctly. If you
haven’t done so already, run |
The platform uses the internal telegraf service to monitor many aspects of the system, including CPU, memory, swap, disk, filesystem, network, processes, NTP, Nginx, Redis and MySQL.
The gathered information is stored in InfluxDB, in the telegraf database.
The platform uses the internal ngcp-witnessd service to monitor NGCP-specific metrics or system metrics currently not tracked by telegraf, including memory, process count, Heartbeat, MTA, Kamailio, SIP and MySQL.
The gathered information is stored in InfluxDB, in the ngcp database.
The platform uses InfluxDB as a time series database, to store most of the metrics collected in the system.
On a sip:provider PRO, each node stores its own metrics and those of its peer node. This is done via influxdb-relay, which listens for InfluxDB writes and multiplexes them to the local node and any other node necessary.
The monitoring data is used by various components of the platform, including ngcp-collective-check, ngcp-snmp-agent and by the statistics dashboard powered by Grafana.
The monitoring data can also be accessed directly by various means: by using the influx command-line tool in CLI or TUI modes; by using the ngcp-influxdb-extract wrapper, which provides two convenience commands to run arbitrary queries or to fetch the last value for a measurement’s field; or by using the HTTP API with curl (or other HTTP fetchers) or with the Sipwise::InfluxDB::HTTP Perl module.
See https://docs.influxdata.com/influxdb/v1.1/query_language/spec/ for information about InfluxQL, the query language used by InfluxDB.
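As a rough illustration of the HTTP API access path, the following sketch builds a query URL for the InfluxDB 1.x /query endpoint, which takes the database in the db parameter and the InfluxQL statement in q. The host, the port (8086 is the InfluxDB 1.x default) and the cpu measurement are assumptions made for this example; the actual listening address on a given installation may differ.

```python
from urllib.parse import urlencode


def build_query_url(database, query, host="localhost", port=8086):
    """Build a GET URL for the InfluxDB 1.x /query endpoint.

    host and port are assumptions (8086 is the upstream default);
    adjust them to wherever InfluxDB listens on your node.
    """
    params = urlencode({"db": database, "q": query})
    return f"http://{host}:{port}/query?{params}"


if __name__ == "__main__":
    # "telegraf" is the database named in the text; "cpu" is an
    # illustrative measurement name, not confirmed by this document.
    print(build_query_url("telegraf", 'SELECT * FROM "cpu" LIMIT 1'))
```

The resulting URL can be passed to curl or any other HTTP fetcher.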
tip | |
To get the list of all measurements for a specific database the following
query can be used |
tip | |
To get the list of fields for a specific measurement the following query
can be used |
tip | |
To get the list of tags for a specific measurement the following query
can be used |
See Section 2.1, “InfluxDB monitoring keys” for detailed information about the list of data currently stored in the InfluxDB ngcp monitoring database.
The platform’s administration interface (described in Section 5, “VoIP Service Configuration Scenario”) provides a graphical overview, based on Grafana, of the most important system health indicators, such as memory usage, load averages and disk usage. VoIP statistics, such as the number of concurrent active calls and the number of provisioned and registered subscribers, are also shown.
The sip:provider PRO exports a variety of cluster health data and statistics
over the standard SNMP interface. By default, the SNMP interface can only be
accessed locally. To provide the SNMP data to an external system, edit the
config.yml file and populate the list of allowed community names and allowed
hosts/IP ranges. This list can be found under the checktools.snmpd.communities
key and consists of one or more community/source value pairs. The community
is the allowed community name, while source is an IP address or an IP block
from which requests are allowed.
SNMP notifications can be sent to an external system in a similar way, by
populating the checktools.snmpd.trap_communities key with community/target
value pairs. The community is the value that will be used when sending the
trap, while the target is the IP address to which the trap is sent.
The public entries with the localhost source and target are used for
local testing of SNMP functionality. It is recommended that you leave these
entries in place. Other legal sources can be formed as single IP addresses
or IP blocks in IP/prefix notation, for example 192.168.115.0/24. Other
targets can be formed as single IP addresses.
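The address rules above can be checked before editing config.yml with a small helper. This is only a sketch based on the rules as stated in the text (single IP address or IP/prefix block for sources, single IP address for targets, plus the localhost entries); the function names are hypothetical and are not part of the platform.

```python
import ipaddress


def is_valid_source(value):
    """Accept 'localhost', a single IP address, or an IP/prefix block."""
    if value == "localhost":
        return True
    try:
        # strict=False also accepts a plain address (treated as a /32 or /128).
        ipaddress.ip_network(value, strict=False)
    except ValueError:
        return False
    return True


def is_valid_target(value):
    """Accept 'localhost' or a single IP address (no prefix allowed)."""
    if value == "localhost":
        return True
    try:
        ipaddress.ip_address(value)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ("192.168.115.0/24", "192.168.115.7", "192.168.115.0/33"):
        print(candidate, is_valid_source(candidate), is_valid_target(candidate))
```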
tip | |
To locally check if SNMP is working correctly, execute the command
|
tip | |
To locally check if SNMP notifications (or traps) are working correctly,
install the snmptrapd package, which will be configured by default to
catch the traps sent by the localhost SNMP agent. The traps will show up on
|
INFO: SNMP version 1 and version 2c are supported.
There are two types of information that can be retrieved from SNMP. The first one is the native NGCP cluster overview from the Sipwise MIBs (Management Information Bases). The second is the legacy ad-hoc information using the Net-SNMP extension OIDs, and detailed information for the node running the SNMP daemon using standard OIDs (Object Identifiers).
The entire NGCP cluster can be monitored by using the SIPWISE-NGCP-MIB,
SIPWISE-NGCP-MONITOR-MIB and SIPWISE-NGCP-STATS-MIB. These OIDs are
rooted at the Sipwise NGCP slot .1.3.6.1.4.1.34274.1.*.
The MIBs are self-documented, and can be found as part of the
ngcp-snmp-mibs package (running dpkg -S SIPWISE*MIB will list their
pathnames). The NGCP SNMP Agent is part of the
ngcp-snmp-agent package, which is installed by default and works
out-of-the-box as long as snmpd has been properly configured.
The SIPWISE-NGCP-MIB acts as the root MIB and provides information
about the cluster licensing and layout (which is mostly static data about
each node, such as node name, its IP address, its roles, etc.) and information
required to access the OIDs from the other MIBs.
The SIPWISE-NGCP-MONITOR-MIB provides current monitoring information,
global health conditions, and the number of provisioned and registered subscribers
and devices. It also provides per-node information (independently of the number
of nodes or their names) on their filesystem, processes, databases, system load,
memory, heartbeat status, MTA queues, etc.
The SIPWISE-NGCP-STATS-MIB provides accumulated statistics on billing,
performance and processed SIP messages.
NOTICE: OIDs under the following trees are not yet implemented: ngcpMonitorFraud, ngcpMonitorPerformance.perfCAPSCurTable and ngcpStats.
INFO: The NGCP SNMP Agent uses Redis and InfluxDB as data sources. This data is essential for accurate and complete monitoring data in the SNMP OID tree. In addition, the Redis database must be available on a shared IP address, so that ngcp-witnessd can always write to it.
info | |
The following OIDs have been superseded by the Sipwise NGCP OIDs, but they are still provided for backwards compatibility. |
All basic system health variables (such as memory, disk, swap, CPU usage,
network statistics, process lists, etc.) for the mgmt node can be found
in standard OID slots from standard MIBs. For example, memory statistics
can be found through the UCD-SNMP-MIB in OIDs such as memTotalSwap.0,
memAvailSwap.0, memTotalReal.0, memAvailReal.0, etc., which
translate to numeric OIDs .1.3.6.1.4.1.2021.4.*. In fact,
UCD-SNMP-MIB is the most useful MIB for overall system health checks.
Additionally, there’s a list of specially monitored processes, also
found through the UCD-SNMP-MIB. UCD-SNMP-MIB::prNames
(.1.3.6.1.4.1.2021.2.1.2) gives the list of monitored processes,
prCount (.1.3.6.1.4.1.2021.2.1.5) is how many of each process are
running and prErrorFlag (.1.3.6.1.4.1.2021.2.1.100) gives a 0/1
error indication (with prErrMessage (.1.3.6.1.4.1.2021.2.1.101)
providing an explanation of any error).
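Once the four columns have been fetched (for example with an SNMP walk), they can be joined on the row index to list only the processes in error. The sketch below assumes each column has already been loaded into a dictionary keyed by row index; the function name and the sample values are hypothetical.

```python
def failing_processes(names, counts, error_flags, messages):
    """Join the prNames/prCount/prErrorFlag/prErrMessage columns by row index
    and return (name, count, message) for every row whose error flag is set."""
    failures = []
    for index, name in sorted(names.items()):
        if error_flags.get(index, 0) != 0:
            failures.append((name, counts.get(index, 0), messages.get(index, "")))
    return failures


if __name__ == "__main__":
    # Sample data is invented for illustration, not real agent output.
    names = {1: "process-a", 2: "process-b"}
    counts = {1: 1, 2: 0}
    flags = {1: 0, 2: 1}
    messages = {1: "", 2: "No process-b process running"}
    for name, count, message in failing_processes(names, counts, flags, messages):
        print(f"{name}: count={count} ({message})")
```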
tip | |
Some of these processes are not supposed to be running on the standby node, so you’ll see the error flag raised there. A possible solution is to run these SNMP checks against the shared service IP of the cluster. See Section 2.4, “High Availability and Fail-Over” below for more information. |
Furthermore, UCD-SNMP-MIB provides a list of custom external checks.
The names of these can be found under the UCD-SNMP-MIB::extNames (.2)
tree, with extOutput (.101) providing the output (one line) from each
check and extResult (.100) the exit code from each check.
The first of these external checks, called collective_check, provides
a combined and overall system health status indicator. It gathers
information from both nodes and returns 0 in extResult.1 (.100.1)
if everything is OK and running as it should. If it finds
a problem somewhere, but with the system still operational (e.g. a
service is stopped on the inactive node), extResult.1 will return
1 and extOutput.1 will be set to a string that can be used to
diagnose the problem. If the system is found in a critical and
non-operational state, extResult.1 will return 2, again with
an error message set. If you want to keep it really simple, you can
just monitor this one OID and raise an alarm whenever it is non-zero.
INFO: The 0/1/2 status codes allow for easy integration with Nagios.
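The mapping to Nagios can be sketched as follows. Nagios plugins conventionally exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN), so the collective_check result can be passed through almost unchanged. The function name and the message prefix are choices made for this example, and fetching extResult.1 and extOutput.1 over SNMP is left out.

```python
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def nagios_status(ext_result, ext_output=""):
    """Translate collective_check values into a Nagios exit code and message.

    Any result other than 0, 1 or 2 is reported as UNKNOWN (exit code 3).
    """
    state = NAGIOS_STATES.get(ext_result, "UNKNOWN")
    exit_code = ext_result if ext_result in NAGIOS_STATES else 3
    message = f"COLLECTIVE_CHECK {state}"
    if ext_output:
        message += f" - {ext_output}"
    return exit_code, message


if __name__ == "__main__":
    # The output string here is invented for illustration.
    code, text = nagios_status(1, "example diagnostic string")
    print(code, text)
```

A real plugin would print the message and call sys.exit() with the returned code.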
The remaining external checks simply return statistics on the system:
they all return a number in extOutput and always have extResult
set to zero.
The full list of such checks is below. All of these checks have three modes:
the first returns the statistics from sp1 (the first node in
the sip:provider PRO pair), the second from sp2,
and the third from whichever node is being queried (which is useful when
querying the shared service IP). For example, the local SIP response time from
sp1 is in sip_check_sp1, from sp2 in sip_check_sp2, and
from the host itself in sip_check_self.
The base OID of the Result and Output OIDs is always .1.3.6.1.4.1.2021.8.1,
so if you read .100.1, the full OID is .1.3.6.1.4.1.2021.8.1.100.1.
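The expansion from a relative OID to a full numeric OID is plain string concatenation, as this short sketch shows. The helper name is hypothetical.

```python
# Base OID of the extResult/extOutput columns, as given in the text.
EXT_BASE_OID = ".1.3.6.1.4.1.2021.8.1"


def full_oid(relative_oid):
    """Expand a relative OID such as '.100.1' into the full numeric OID."""
    return EXT_BASE_OID + "." + relative_oid.lstrip(".")


if __name__ == "__main__":
    print(full_oid(".100.1"))
    print(full_oid(".101.1"))
```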
Name in MIB | Result OID | Output OID | Check name | Description |
---|---|---|---|---|
UCD-SNMP-MIB::extNames.1 | .100.1 | .101.1 | collective_check | Summarized platform check |
UCD-SNMP-MIB::extNames.2 | .100.2 | .101.2 | sip_check_sp1 | SIP response time in seconds on sp1 |
UCD-SNMP-MIB::extNames.3 | .100.3 | .101.3 | sip_check_sp2 | SIP response time in seconds on sp2 |
UCD-SNMP-MIB::extNames.4 | .100.4 | .101.4 | mysql_check_sp1 | Average number of MySQL queries per second on sp1 |
UCD-SNMP-MIB::extNames.5 | .100.5 | .101.5 | mysql_check_sp2 | Average number of MySQL queries per second on sp2 |
UCD-SNMP-MIB::extNames.6 | .100.6 | .101.6 | mysql_replication_check_sp1 | MySQL replication delay in seconds on sp1 |
UCD-SNMP-MIB::extNames.7 | .100.7 | .101.7 | mysql_replication_check_sp2 | MySQL replication delay in seconds on sp2 |
UCD-SNMP-MIB::extNames.8 | .100.8 | .101.8 | mpt_check_sp1 | RAID status on sp1 |
UCD-SNMP-MIB::extNames.9 | .100.9 | .101.9 | mpt_check_sp2 | RAID status on sp2 |
UCD-SNMP-MIB::extNames.10 | .100.10 | .101.10 | exim_queue_check_sp1 | Number of mails undelivered in MTA queue on sp1 |
UCD-SNMP-MIB::extNames.11 | .100.11 | .101.11 | exim_queue_check_sp2 | Number of mails undelivered in MTA queue on sp2 |
UCD-SNMP-MIB::extNames.12 | .100.12 | .101.12 | provisioned_subscribers_check_sp1 | Number of subscribers provisioned on sp1 |
UCD-SNMP-MIB::extNames.13 | .100.13 | .101.13 | provisioned_subscribers_check_sp2 | Number of subscribers provisioned on sp2 |
UCD-SNMP-MIB::extNames.14 | .100.14 | .101.14 | kam_dialog_active_check_sp1 | Number of active calls on sp1 |
UCD-SNMP-MIB::extNames.15 | .100.15 | .101.15 | kam_dialog_active_check_sp2 | Number of active calls on sp2 |
UCD-SNMP-MIB::extNames.16 | .100.16 | .101.16 | kam_dialog_early_check_sp1 | Number of calls in Early Media state on sp1 |
UCD-SNMP-MIB::extNames.17 | .100.17 | .101.17 | kam_dialog_early_check_sp2 | Number of calls in Early Media state on sp2 |
UCD-SNMP-MIB::extNames.18 | .100.18 | .101.18 | kam_dialog_type_local_check_sp1 | Number of active calls local on sp1 |
UCD-SNMP-MIB::extNames.19 | .100.19 | .101.19 | kam_dialog_type_local_check_sp2 | Number of active calls local on sp2 |
UCD-SNMP-MIB::extNames.20 | .100.20 | .101.20 | kam_dialog_type_relay_check_sp1 | Number of active calls routed via peers on sp1 |
UCD-SNMP-MIB::extNames.21 | .100.21 | .101.21 | kam_dialog_type_relay_check_sp2 | Number of active calls routed via peers on sp2 |
UCD-SNMP-MIB::extNames.22 | .100.22 | .101.22 | kam_dialog_type_incoming_check_sp1 | Number of incoming calls on sp1 |
UCD-SNMP-MIB::extNames.23 | .100.23 | .101.23 | kam_dialog_type_incoming_check_sp2 | Number of incoming calls on sp2 |
UCD-SNMP-MIB::extNames.24 | .100.24 | .101.24 | kam_dialog_type_outgoing_check_sp1 | Number of outgoing calls on sp1 |
UCD-SNMP-MIB::extNames.25 | .100.25 | .101.25 | kam_dialog_type_outgoing_check_sp2 | Number of outgoing calls on sp2 |
UCD-SNMP-MIB::extNames.26 | .100.26 | .101.26 | kam_usrloc_regusers_check_sp1 | Number of subscribers with at least one active registration on sp1 |
UCD-SNMP-MIB::extNames.27 | .100.27 | .101.27 | kam_usrloc_regusers_check_sp2 | Number of subscribers with at least one active registration on sp2 |
UCD-SNMP-MIB::extNames.28 | .100.28 | .101.28 | kam_usrloc_regdevices_check_sp1 | Total number of registered end devices on sp1 |
UCD-SNMP-MIB::extNames.29 | .100.29 | .101.29 | kam_usrloc_regdevices_check_sp2 | Total number of registered end devices on sp2 |
UCD-SNMP-MIB::extNames.30 | .100.30 | .101.30 | mysql_replication_discrepancies_check_sp1 | Number of MySQL tables not in sync between sp1 and sp2 |
UCD-SNMP-MIB::extNames.31 | .100.31 | .101.31 | mysql_replication_discrepancies_check_sp2 | Number of MySQL tables not in sync between sp1 and sp2 |
UCD-SNMP-MIB::extNames.32 | .100.32 | .101.32 | sip_check_self | SIP response time in seconds on active node |
UCD-SNMP-MIB::extNames.33 | .100.33 | .101.33 | mysql_check_self | Average number of MySQL queries per second on active node |
UCD-SNMP-MIB::extNames.34 | .100.34 | .101.34 | mysql_replication_check_self | MySQL replication delay in seconds on active node |
UCD-SNMP-MIB::extNames.35 | .100.35 | .101.35 | mpt_check_self | RAID status on active node |
UCD-SNMP-MIB::extNames.36 | .100.36 | .101.36 | exim_queue_check_self | Number of mails undelivered in MTA queue on active node |
UCD-SNMP-MIB::extNames.37 | .100.37 | .101.37 | provisioned_subscribers_check_self | Number of subscribers provisioned on active node |
UCD-SNMP-MIB::extNames.38 | .100.38 | .101.38 | kam_dialog_active_check_self | Number of active calls on active node |
UCD-SNMP-MIB::extNames.39 | .100.39 | .101.39 | kam_dialog_early_check_self | Number of calls in Early Media state on active node |
UCD-SNMP-MIB::extNames.40 | .100.40 | .101.40 | kam_dialog_type_local_check_self | Number of active calls local on active node |
UCD-SNMP-MIB::extNames.41 | .100.41 | .101.41 | kam_dialog_type_relay_check_self | Number of active calls routed via peers on active node |
UCD-SNMP-MIB::extNames.42 | .100.42 | .101.42 | kam_dialog_type_incoming_check_self | Number of incoming calls on active node |
UCD-SNMP-MIB::extNames.43 | .100.43 | .101.43 | kam_dialog_type_outgoing_check_self | Number of outgoing calls on active node |
UCD-SNMP-MIB::extNames.44 | .100.44 | .101.44 | kam_usrloc_regusers_check_self | Number of subscribers with at least one active registration on active node |
UCD-SNMP-MIB::extNames.45 | .100.45 | .101.45 | kam_usrloc_regdevices_check_self | Total number of registered end devices on active node |
UCD-SNMP-MIB::extNames.46 | .100.46 | .101.46 | mysql_replication_discrepancies_check_self | Number of MySQL tables not in sync between sp1 and sp2 |
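The numbering in the table follows a regular pattern: the sp1 and sp2 variants alternate from index 2 to 31, and the self variants occupy indexes 32 to 46 in the same order. The following sketch derives the index from the check name and mode under that assumption. It reflects the table as printed here only, and the numbering may differ on installations where checks are disabled or on other releases.

```python
# Base check names in the order they appear in the table above.
CHECKS = [
    "sip_check",
    "mysql_check",
    "mysql_replication_check",
    "mpt_check",
    "exim_queue_check",
    "provisioned_subscribers_check",
    "kam_dialog_active_check",
    "kam_dialog_early_check",
    "kam_dialog_type_local_check",
    "kam_dialog_type_relay_check",
    "kam_dialog_type_incoming_check",
    "kam_dialog_type_outgoing_check",
    "kam_usrloc_regusers_check",
    "kam_usrloc_regdevices_check",
    "mysql_replication_discrepancies_check",
]


def ext_index(base, mode):
    """Return the extNames index for a check base name and mode.

    mode is one of 'sp1', 'sp2' or 'self'. Index 1 is collective_check.
    """
    position = CHECKS.index(base)  # raises ValueError for unknown checks
    if mode == "sp1":
        return 2 + 2 * position
    if mode == "sp2":
        return 3 + 2 * position
    if mode == "self":
        return 32 + position
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    index = ext_index("mysql_replication_check", "self")
    print(f"mysql_replication_check_self: .100.{index} / .101.{index}")
```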
tip | |
Some of the checks can be disabled (most are enabled by default)
through the |