SIP monitoring and analysis
Operating a VoIP system that focuses on great customer experience can be quite challenging. This is especially apparent if you run a heterogeneous network with lots of different SIP clients (various software clients, all kinds of SIP phones and terminal adapters, and especially IP PBXs). SIP clients are known to have all kinds of quirks and implementation errors. If you don’t control them yourself (e.g. with a central device provisioning tool), configuration errors introduced by your customers come into play as well.
Putting the right values into the configuration interface of the clients is not always straightforward. It sometimes takes an engineering degree to find out what’s up with parameters like registrar, outbound-proxy, session-timers, codec ordering etc. Flexibility is not always key, especially when it comes to end-user interfaces. That is why Skype is so successful: “it just works”.
Anyway, if a customer uses your VoIP service (especially if it’s a paid service), it just needs to work. If it doesn’t, you had better pin down the cause of the error as soon as possible and provide a solution, otherwise the customer will turn away from you quite quickly.
The poor man’s approach
In the past, VoIP troubleshooting went somewhere along this line (we’ve been there and done that):
- Ask the customer approximately when she made the failed call or failed to register her phone
- Grep the (hopefully extensive) log files for hints pointing to the error
- If nothing obvious comes up there, start a tcpdump on the system and ask the customer to try the call again
- Copy the resulting trace to your local machine and try to extract the relevant packets from a potentially HUGE trace
- Analyse the call, take your actions, and if necessary repeat the process
This approach has some obvious flaws. First, your support agent needs direct access to the system and the proper rights to start a trace. It is also quite time-consuming, and it hardly gives a professional impression if you need to ask your customer to do something so that you can find the problem. On top of that, it’s a heavily manual process that requires quite some technical expertise to pull off, and if the support agent needs to escalate the issue to 2nd Level Support, it involves uploading SIP traces somewhere, or even worse, sending them back and forth by email.
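To illustrate the "extract the relevant packets" step from the list above, here is a minimal Python sketch. It assumes the trace has already been decoded into a list of raw SIP message strings (parsing the PCAP file itself is out of scope here), and the names `group_by_call_id` and `calls_involving` as well as the sample messages are hypothetical, made up for this example.

```python
import re
from collections import defaultdict

# Call-ID may also appear in its compact form "i" (RFC 3261).
CALL_ID_RE = re.compile(r"^(?:Call-ID|i)\s*:\s*(\S+)", re.IGNORECASE | re.MULTILINE)


def group_by_call_id(messages):
    """Group raw SIP messages (strings) by the value of their Call-ID header."""
    calls = defaultdict(list)
    for msg in messages:
        match = CALL_ID_RE.search(msg)
        if match:
            calls[match.group(1)].append(msg)
    return dict(calls)


def calls_involving(messages, user):
    """Keep only the dialogs whose From or To header mentions sip:<user>@."""
    header_re = re.compile(
        r"^(?:From|f|To|t)\s*:.*sip:" + re.escape(user) + r"@",
        re.IGNORECASE | re.MULTILINE,
    )
    return {
        call_id: msgs
        for call_id, msgs in group_by_call_id(messages).items()
        if any(header_re.search(m) for m in msgs)
    }


# Hypothetical sample messages, heavily shortened for readability.
trace = [
    "INVITE sip:bob@example.org SIP/2.0\r\n"
    "From: <sip:alice@example.org>;tag=1\r\n"
    "To: <sip:bob@example.org>\r\n"
    "Call-ID: abc123\r\n\r\n",
    "SIP/2.0 486 Busy Here\r\n"
    "From: <sip:alice@example.org>;tag=1\r\n"
    "To: <sip:bob@example.org>;tag=2\r\n"
    "Call-ID: abc123\r\n\r\n",
    "REGISTER sip:example.org SIP/2.0\r\n"
    "From: <sip:carol@example.org>;tag=3\r\n"
    "To: <sip:carol@example.org>\r\n"
    "Call-ID: def456\r\n\r\n",
]

for call_id, msgs in calls_involving(trace, "alice").items():
    print(call_id, len(msgs))
```

Even with a helper like this, somebody still has to capture the trace and copy it around first, which is exactly the overhead described above.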
External Monitoring Tools to the rescue!
Due to the huge overhead of the traditional troubleshooting approach, a whole new ecosystem has emerged around external SIP monitoring and analysis. New start-ups were created to tackle these issues and established network monitoring vendors pushed into the market, providing traffic analyser solutions to ease the pain of VoIP support. The problem for small VoIP operators is that these solutions can be horrendously expensive. In the telephony industry, licensing models are broken down to a per-line or per-subscriber price, and it’s not uncommon that the line price of the analyser tools exceeds the line price of the VoIP soft-switch, which is simply not feasible.
However, since open source projects are increasingly getting a foothold in the VoIP market, it’s only natural that open source VoIP monitoring and troubleshooting tools are starting to appear as well. The most promising project in the open source landscape is Homer, an open source SIP capturing server. Since it can passively wiretap traffic on mirrored switch ports, it integrates nicely into a VoIP network environment without interfering with existing networking elements.
Using such tools, the support process changes significantly, because all SIP packets are constantly captured on the network and can be filtered and viewed on web interfaces. Most of them, like Homer, also visually present the call flows of the SIP packets, so it gets very easy to spot issues between the involved hops.
Instead of having to involve the customer in the troubleshooting process, it becomes something like this:
- Filter for calls or registrations of the respective customer
- Visually check the call flows and packets for obvious issues
- If necessary, grep the logs for specific calls
- Take actions and repeat the process if necessary
If more people need to be involved in the troubleshooting process, just the link to the call flow in question needs to be shared.
However, the problem with such tools is that they can only provide an external view of a VoIP system, because in most cases it’s not possible to hook into the internal communication of a VoIP soft-switch. For example, the Sipwise sip:provider appliances consist of several SIP elements communicating with each other on the local interface, and this traffic can’t be captured without “opening up” the soft-switch and installing additional software on it, which might either be impossible altogether or void any warranties provided by the vendor.
The Sipwise Approach
To get a complete view of the SIP packet flows inside the VoIP system as well, we have integrated the first version of our own SIP monitoring and troubleshooting system into the upcoming version 2.6 of the sip:providerPRO platform. It provides deep insights into past and current call flows by laying out a break-down of SIP requests and responses, as well as visual call graphs and packet details. The advantage over external solutions is that it integrates tightly into the existing Administrative Interface.
An overview of the amount and distribution of the various requests and responses gives you useful hints for predicting failures. We’re working hard to also implement trending and prediction of issues, so countermeasures can be taken pro-actively before complaints start hitting your support team.
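As a rough idea of what such a break-down boils down to, the following Python sketch counts SIP request methods and response codes from a list of message start lines. It is not the implementation used in the platform; the function names and the simple `failure_ratio` metric are assumptions chosen for illustration.

```python
import re
from collections import Counter

REQUEST_RE = re.compile(r"^([A-Z]+) \S+ SIP/2\.0")
RESPONSE_RE = re.compile(r"^SIP/2\.0 (\d{3})")


def distribution(start_lines):
    """Count request methods and response status codes from SIP start lines."""
    requests, responses = Counter(), Counter()
    for line in start_lines:
        response = RESPONSE_RE.match(line)
        if response:
            responses[int(response.group(1))] += 1
            continue
        request = REQUEST_RE.match(line)
        if request:
            requests[request.group(1)] += 1
        # Lines matching neither pattern are ignored.
    return requests, responses


def failure_ratio(responses):
    """Share of final responses (status >= 200) that are 4xx, 5xx or 6xx."""
    final = sum(n for code, n in responses.items() if code >= 200)
    failed = sum(n for code, n in responses.items() if code >= 400)
    return failed / final if final else 0.0


# Hypothetical start lines as they might be extracted from captured traffic.
lines = [
    "INVITE sip:bob@example.org SIP/2.0",
    "SIP/2.0 100 Trying",
    "SIP/2.0 486 Busy Here",
    "REGISTER sip:example.org SIP/2.0",
    "SIP/2.0 200 OK",
]
requests, responses = distribution(lines)
print(requests, responses, failure_ratio(responses))
```

Tracking a ratio like this over time is one simple way to notice that something is going wrong before the first complaint arrives.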
To troubleshoot customer issues, all call scenarios are listed directly in the subscriber view, so you don’t have to search for calls belonging to specific customers.
Each call scenario provides a dynamically rendered graphical representation of the call flow, so you can easily spot any issues in the call routing directly on a network level.
The call scenario is clickable, so you can easily dig into the details of a specific packet.
For further analysis, you can also download the raw SIP trace in PCAP format.
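To give an idea of what a call-flow rendering involves, here is a minimal text-only sketch in Python. The web interface draws proper graphics; this hypothetical `render_ladder` function only prints an ASCII ladder diagram and assumes that hop names are shorter than the column width.

```python
def render_ladder(hops, messages, width=24):
    """Render (src, dst, label) messages as an ASCII ladder diagram.

    hops is the ordered list of hop names (one column each),
    width is the number of characters between two neighbouring columns.
    """
    total = (len(hops) - 1) * width + 1
    lines = ["".join(name.ljust(width) for name in hops).rstrip()]
    for src, dst, label in messages:
        a, b = hops.index(src), hops.index(dst)
        if a == b:
            raise ValueError("source and destination must differ")
        left, right = min(a, b) * width, max(a, b) * width
        row = [" "] * total
        for i in range(len(hops)):
            row[i * width] = "|"          # one vertical bar per hop
        for pos in range(left + 1, right):
            row[pos] = "-"                # arrow shaft
        if a < b:
            row[right - 1] = ">"          # left-to-right arrowhead
        else:
            row[left + 1] = "<"           # right-to-left arrowhead
        # Centre the label on the shaft, truncated so the arrowhead survives.
        text = f" {label} "[: right - left - 3]
        start = left + 1 + (right - left - 1 - len(text)) // 2
        row[start:start + len(text)] = text
        lines.append("".join(row))
    return "\n".join(lines)


# Hypothetical three-hop scenario.
diagram = render_ladder(
    ["UA", "Proxy", "PBX"],
    [("UA", "Proxy", "INVITE"), ("Proxy", "UA", "100 Trying")],
    width=16,
)
print(diagram)
```

A failed or misrouted request tends to stand out immediately in such a view, because the arrows simply stop at the hop that rejected it.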
What’s next?
We’ve learned that it is extremely important to provide a very simple way to get an overview of what is going on at any given moment directly on a networking level, because log files don’t always provide all the information needed to troubleshoot an issue. It is crucial to be able to analyse calls which happened in the past, so you don’t have to bother a customer with any actions during the troubleshooting process.
Another major task is to counteract emerging issues before they pile up, and most importantly before customers are affected. Our focus will be to extend the described tools to predict issues where possible, so you’ll be able to react before problems escalate.