Media streaming via WebRTC
Media streaming is nothing new on the web (you guys heard of YouTube?). However, it rarely shows up among the WebRTC use cases we usually see, such as plain P2P communication, multi-party video conferencing and server-side gesture detection.
With WebRTC, media streaming from server to clients (e.g. live streaming and video-on-demand) is possible. You can actually get a two-way media connection with both audio and video, plus your two-way signalling channel for other data. Just imagine what you could do on the server with all the audio and video coming in from your watchers.
Since we always love to experiment with cutting-edge technology at Sipwise, here is a concept for a WebRTC-based media streaming platform built purely on open source software. Since it uses plain SIP, you can also watch streams with normal SIP clients like Jitsi on your PC/Mac or CounterPath Bria on your phone/tablet!
The Ingredients
Streaming media to a browser via WebRTC requires you to deliver an audio stream encoded with Opus (or G711, which is not really a viable option due to its quality) and a video stream encoded with VP8 (or probably H264 in the future), both encrypted via DTLS-SRTP. So, on one hand, you need a signalling server negotiating the audio and video codecs, and on the other hand, you need a media engine that transcodes the streams to the requested codecs and encrypts them with keys that are negotiated in-band within the media stream.
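To make the DTLS-SRTP requirement a bit more tangible, here is a minimal Python sketch that looks at an SDP offer and tells, per media section, whether the client asked for DTLS-SRTP or for plain RTP. It relies on two standard SDP markers: a secure transport profile such as UDP/TLS/RTP/SAVPF on the m= line, and an a=fingerprint attribute. The function name and both sample offers are made up for illustration and are not part of the platform described here.

```python
def media_security(sdp):
    """Map each media type to True if it is offered as DTLS-SRTP."""
    session_fingerprint = False
    sections = {}
    current = None
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m="):
            media, _port, proto = line[2:].split()[:3]
            current = media
            # A session-level fingerprint applies to every media section.
            sections[media] = {"proto": proto, "fingerprint": session_fingerprint}
        elif line.startswith("a=fingerprint:"):
            if current is None:
                session_fingerprint = True
            else:
                sections[current]["fingerprint"] = True
    return {
        media: "SAVP" in info["proto"] and info["fingerprint"]
        for media, info in sections.items()
    }


# Hypothetical offers; the fingerprint value is a shortened placeholder.
WEBRTC_OFFER = """
v=0
o=- 1 1 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
a=fingerprint:sha-256 00:11:22:33
m=audio 50000 UDP/TLS/RTP/SAVPF 111
a=rtpmap:111 opus/48000/2
m=video 50002 UDP/TLS/RTP/SAVPF 100
a=rtpmap:100 VP8/90000
"""

PLAIN_OFFER = """
v=0
o=- 1 1 IN IP4 192.0.2.20
s=-
c=IN IP4 192.0.2.20
t=0 0
m=audio 40000 RTP/AVP 8
"""

print(media_security(WEBRTC_OFFER))
print(media_security(PLAIN_OFFER))
```

A browser would send something like the first offer, while a classic SIP client would typically send something like the second one.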
Now the idea is to be able to simply call a subscriber (e.g. stream@example.org) via SIP and get the content streamed to your client.
Since we have quite a few tools at hand at Sipwise, the choice is fairly simple. The architecture of our WebRTC streaming platform consists of four components: a signalling server, a media engine, an application server and a media transcoder.
Let’s go over the different components and their interaction with each other:
Signalling Server
For signalling, we’re going to use the Sipwise Sip:provider CE. It’s an open source VoIP soft-switch that allows us to communicate with SIP clients over WebSockets via the integrated Kamailio SIP proxy. We can also manage the users via the web interfaces and even do billing based on how long a user is watching the stream.
We can establish media sessions both via SIP over WebSockets and via normal UDP, TCP or TLS, which gives us great flexibility in hooking up different types of clients. For WebRTC in particular, we need a SIP stack in JavaScript, and we’re going to use tryit.jssip.net as a readily available SIP client for WebRTC.
Media Engine
Part of the Sipwise Sip:provider CE is the rtpengine, which is a media proxy for Kamailio, developed by Sipwise. It supports transcoding DTLS-SRTP streams to normal RTP and vice versa, so we don’t need to care about the crypto part in our application server, which is going to deliver the streams.
Application Server
Our application server will be the called party in the signalling stream. To make the overall architecture as simple and non-intrusive as possible for existing components, we’re just going to register it as a normal subscriber to the SIP platform. That way, you can create as many subscribers as you want with any name you want (e.g. stream@example.org, livetv@example.org, thehobbit@example.org or whatever), and register an application server for each of them, serving a particular stream or movie.
Once this subscriber is called by the SIP client of the viewer, it needs to inspect the SDP body of the INVITE message to figure out the list of supported audio and video codecs. It needs to choose a possible codec combination (we decided to support H264 and VP8 for video, and Opus and G711a for audio) and pass this information, along with the media IPs and ports of the caller, to the media transcoder, so it can start streaming to the client.
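The codec selection step could be sketched roughly like this. Our actual user agent is written in Perl, so the Python below is only an illustration of the idea: parse the SDP offer, collect the offered codecs per media section, and pick the first one we support. The function names, the preference order and the sample offer are assumptions made for this sketch.

```python
# Preference order is an assumption; only the supported set is given above.
PREFERRED = {"audio": ["opus", "PCMA"], "video": ["VP8", "H264"]}
# Static payload types (RFC 3551) may be offered without an a=rtpmap line.
STATIC_PAYLOADS = {"0": "PCMU", "8": "PCMA"}


def parse_offer(sdp):
    """Return {media: {"ip", "port", "codecs": {payload_type: name}}}."""
    session_ip = None
    sections = {}
    current = None
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("c="):
            ip = line.split()[-1]
            if current is None:
                session_ip = ip          # session-level connection address
            else:
                current["ip"] = ip       # media-level override
        elif line.startswith("m="):
            media, port, _proto, *payloads = line[2:].split()
            current = {
                "ip": session_ip,
                "port": int(port),
                "codecs": {pt: STATIC_PAYLOADS.get(pt) for pt in payloads},
            }
            sections[media] = current
        elif line.startswith("a=rtpmap:") and current is not None:
            pt, encoding = line[len("a=rtpmap:"):].split(None, 1)
            if pt in current["codecs"]:
                current["codecs"][pt] = encoding.split("/")[0]
    return sections


def choose_codecs(sections):
    """Pick the first preferred codec the caller offered, per media type."""
    choice = {}
    for media, wanted in PREFERRED.items():
        section = sections.get(media)
        if section is None:
            continue
        offered = {}
        for pt, name in section["codecs"].items():
            if name:
                offered.setdefault(name.lower(), pt)
        for codec in wanted:
            if codec.lower() in offered:
                choice[media] = {
                    "codec": codec,
                    "payload_type": offered[codec.lower()],
                    "ip": section["ip"],
                    "port": section["port"],
                }
                break
    return choice


# Hypothetical offer as a browser might send it (heavily shortened).
OFFER = """
v=0
o=- 1 1 IN IP4 192.0.2.10
s=-
c=IN IP4 192.0.2.10
t=0 0
m=audio 50000 UDP/TLS/RTP/SAVPF 111 8
a=rtpmap:111 opus/48000/2
m=video 50002 UDP/TLS/RTP/SAVPF 100 107
a=rtpmap:100 VP8/90000
a=rtpmap:107 H264/90000
"""

print(choose_codecs(parse_offer(OFFER)))
```

The resulting dictionary holds everything the media transcoder needs per stream: the codec, the payload type, and the IP and port to send to.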
When the call is hung up, the application server needs to tear down the streaming session and clean up after it.
For that particular building block, we decided to build our own very simple user agent in Perl based on the Net::SIP module.
Media Transcoder
The most interesting question in the planning phase was how to hook up an arbitrary media stream to our SIP session in the simplest and most flexible way. Our choice was VideoLAN Client (VLC), as it satisfied most of the requirements: it supports the audio and video codecs we need, and it is able to read all kinds of media sources and stream them via RTP to a recipient. VLC also generates SDP for RTP streaming, but it’s not suited for SDP offer/answer, so we had to patch VLC to be able to control the RTP streams in a more fine-grained way.
Using VLC, you can stream any kind of media sources, like movie files, live streams provided from somewhere else via RTSP, or even TV streams by hooking up a DVB receiver (e.g. a PCI card or even a USB stick) to your server. Yay to Live-TV over SIP!
We control VLC by starting it with its telnet interface enabled and sending telnet commands for the streams from the application server.
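To give a rough idea of what such a control channel can look like, here is a Python sketch that builds command lines in the style of stock VLC's VLM (new, setup, control, del) and sends them to the telnet interface over a plain TCP socket. Keep in mind that we run a patched VLC, so the commands we actually use differ. The helper names, the codec identifiers, the password and the use of port 4212 (the default telnet port in stock VLC) are all assumptions for this sketch.

```python
import socket


def vlm_start_commands(name, source, dst_ip, audio_port, video_port, vcodec, acodec):
    """Build VLM-style commands that transcode `source` and send it via RTP."""
    sout = (
        f"#transcode{{vcodec={vcodec},acodec={acodec}}}:"
        f"rtp{{dst={dst_ip},port-audio={audio_port},port-video={video_port}}}"
    )
    return [
        f"new {name} broadcast enabled",
        f"setup {name} input {source}",
        f"setup {name} output {sout}",
        f"control {name} play",
    ]


def vlm_stop_commands(name):
    """Build the commands that stop a stream and remove it again."""
    return [f"control {name} stop", f"del {name}"]


def send_commands(commands, host="127.0.0.1", port=4212, password="secret"):
    """Send the password and then each command as one line (no reply parsing)."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall((password + "\n").encode())
        for command in commands:
            conn.sendall((command + "\n").encode())


# Hypothetical call: stream a file to the address taken from the SDP offer.
commands = vlm_start_commands(
    "stream1", "/srv/media/movie.mp4", "192.0.2.10", 50000, 50002,
    vcodec="VP80", acodec="opus",  # codec identifiers are assumptions
)
for command in commands:
    print(command)
```

On call setup, the application server would pass the start commands to send_commands(), and on hangup the ones from vlm_stop_commands() with the same stream name.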
The Result
Based on the architecture outlined above, we came up with a working implementation pretty quickly, and it works really well on both Chrome and Firefox using WebRTC. Because the signalling is based on SIP, it works out of the box with Jitsi too.
Conclusion
Although WebRTC doesn’t dictate any signalling protocol, it’s good to base the foundation on standard protocols, especially if you want to integrate your service into existing infrastructures and/or want to re-use readily available tools.
By splitting up the functional requirements into clearly defined parts, you can easily have them handled by readily available tools, and end up building only the missing interfaces (and making slight adaptations to your tools) to make them work together.
Another advantage is that you can easily split the functional parts physically (as they are communicating over the network anyway), and therefore scale the platform without too much pain.
What’s next
Due to the two-way communication channel you’re creating with your viewers, you can get user feedback in both audio and video back to the streaming server in real time, and possibly chat via the signalling channel too. This opens up quite a lot of possibilities and room for creativity.
Any ideas for fun and interesting use cases?