

RTSP is a protocol that connects two endpoints and establishes a unidirectional, ultra-low-latency audio/video stream between them.

Today it is mostly used in IP cameras for historical reasons. It is a well-designed, highly extensible protocol that has enough features and capabilities to be used today and has not become outdated during its 30-year life. We are mad about ultra-zero-low-minimal latency nowadays, so this protocol is still relevant.

RTSP can be compared to SIP and WebRTC. SIP is very similar to RTSP, but it controls bidirectional (or more complicated) topologies and is used in IP telephony and video conferencing systems. WebRTC is similar to SIP, but the text signalling protocol is replaced with an unspecified HTTP-based way to exchange information. The WHEP/WHIP extensions for WebRTC, however, are a direct replacement for RTSP.

Compared to MPEG-TS, RTSP is focused on audio/video delivery over IP, while MPEG-TS is focused on providing TV services over unidirectional media.

Flussonic supports:

| Initiator | Direction | Description | Usage |
|---|---|---|---|
| Flussonic | Inside | ingest video from IP cameras via RTSP | Most common usage in video surveillance |
| ffmpeg | Inside | accept publish from ffmpeg via RTSP (we haven't seen any other software doing it) | Rarely used |
| external VMS | Outside | playback to external clients via RTSP (usually for video surveillance needs) | Other VMS systems |
| Flussonic | Outside | pushing to another server (usually Flussonic) via RTSP | Rarely used |

RTSP/2.0 is sometimes mentioned, but it is practically never met in the wild. Everyone uses different flavors of RTSP/1.0.

There are two different, incompatible flavors of RTSP in the wild:

  • IP camera RTSP, which transmits video and audio in separate UDP or interleaved TCP streams. Usually unicast.
  • IPTV RTSP, which transmits video and audio inside MPEG-TS, encapsulated in RTP packets and sent via multicast.

Flussonic has full support for the first option and limited support for the second (only ingest).


The word RTSP implies using the following standards:

  • RTSP as a text signalling protocol. It looks just like HTTP and is very similar to it.
  • SDP, a text format for a single message transmitted from the video source to the video destination that describes the content and the ways to get it.
  • RTP, a binary protocol for transmitting video, audio and metadata via UDP, or via TCP in the same socket that carries RTSP.
  • RTCP, a binary bidirectional protocol for exchanging control messages related to the transmitted RTP.


While HTTP can live with only two verbs (GET and POST), extensions like WebDAV add lists of business-level verbs such as MKCOL and MOVE.

RTSP has a dozen well-known verbs that are used at the logic level. The most common are: OPTIONS/GET_PARAMETER, DESCRIBE, ANNOUNCE, SETUP, PLAY, RECORD, TEARDOWN.

The initiator of the connection sends requests and receives responses. Terms like client and server are not convenient here, because it is not very clear who is the client and who is the server when two servers connect to each other.

For example, when Flussonic reads video from an RTSP camera, it sends the following requests:

> OPTIONS rtsp:// RTSP/1.0
> Authorization: Basic c2VjcmV0
> CSeq: 1
< RTSP/1.0 200 OK
< CSeq: 1 
> DESCRIBE rtsp:// RTSP/1.0
> Authorization: Basic c2VjcmV0
> CSeq: 2
< RTSP/1.0 200 OK
< CSeq: 2
< Content-Type: application/sdp
< Content-Length: 370
.... here goes SDP
> SETUP rtsp:// RTSP/1.0
> Authorization: Basic c2VjcmV0
> CSeq: 3
< RTSP/1.0 200 OK
< CSeq: 3
> PLAY rtsp:// RTSP/1.0
> Authorization: Basic c2VjcmV0
> CSeq: 4
< RTSP/1.0 200 OK
< CSeq: 4
> GET_PARAMETER rtsp:// RTSP/1.0
> Authorization: Basic c2VjcmV0
> CSeq: 5
< RTSP/1.0 200 OK
< CSeq: 5

This is a very limited example, just to give you a brief idea of what is happening here.

Note that you need to know the full RTSP URL with its path: rtsp:// If you do not know the path on this server (the /h264 part), you will not be able to play anything.

The problem of RTSP URL discovery is covered by Onvif.

You can see a strange GET_PARAMETER call after PLAY. Nobody is going to get any parameters here; it is just an empty keepalive call telling that the initiator is still alive and still wants to send/receive video.

This looks rather insane when you use the same socket to send video, but code is usually organized so that if you send megabits of video per second and send RTCP packets, but do not send GET_PARAMETER, the connection will be terminated.


Session Description Protocol, as you can guess, usually does not describe any session. It describes media and sometimes the way to receive/send this media.

Example of SDP:

v=0
o=- 1273580251173374 1273580251173374 IN IP4 axis-00408ca51334.local
s=Media Presentation
c=IN IP4
t=0 0
m=video 0 RTP/AVP 96
a=rtpmap:96 H264/90000
a=fmtp:96 packetization-mode=1; profile-level-id=420029; sprop-parameter-sets=Z0IAKeNQFAe2AtwEBAaQeJEV,aM48gA==

This trivial SDP has enough information to configure the decoder (sprop-parameter-sets) and establish playback (a=control: field).

Yes, this self-explanatory and human-readable text raises a question: why not just JSON? Because it was standardized years before the rise of JSON, and it is great luck that it is not XML.

Like in many other protocols, a lot of fields in SDP are useless and don't change anything, but when you write widely used software, it is very hard to guess exactly which fields are useless.


Real-time Transport Protocol (RTP) is the main protocol used for delivery of audio and video. It is binary and rather simple.

Each packet contains:

  • track ID (all audio and video tracks are delivered separately)
  • sequence ID for controlling packet drops, reordering, retransmit, etc.
  • timecode of this packet
  • optional extensions
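These fields live in a fixed 12-byte header (RFC 3550). A sketch of unpacking it; note that in RTSP the "track ID" is carried by the payload type and SSRC fields (each track also gets its own UDP port pair or TCP channel):

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Unpack the fixed 12-byte RTP header."""
    v_p_x_cc, m_pt, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": v_p_x_cc >> 6,       # always 2
        "has_extension": bool(v_p_x_cc & 0x10),
        "payload_type": m_pt & 0x7F,    # maps to the SDP rtpmap entry
        "marker": bool(m_pt & 0x80),    # for video, usually the last packet of a frame
        "seq": seq,                     # for drop/reorder/retransmit control
        "timestamp": timestamp,         # timecode in media clock units (e.g. 1/90000 s)
        "ssrc": ssrc,                   # identifies the stream
    }
```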

RTP can be so convenient that it is even used to carry MPEG-TS (which also has a continuity counter and timestamps) in IP networks. This is used for retransmitting lost packets.

RTP is focused on using UDP with maximum control of delivery in userspace. This is why RTP packets are usually limited to around 1400 bytes, which allows them to be transmitted without fragmentation. This size is too big for audio frames and too small for video.

All specifications for packing audio/video inside RTP offer some kind of fragmentation and aggregation.

The idea of fragmentation is simple: if you lose 1400 bytes in the middle of a 30KB frame, it is possible to recover from this loss or even restore the data. If you lose a 30KB UDP packet, it is almost impossible to recover it.

The interleaved TCP transport uses 2 bytes for the packet size, and UDP datagrams are also limited to 64KB, so it is impossible to make a packet larger than that. High-resolution video streams have keyframes bigger than this limit, so fragmentation is mandatory.
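As a concrete example, H264 fragmentation in RTP (RFC 6184) uses FU-A units: one large NAL unit is split into MTU-sized pieces, each prefixed with a 2-byte FU header that preserves the NAL type and marks the first and last fragment. A minimal sketch of the splitting logic:

```python
FU_A = 28  # NAL unit type reserved for FU-A fragments

def fragment_nal(nal: bytes, max_payload: int = 1400) -> list[bytes]:
    """Split one H264 NAL unit into FU-A fragments of at most max_payload bytes."""
    if len(nal) <= max_payload:
        return [nal]  # small enough, send as a single NAL unit
    indicator = (nal[0] & 0xE0) | FU_A   # keep the F and NRI bits
    nal_type = nal[0] & 0x1F
    step = max_payload - 2               # 2 bytes go to the FU indicator + header
    chunks = [nal[i:i + step] for i in range(1, len(nal), step)]
    out = []
    for i, chunk in enumerate(chunks):
        start = 0x80 if i == 0 else 0
        end = 0x40 if i == len(chunks) - 1 else 0
        out.append(bytes([indicator, start | end | nal_type]) + chunk)
    return out
```

The receiver reverses this: it collects fragments from the start bit to the end bit and restores the original NAL header from the indicator and FU header.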

Aggregation of several audio frames into a single packet is not so widely used, because it adds extra latency, and that makes it less interesting. We are OK with wasting 80% of the traffic as long as we do not increase latency fivefold.


Real-time Transport Control Protocol (RTCP) is the secret spice of RTSP. This protocol looks a bit like RTP (but they are different and cannot be parsed by the same code) and allows bidirectional exchange of real-time statistics:

  • Conversion between RTP timecode and NTP wallclock. Yes, RTSP assumes that you have absolute timestamps for each frame, and that is extremely cool.
  • Exchanging the amounts of bytes and packets sent and received. Some RTSP sources can change their bitrate if they see growing jitter or packet loss on the client, just like WebRTC does today.
  • Requests for packet retransmits.
  • Other cool video delivery things.

This protocol is shared between RTSP, SIP and WebRTC. WebRTC uses the most complicated subset of RTCP features.

RTSP in IP cameras usually uses only the Sender Report (code 200) and Receiver Report (code 201) kinds of packets. Both are mandatory, and usually nothing will work if you ignore them; other kinds are very rarely met.
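The RTP-to-wallclock conversion comes from the Sender Report: it pairs a 64-bit NTP timestamp with the RTP timestamp of the same instant. A sketch of reading those fields from the fixed part of an SR packet (RFC 3550):

```python
import struct

NTP_EPOCH_OFFSET = 2208988800  # seconds between the NTP (1900) and Unix (1970) epochs

def parse_sender_report(packet: bytes) -> dict:
    """Read the header and sender-info block of an RTCP Sender Report."""
    (v_p_rc, pt, length, ssrc,
     ntp_sec, ntp_frac, rtp_ts,
     packet_count, octet_count) = struct.unpack("!BBHIIIIII", packet[:28])
    assert pt == 200, "not a Sender Report"
    return {
        "ssrc": ssrc,
        # absolute wallclock of the moment rtp_ts was sampled
        "unix_time": ntp_sec - NTP_EPOCH_OFFSET + ntp_frac / 2**32,
        "rtp_timestamp": rtp_ts,
        "packets_sent": packet_count,
        "octets_sent": octet_count,
    }
```

Given two such reports, a receiver can map any RTP timecode onto absolute time by linear interpolation.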


The following standards are implemented in Flussonic for RTSP:

| Standard | Purpose |
|---|---|
| RFC2326 | Basic document for RTSP sessions |
| RFC2327, RFC3266, RFC4566 | SDP explanation |
| RFC1889, RFC3550 | RTP |
| RFC3984, RFC6184 | H264 inside RTP |
| RFC3016, RFC6416 | AAC inside RTP |
| RFC2035, RFC2435 | JPEG over RTP |
| RFC7798 | HEVC inside RTP |
| RFC7587 | Opus inside RTP |
| RFC1890, RFC3551 | PCMA inside RTP |

RFC7826, the RTSP/2.0 specification, tries to obsolete RTSP/1.0, but it is too early to speak about it.

DVR support for RTSP

It is possible to play DVR via RTSP. Though we do not recommend using frame-based protocols for DVR (DASH is much better for it), this may be a strict requirement for integration with video surveillance systems that use only RTSP.

Flussonic fully supports DVR playback via RTSP.

It is implemented using the following features:

  • The SDP has a range field that is used to transmit the list of contiguous recording periods.
  • A Range: clock=20230919T084734Z-20230919T084914Z header in the PLAY method.
  • GET_PARAMETER with position in the body returns the current playback timestamp.
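The absolute-time Range header shown above uses the "clock" (UTC) format from RFC 2326, with timestamps written as YYYYMMDDTHHMMSSZ. A small sketch of building it from datetimes:

```python
from datetime import datetime, timezone

def dvr_range(start: datetime, end: datetime) -> str:
    """Format a PLAY Range header for an absolute UTC time interval."""
    fmt = "%Y%m%dT%H%M%SZ"
    return f"Range: clock={start.strftime(fmt)}-{end.strftime(fmt)}"

# the same interval as in the example above
hdr = dvr_range(datetime(2023, 9, 19, 8, 47, 34, tzinfo=timezone.utc),
                datetime(2023, 9, 19, 8, 49, 14, tzinfo=timezone.utc))
```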


Onvif is an HTTP/XML (and a bit of UDP) discovery protocol that helps to configure IP cameras and find RTSP URLs.

Read more about it in the Onvif protocol article.