What is a Video Pipeline?
IT specialists are familiar with the concept of “pipeline”:
- System administrators know how to chain console utilities into a “powerful” one-line command. The output of one command flows into the input of the next, forming a seamless data-processing pipeline.
- Developers assemble CI/CD pipelines that describe how to build an application and deliver it to test and production servers.
- Network specialists start by studying how the TCP/IP and OSI models work, then build their own networks, choosing the equipment, physical links, and protocols. A network is also a “conveyor” for data: it goes in on one side and comes out on the other.
The Wikipedia article on pipelines describes other kinds: in computing, graphics, sales, logistics, industrial production, and even water pipes.
Video transmission is also a conveyor. In our work we use this term all the time: we introduce customers to the “video pipeline” we are building for them.
Where Does the Video Pipeline Start?
Live video transmission resembles a stream of water, or rather a mountain river: the data flow is large, and a lot of processing happens along the way. But when you build the system yourself, you calculate and plan the placement of every node, and the flow starts to look like a real pipeline, with branches, valves, pumps, gauges, reserves, and automation.
The “video conveyor” always begins (and ends) in the same place:
At the input: the camera lens captures the picture.
At the output: the viewer watches it with their own eyes on a device screen.
From the lens to the eye: that is the complete pipeline. The viewer sees and hears the content that has been prepared for them.
The Main Elements of the Video Pipeline
- Capturing raw video
- Video compression
- Packaging into a media container
- Delivery over the network
- Unpacking the media container
- Decompressing the video
- Playback
Let’s look at these steps using the example of an IP camera and the VLC player (a small sketch of the receiving side follows this list):
- The sensor digitizes the signal coming through the lens, turning light into a stream of bytes divided into separate frames. This is what we call raw video: a sequence of uncompressed frames.
- The camera’s processor compresses the video with a codec (H.264 or H.265), which reduces the amount of data to be transmitted over the Internet.
- The camera’s RTSP server packages the H.264 frames into RTP.
- The RTP packets travel over the network on top of TCP or UDP.
- The receiver accumulates a buffer of frames from the RTP stream.
- The player decompresses the video.
- The image is displayed in the player window.
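The two RTP steps in the middle of this list are easy to gloss over: before the player can buffer a frame, it has to pull every packet apart. As a minimal sketch of that work (an illustration, not how VLC is implemented), here is a parser for the fixed RTP header from RFC 3550; it assumes each packet has already been read from a UDP socket and ignores header extensions and padding for brevity.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed 12-byte RTP header (RFC 3550) of one datagram."""
    if len(packet) < 12:
        raise ValueError("too short to be an RTP packet")

    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = b0 & 0x0F               # contributing sources listed after the fixed header
    header_len = 12 + 4 * csrc_count     # extensions and padding are ignored in this sketch
    return {
        "version": b0 >> 6,              # always 2 for RTP
        "marker": (b1 >> 7) & 0x01,      # typically set on the last packet of a video frame
        "payload_type": b1 & 0x7F,       # a dynamic PT mapped to H.264 in the RTSP session's SDP
        "sequence": seq,                 # used to reorder packets and detect loss
        "timestamp": timestamp,          # 90 kHz clock for video
        "ssrc": ssrc,                    # identifies the sending source
        "payload": packet[header_len:],  # the H.264 payload (RFC 6184) itself
    }
```

Packets that share a timestamp are then reassembled into a single H.264 frame and handed to the decoder, which is exactly the buffering and decompression described above.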
I could not show all of these steps with Flussonic Media Server, because playing and displaying content is not among its tasks. The full path through Flussonic is larger and more complicated, which is how it works in real services.
Larger and More Complicated
Playback directly from the camera, as in the example above, is one of the simplest pipelines. Flussonic has other roles: for example, there may be no capture of raw video at all, and instead it is required to receive already compressed video over the network. Broadly, Flussonic’s tasks come down to changing codec parameters, changing container parameters, recording, and multiplexing.
Let’s look at the Flussonic video pipeline using the example of an OTT service, where streams received by the headend are used as the source.
So, at the input we have UDP multicast with interlaced H.264 video and MPEG-2 audio, sometimes AC-3 on HD channels. At the output we need to provide multi-bitrate HLS and DASH.
Here is how it looks within a single server (a rough sketch of the same chain follows the list):
- UDP Capture.
- Unpacking the container: reading MPEG-TS and extracting “frames” (here both audio and video packets can go by that one word).
- Decoding the original codecs (getting raw video and audio).
- Encoding to several qualities (1080p, 720p, 576p, 480p) with the H.264 codec, progressive scan (the input is interlaced, but we don’t need that!). Audio is encoded to AAC.
- Packing the compressed video into MPEG-TS and MP4 containers, segmented strictly along GOP boundaries.
- Generating HLS and DASH manifests (.m3u8 and .mpd).
- Distributing the live content to subscribers over HTTP.
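Flussonic performs this whole chain itself, and its configuration is not shown in this article. As a rough illustration of the same steps (capture the multicast, demux the MPEG-TS, decode, deinterlace, encode renditions, package GOP-aligned HLS, and generate a master manifest), here is a sketch that drives ffmpeg from Python. The multicast address, bitrates, and output paths are made up, only two renditions and only HLS are produced, and the fixed GOP assumes 25 fps input; DASH would be one more packager fed by the same encoded streams.

```python
import os
import subprocess

# Hypothetical multicast source; point this at the real headend output.
SRC = "udp://239.1.1.1:1234"

# ffmpeg's HLS muxer does not create the per-variant directories for us.
for d in ("out_0", "out_1"):
    os.makedirs(d, exist_ok=True)

cmd = [
    "ffmpeg", "-i", SRC,
    # Decode once, deinterlace, then split into two scaled renditions.
    "-filter_complex",
    "[0:v]yadif,split=2[v1][v2];[v1]scale=-2:1080[v1out];[v2]scale=-2:720[v2out]",
    "-map", "[v1out]", "-map", "[v2out]", "-map", "0:a:0", "-map", "0:a:0",
    "-c:v", "libx264", "-preset", "veryfast",
    "-g", "50", "-keyint_min", "50", "-sc_threshold", "0",  # fixed 2 s GOP at 25 fps so segments align
    "-b:v:0", "4500k", "-b:v:1", "2500k",
    "-c:a", "aac", "-b:a", "128k",
    # Package GOP-aligned segments and generate the playlists.
    "-f", "hls", "-hls_time", "6",
    "-master_pl_name", "master.m3u8",
    "-var_stream_map", "v:0,a:0 v:1,a:1",
    "-hls_segment_filename", "out_%v/seg_%05d.ts",
    "out_%v/index.m3u8",
]
subprocess.run(cmd, check=True)
```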
However, in reality it is impossible to do all of this work on one server: there are many channels on one side and many subscribers on the other. So we divide the single server into two, assigning them different roles, “capture and transcoding” and “distribution”:
- UDP Capture.
- Decoding the original codecs (getting raw video and audio).
- Encoding to several qualities (1080p, 720p, 576p, 480p) with the H.264 codec, progressive scan; audio is encoded to AAC.
- Packaging the frames into M4F, a codec-agnostic container for transfer between Flussonic servers.
- Transmitting the video over HTTP.
- Receiving the M4F stream and extracting the frames.
- Packing the compressed video into MPEG-TS and MP4 containers, segmented strictly along GOP boundaries.
- Generating HLS and DASH manifests (.m3u8 and .mpd; a small sketch of a master playlist follows this list).
- Distributing the live content to subscribers over HTTP.
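To make the “manifest” step concrete: an HLS master playlist is just a text file that points players at the per-quality media playlists. The sketch below builds a minimal one; the rendition names, bandwidth figures, and codec strings are hypothetical, not output taken from Flussonic.

```python
# Hypothetical renditions: (directory name, width, height, peak bandwidth in bit/s).
RENDITIONS = [
    ("1080", 1920, 1080, 5_000_000),
    ("720", 1280, 720, 3_000_000),
    ("576", 1024, 576, 1_800_000),
    ("480", 854, 480, 1_200_000),
]

def master_playlist(renditions) -> str:
    """Build a minimal HLS master playlist referencing per-quality media playlists."""
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    for name, width, height, bandwidth in renditions:
        # The codec string is an example (H.264 High profile + AAC-LC), not read from the stream.
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},"
            f"RESOLUTION={width}x{height},"
            f'CODECS="avc1.64001f,mp4a.40.2"'
        )
        lines.append(f"{name}/index.m3u8")  # media playlist that lives next to its segments
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(master_playlist(RENDITIONS), end="")
```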
We can go further and add archive recording. That gives three roles: Ingest+Transcoder, DVR, Edge.
And do not forget that transcoding is a resource-intensive task, so several transcoders are needed. There are also a lot of viewers, so several servers are needed to serve their requests as well. The recording can be done on a single server:
- UDP Capture.
- Decoding the original codecs (getting raw video and audio).
- Encoding to several qualities (1080p, 720p, 576p, 480p) with the H.264 codec, progressive scan (the input is interlaced, but we don’t need that!); audio is encoded to AAC.
- Packaging the frames into M4F, a codec-agnostic container for transfer between Flussonic servers.
- Transmitting the video over HTTP.
- Receiving the M4F stream.
- Writing the data to disk, grouped into hourly intervals (see the sketch after this list).
- Transmitting the video over HTTP.
- Receiving the M4F stream.
- Caching archive requests on SSD.
- Packing the compressed video into MPEG-TS and MP4 containers, segmented strictly along GOP boundaries.
- Generating HLS and DASH manifests (.m3u8 and .mpd).
- Distributing live and DVR content to subscribers over HTTP.
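The DVR role in the middle of this chain boils down to “receive frames, write them to disk, serve them back on request”. The article does not describe Flussonic’s actual on-disk format, so the sketch below only illustrates the idea of grouping recorded segments into hourly directories; the root path and naming scheme are made up.

```python
import os
from datetime import datetime, timezone

ARCHIVE_ROOT = "/storage/dvr"  # hypothetical mount point for the archive disks

def segment_path(stream: str, start_ts: float) -> str:
    """Return a path like <root>/<stream>/<YYYY>/<MM>/<DD>/<HH>/<unixtime>.ts,
    grouping segments into one directory per hour of recording."""
    t = datetime.fromtimestamp(start_ts, tz=timezone.utc)
    hour_dir = os.path.join(
        ARCHIVE_ROOT, stream,
        f"{t.year:04d}", f"{t.month:02d}", f"{t.day:02d}", f"{t.hour:02d}",
    )
    os.makedirs(hour_dir, exist_ok=True)
    return os.path.join(hour_dir, f"{int(start_ts)}.ts")

def write_segment(stream: str, start_ts: float, data: bytes) -> str:
    """Write one GOP-aligned segment into the hourly archive and return its path."""
    path = segment_path(stream, start_ts)
    with open(path, "wb") as f:
        f.write(data)
    return path
```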
As a result, we built a chain of 13 steps (in the example with an IP camera, there were only 7). And this is only a “piece” of the path the video follows: the frames somehow reached Flussonic via satellite, were compressed with a codec, and were packed into a media container, which means there were another 5-15 steps before Flussonic.
Our task is complete: we have multiplied the signal from one coaxial cable out onto the Internet, while adapting it to the available bandwidth and the requirements of the end devices. And the video’s path does not end there: unpacking and decoding on the final device are still ahead, and possibly retransmission further, by other media servers into other networks. Maybe even back into cable.
That is why the video pipeline resembles an industrial pipeline with kilometers of “pipes” and connections: dozens of joints between programs, protocols, physical media, servers, and codecs. The same video goes through several stages of compression, travels thousands of kilometers across different physical media, and changes codecs and containers several times.