Flussonic Data Model
What is media in Flussonic
Media is a stream or a file that Flussonic plays. Each media item has a name and a set of data.
To manage media data conveniently and efficiently, we use a data model that divides media into separate elements. This data model is the same for streams and files. Let us consider the parts of media in Flussonic.
Parts of played media
Each media can be divided into separate tracks that represent video, audio, or text (e.g., subtitles). For example, a movie can contain one video track, three audio tracks (English, German, and Russian), and three corresponding subtitle tracks.
Each track is characterized by its content, i.e., its physical substance (video, audio, or text), and several other parameters. The set of track parameters depends on the content. For example, a video track can have the width and height of the displayed image as well as a frame rate, which is the speed at which a sequence of images is displayed on a screen. An audio track can have other parameters, such as language and sample rate, which is the number of samples taken per second from a continuous signal (from a microphone or another audio source) to produce a discrete, digital signal. Text tracks are very simple and do not have any specific parameters.
Flussonic automatically assigns an identifier to each track, for example, "v1, v2, …" for video tracks, "a1, a2, …" for audio tracks, and "t1, t2, …" for subtitle tracks.
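The track model described above can be sketched in a few lines of Python. This is only an illustration, not Flussonic's internal representation: the `Track` class, the `assign_track_ids` helper, and the parameter names (`width`, `language`, and so on) are hypothetical, and only the "v", "a", "t" prefixes come from the text.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Tuple

# Prefixes follow the identifiers named in the text: v = video, a = audio, t = text.
PREFIX = {"video": "v", "audio": "a", "text": "t"}


@dataclass
class Track:
    track_id: str               # e.g. "v1", "a2", "t1"
    content: str                # "video", "audio", or "text"
    params: Dict[str, object]   # content-specific parameters (hypothetical names)


def assign_track_ids(specs: List[Tuple[str, Dict[str, object]]]) -> List[Track]:
    """Number tracks separately per content type, starting from 1."""
    counters = {content: count(1) for content in PREFIX}
    tracks = []
    for content, params in specs:
        track_id = f"{PREFIX[content]}{next(counters[content])}"
        tracks.append(Track(track_id, content, params))
    return tracks


movie = assign_track_ids([
    ("video", {"width": 1920, "height": 1080, "fps": 25}),
    ("audio", {"language": "eng", "sample_rate": 48000}),
    ("audio", {"language": "ger", "sample_rate": 48000}),
    ("text", {}),  # text tracks carry no specific parameters
])
print([t.track_id for t in movie])  # ['v1', 'a1', 'a2', 't1']
```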
Each track, regardless of its content, can be divided into frames.
A frame is the smallest unit of a track. A frame can be part of a video, audio, or text track. For a video track, a frame is one of the many still images that make up the complete moving picture. Each frame has a start time and a duration. Frame duration has a different meaning for audio and video.
For an audio track, frame duration depends on the sample rate. For example, CDs are usually recorded at 44.1 kHz, which means that 44,100 samples are taken every second. In this case, 1/44100 of a second (about 22.7 microseconds) can be considered the duration of an audio frame.
For a video track, frame duration is the time between the beginning of a frame and the beginning of the next frame. This parameter is important for some protocols. Normally, frame duration equals the difference between the timestamps (start times) of two adjacent frames. However, the video can sometimes be interrupted, for example when the connection is broken. As a result, the delta between two consecutive frame timestamps is no longer equal to the frame duration. This situation is called a frame gap, and different protocols handle it differently. For example, HLS continues playback, whereas DASH stops playback and starts a new period.
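A frame gap as described above can be detected by comparing each frame's timestamp with the previous frame's start time plus its duration. The sketch below assumes frames sorted by start time and millisecond timestamps. The `Frame` class, the `find_gaps` function, and the `tolerance_ms` parameter are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    start_ms: int      # timestamp (start time) of the frame
    duration_ms: int   # nominal frame duration


def find_gaps(frames: List[Frame], tolerance_ms: int = 1) -> List[Tuple[int, int]]:
    """Return (expected_start_ms, actual_start_ms) for every gap between neighbours."""
    gaps = []
    for prev, cur in zip(frames, frames[1:]):
        expected = prev.start_ms + prev.duration_ms
        if cur.start_ms - expected > tolerance_ms:
            gaps.append((expected, cur.start_ms))
    return gaps


# 25 fps video (40 ms per frame) with a break after the third frame.
frames = [Frame(0, 40), Frame(40, 40), Frame(80, 40), Frame(400, 40), Frame(440, 40)]
print(find_gaps(frames))  # [(120, 400)]
```

A packager could use such a list to decide where to signal a discontinuity or start a new period, depending on the protocol.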
An important feature of the Flussonic data model is that frames never overlap each other. Overlapping frames can cause problems such as overlapping subtitles. Flussonic avoids this problem because a frame cannot start earlier than the start of the previous frame plus that frame's duration.
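The no-overlap rule can be written as a simple invariant: a frame's start time must not be less than the previous frame's start time plus its duration. The sketch below checks the invariant and, as one possible way to restore it, shortens the earlier frame. Trimming is an assumption made for this example, not a description of how Flussonic enforces the rule, and `Frame`, `violates_invariant`, and `trim_overlaps` are hypothetical names. Frames are assumed to be sorted by start time.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Frame:
    start_ms: int
    duration_ms: int


def violates_invariant(prev: Frame, cur: Frame) -> bool:
    """True if cur starts before prev has ended."""
    return cur.start_ms < prev.start_ms + prev.duration_ms


def trim_overlaps(frames: List[Frame]) -> List[Frame]:
    """Shorten any frame that would run into the next one (input sorted by start)."""
    out: List[Frame] = []
    for cur in frames:
        if out and violates_invariant(out[-1], cur):
            prev = out[-1]
            out[-1] = replace(prev, duration_ms=cur.start_ms - prev.start_ms)
        out.append(cur)
    return out


# Two subtitle frames: the first is meant to stay for 3 s, the second starts at 2.5 s.
subtitles = [Frame(0, 3000), Frame(2500, 2000)]
print(trim_overlaps(subtitles))
# [Frame(start_ms=0, duration_ms=2500), Frame(start_ms=2500, duration_ms=2000)]
```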
To deliver video over the internet with limited bandwidth, it is often necessary to compress the video. Besides compressing individual frames, there is a more advanced technique called interframe compression. It works by sending full frames (referred to as keyframes) and then sending only the differences between the keyframe and the subsequent frames. The receiver (decoder) uses the keyframe plus these differences to re-create each frame with reasonable accuracy.
For interframe compression purposes, frames in a track are grouped into GOPs. A GOP (group of pictures) is a structured group of successive frames in a video stream or file. Each GOP consists of an I-frame (keyframe) followed by P-frames and B-frames:
- An I-frame (keyframe) is the first frame in a GOP. It is a full image encoded independently of other frames (it has no links to them). Each GOP has a keyframe at the start.
- A P-frame contains the difference between the current frame and an earlier frame. It is encoded with a link to a preceding I-frame or P-frame.
- A B-frame contains links to I-frames and P-frames both before and after itself.
A typical GOP contains a repeating pattern of B- and P-frames following the keyframe. An example of a typical pattern might be the following:
I B B P B B P B B P B B
Ideally, keyframes should be placed where a scene changes (the so-called scene detection method). However, most video processing programs are configured to work with GOPs of equal size. Therefore, equal-sized GOPs are used in most situations; for example, the TV standard is 28 frames in a GOP.
It is important to understand that a GOP without a keyframe makes no sense. Thus, it is impossible to start playing video from the middle of a GOP.
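The grouping rule can be illustrated with a short sketch that splits a sequence of frame types into GOPs and finds where playback can actually begin for a requested frame. It is a simplification: it works on frame types only and ignores the difference between decode order and display order. The function names `split_into_gops` and `playback_start` are hypothetical.

```python
from typing import List


def split_into_gops(frame_types: List[str]) -> List[List[str]]:
    """Start a new GOP at every I-frame; append other frames to the current GOP."""
    gops: List[List[str]] = []
    for frame_type in frame_types:
        if frame_type == "I":
            gops.append([frame_type])
        elif gops:
            gops[-1].append(frame_type)
        # Frames seen before the first keyframe are skipped: they cannot be decoded.
    return gops


def playback_start(frame_types: List[str], wanted_index: int) -> int:
    """Return the index of the nearest keyframe at or before wanted_index."""
    for index in range(wanted_index, -1, -1):
        if frame_types[index] == "I":
            return index
    raise ValueError("no keyframe at or before the requested frame")


stream = "P B I B B P I B B P".split()
print(split_into_gops(stream))    # [['I', 'B', 'B', 'P'], ['I', 'B', 'B', 'P']]
print(playback_start(stream, 4))  # 2
print(playback_start(stream, 8))  # 6
```

Requesting index 1 in this stream raises `ValueError`, because the frames before the first keyframe have nothing to be decoded against.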
Grouping into GOPs is applicable to video frames only. Corresponding audio and text subtitle frames are added to GOPs synchronously.
What would be the optimal GOP length?
Why should a GOP not be too long? Because a longer GOP can increase zap time, which is the time from the moment the viewer changes the channel with a remote control to the moment the picture of the new channel is displayed. If a viewer clicks the remote control before the previous GOP has finished, they see an outdated picture. This problem may be critical for video games or video calls.
To solve this problem, Flussonic uses the prepush feature: it saves each GOP in a buffer before sending it to a client. When a client connects to the server, the server sends the first GOP from the buffer and then transmits the stream with a timeshift, that is, delivery lags by a time interval equal to the duration of one GOP. When the connection with the server breaks or slows down, the client plays a GOP from the buffer. In this way, video is played more smoothly, although latency may grow.
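The buffering idea can be sketched as a small class that always keeps the most recent complete GOP, so a newly connected client can be handed data that starts with a keyframe. This is a loose illustration of the concept only: the `PrepushBuffer` class and its methods are hypothetical, and the sketch models neither the timeshifted delivery that follows nor Flussonic's actual implementation.

```python
from typing import List, Tuple

Frame = Tuple[str, int]  # (frame type, timestamp in ms)


class PrepushBuffer:
    def __init__(self) -> None:
        self._last_gop: List[Frame] = []  # most recent complete GOP
        self._current: List[Frame] = []   # GOP that is still being received

    def push(self, frame: Frame) -> None:
        """Feed frames in stream order; a keyframe closes the GOP in progress."""
        is_keyframe = frame[0] == "I"
        if is_keyframe and self._current:
            self._last_gop = self._current
            self._current = []
        if is_keyframe or self._current:
            self._current.append(frame)
        # Frames arriving before the first keyframe are dropped.

    def on_connect(self) -> List[Frame]:
        """Frames sent first to a new client: a whole GOP starting with a keyframe."""
        return list(self._last_gop)


buf = PrepushBuffer()
for f in [("P", 0), ("I", 40), ("P", 80), ("P", 120), ("I", 160), ("P", 200)]:
    buf.push(f)
print(buf.on_connect())  # [('I', 40), ('P', 80), ('P', 120)]
```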
Why should a GOP not be too short? Because longer GOPs provide better compression.
Different applications use different GOP lengths, but typically these lengths are in the range of 0.5 to 2 seconds.
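GOP length in seconds is the number of frames in the GOP divided by the frame rate. The short sketch below converts in both directions. The frame rates of 25 and 30 fps are assumptions chosen for the example, and the function names are hypothetical.

```python
def gop_duration_s(frames_per_gop: int, fps: float) -> float:
    """Duration of one GOP in seconds."""
    return frames_per_gop / fps


def frames_for_duration(target_s: float, fps: float) -> int:
    """Number of frames per GOP needed to reach a target GOP duration."""
    return round(target_s * fps)


print(gop_duration_s(28, 25))        # 1.12
print(frames_for_duration(2.0, 25))  # 50
print(frames_for_duration(0.5, 30))  # 15
```

With these assumed frame rates, a 28-frame GOP at 25 fps lasts 1.12 seconds, which falls inside the 0.5 to 2 second range mentioned above.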
In some cases, video can be compressed even further by using so-called open GOPs. An open GOP contains P-frames that refer to frames before the keyframe. This lowers the bitrate by 5-7%. However, open GOPs may cause problems when it comes to using segments.
A segment is the next-level element in our data model. It contains one or more GOPs with the corresponding audio and text frames (synchronized with the video frames). Segments are necessary for some protocols, such as HLS and DASH.
Sometimes a segment matches a single GOP, sometimes it does not. However, a segment always consists of a whole number of GOPs and starts with a keyframe.
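Because a segment is built from whole GOPs, segmentation can be sketched as grouping consecutive GOPs until a target duration is reached. The code below is a minimal sketch under that assumption. The `build_segments` function and its `target_s` parameter are hypothetical and do not describe how Flussonic chooses segment boundaries.

```python
from typing import List


def build_segments(gop_durations_s: List[float], target_s: float) -> List[List[int]]:
    """Group consecutive GOP indices into segments of at least target_s seconds."""
    segments: List[List[int]] = []
    current: List[int] = []
    elapsed = 0.0
    for index, duration in enumerate(gop_durations_s):
        current.append(index)      # a GOP is never split between segments
        elapsed += duration
        if elapsed >= target_s:
            segments.append(current)
            current, elapsed = [], 0.0
    if current:                    # trailing GOPs form a shorter last segment
        segments.append(current)
    return segments


gops = [2.0, 2.0, 2.0, 2.0, 2.0]   # five GOPs of 2 seconds each
print(build_segments(gops, 4.0))   # [[0, 1], [2, 3], [4]]
print(build_segments(gops, 2.0))   # [[0], [1], [2], [3], [4]]
```

The second call shows the case where each segment matches exactly one GOP. In both cases every segment begins at the start of a GOP, and therefore with a keyframe.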
An important feature of the transcoder in Flussonic is that the segments of all video tracks are always synchronized. When video is encoded with a less capable transcoder, one video track may already have started playing a new segment while another video track is still playing the previous segment. In this case it may be difficult for a player to switch to another video track, so this situation is unacceptable for multi-bitrate streaming. In Flussonic, every segment has the same size and the same timestamps as the corresponding segment in every other video track. That is why all video tracks are played synchronously.
Please note that audio frames have a different duration from video frames, so when a new segment starts playing, audio frames may still be added to the previous segment. This is normal.
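The synchronization requirement can be expressed as a check that every video track has identical segment boundaries. The sketch below is a hypothetical validator, not part of Flussonic: the `segments_aligned` function and the (start, duration) representation of a segment are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # (start_ms, duration_ms)


def segments_aligned(tracks: Dict[str, List[Segment]]) -> bool:
    """True if all video tracks share the same segment start times and durations."""
    boundaries = list(tracks.values())
    return all(other == boundaries[0] for other in boundaries[1:])


good = {
    "v1": [(0, 4000), (4000, 4000), (8000, 4000)],
    "v2": [(0, 4000), (4000, 4000), (8000, 4000)],
}
bad = {
    "v1": [(0, 4000), (4000, 4000), (8000, 4000)],
    "v2": [(0, 4200), (4200, 3800), (8000, 4000)],  # second segment starts late
}
print(segments_aligned(good))  # True
print(segments_aligned(bad))   # False
```

In the `bad` example a player switching from v1 to v2 at the 4000 ms boundary would land in the middle of a v2 segment, which is the situation the text calls unacceptable for multi-bitrate streaming.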
We store segments in isolation from each other. This can sometimes cause problems with open GOPs, because a P-frame cannot refer to frames from the previous segment. In this case the picture may occasionally break up, and it is necessary to wait for the next segment for the picture quality to recover.
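The open GOP problem can be illustrated by listing the frames in a segment whose references point before the segment's first frame. The sketch below uses a hypothetical representation in which each frame index maps to the indices of the frames it refers to. The `broken_references` function is an assumption made for illustration and does not reflect how any real decoder tracks references.

```python
from typing import Dict, List


def broken_references(refs: Dict[int, List[int]], segment_start: int) -> List[int]:
    """Frames at or after segment_start that refer to a frame before it."""
    return sorted(
        frame
        for frame, targets in refs.items()
        if frame >= segment_start and any(t < segment_start for t in targets)
    )


# A segment that starts with the keyframe at index 12.
# Frame 13 also refers to frame 10, which belongs to the previous segment.
refs = {12: [], 13: [10, 12], 14: [12], 15: [12, 14]}
print(broken_references(refs, segment_start=12))  # [13]
```

With closed GOPs the returned list is empty for every segment, which is why isolated segment storage causes no picture breakup in that case.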