Server Stats Dashboard

The Server Stats dashboard provides high-level monitoring of server load metrics across your entire streaming infrastructure.

Its overview section is similar to standard server monitoring tools. The important difference is that it's already included in the service, so you don't need to install and maintain additional monitoring agents.

Most Overloaded Servers

This panel displays servers with the highest resource utilization across key measured parameters.

Displayed Information:

Servers with maximum load on at least one parameter are raised to the top
Load distribution over time. Try changing the time range: narrowing it down to hours or minutes can bring servers with recent issues to the top

Use Cases:

Select the last 2-3 hours in the dashboard, see servers with red indicators, proceed to resolve the issue
Select the last 7 days in the dashboard, see periodic load spikes up to red levels. Identify the source of the periodic growth and, if it shouldn't be there, resolve the issue
Select the last 30 days in the dashboard, see steady load growth, start planning infrastructure expansion
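
The three use cases above differ mainly in the time window and in the pattern you look for. As a rough illustration only, the sketch below classifies a window of load samples exported from such a graph. The function name, the "red" level of 90%, and the growth threshold are assumptions made for this example, not values taken from the dashboard.

```python
from statistics import mean

RED_LEVEL = 90.0  # assumed "red" threshold in percent (hypothetical)


def classify_load(samples, red_level=RED_LEVEL, min_growth=5.0):
    """Classify a window of load samples (percent, oldest first)."""
    if not samples:
        return "no data"
    # Every sample is red: the server is overloaded right now.
    if all(s >= red_level for s in samples):
        return "sustained overload"
    # The second half of the window is clearly higher than the first.
    half = len(samples) // 2
    if half and mean(samples[half:]) - mean(samples[:half]) >= min_growth:
        return "steady growth"
    # Only some samples are red: load spikes that come and go.
    if any(s >= red_level for s in samples):
        return "periodic spikes"
    return "normal"


print(classify_load([40, 42, 95, 41, 43, 96, 40, 41]))
```

A short window (hours) is the natural input for the first case, a week for the second, and a month for the third.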

Overloaded Service Example

On this service, CPU and load are both in the yellow zone. This can be considered efficient server utilization, because the CPU graph shows no signs of plateauing. However, these servers cannot handle any additional load.

Set up alerts: To avoid missing critical server load, configure CPU load alerts and scheduler load alerts. They will warn you before the server stops handling the load.

CPU and Virtual Machine Load

The following two panels provide an overview of central processor and virtual machine load. These are different metrics - you cannot rely on just one.

A common mistake among system administrators is to make decisions without understanding how the virtual machine operates, and to complain about CPU growth without looking at the scheduler.

A CPU load of 80% is elevated but not critical. Assess the streaming server's health by scheduler load, which is a more reliable metric. The virtual machine may legitimately run at high CPU load, and in special cases several CPU cores at a full 100% load is normal.

Important: you cannot extrapolate CPU load by adding a few streams and assuming linear growth. The internal mechanisms that allow a streaming server to scale to thousands of simultaneous streams have certain costs, so growth will be non-linear.
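
The guidance on reading CPU and scheduler load together can be sketched as a small decision function. This is only an illustration: the 80% CPU figure comes from the text above, while the function name and the scheduler limit of 90% are assumptions chosen for the example.

```python
def assess_server(cpu_percent, scheduler_percent, scheduler_limit=90.0):
    """Combine CPU and scheduler load into a coarse status.

    scheduler_limit is a hypothetical threshold, not a documented value.
    """
    # Scheduler load decides whether the server is actually in trouble.
    if scheduler_percent >= scheduler_limit:
        return "critical"
    # High CPU alone is elevated, but not critical.
    if cpu_percent >= 80.0:
        return "elevated"
    return "ok"


print(assess_server(cpu_percent=100.0, scheduler_percent=50.0))
```

Note that a server with fully loaded CPU cores but a moderate scheduler load is reported as elevated rather than critical, which mirrors the point that CPU alone is not a reliable indicator.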

This graph and the ones that follow are broken down by version. This makes it easy to tell whether there was actual performance degradation or only the appearance of one.

Critical Load Example

On these two graphs, CPU is at the limit (this is the service mentioned above) and may be hitting a plateau, meaning it is not coping. The scheduler graph, however, shows no plateau: the server is simply at its limit. As you'll see later, though, its disks are not coping.

Set up alerts: To avoid missing critical server load, configure CPU load alerts and scheduler load alerts. They will warn you before the server stops handling the load.

Memory Utilization

RAM usage monitoring on servers.

Shows total memory usage - this is the graph to focus on.

In a normal state, the graph is stable.

The graph doesn't show swap usage for one simple reason: swap should be disabled on a streaming server. It's not needed in any scenario, and it can leave the system in a prolonged failure state instead of an emergency shutdown, resulting in tens of minutes of downtime instead of a few.
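
As a quick way to verify the swap recommendation, the sketch below checks the `SwapTotal` field of `/proc/meminfo`. It assumes a Linux host; the helper names are made up for this example.

```python
def swap_total_kb(meminfo_text):
    """Return SwapTotal in kB from /proc/meminfo text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line looks like: "SwapTotal:       2097148 kB"
            return int(line.split()[1])
    return None


def swap_is_disabled(meminfo_text):
    """True only when SwapTotal is present and equal to zero."""
    return swap_total_kb(meminfo_text) == 0


sample = "MemTotal:       16384000 kB\nSwapTotal:             0 kB\n"
print(swap_is_disabled(sample))
```

On a real server you would pass in the contents of the file, for example `swap_is_disabled(open("/proc/meminfo").read())`.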

Set up alert: Configure a low memory alert to get notifications when approaching RAM limits and take timely action.

Disk Write Errors

Tracking disk write errors.

Displayed Metrics:

Collapsed writes to disk
Failed write attempts

Storage can begin to lag in writing. This is normal behavior for network storage, which very often cannot deliver stable and predictable write speeds for weeks on end. When it happens, the media server first starts grouping adjacent writes. Each such grouping is reflected in the Collapsed writes graph and is a warning sign. It shouldn't happen, but it's not a problem yet.

Typically, if collapsed writes are ignored for a long time, write failures (failed writes) follow. The media server cannot keep segments in the write queue forever. It stores them only as long as they are available for live viewing, so when the playback window passes and the write hasn't occurred, the video is lost irretrievably.

This is already a serious service failure.
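
To make the relationship between collapsed and failed writes concrete, here is a toy model. It is a simplification written for this page, not the media server's actual algorithm: one segment is produced per tick, pending segments are grouped into a single write when storage becomes available, and segments older than an assumed playback window are dropped.

```python
from collections import deque


def simulate_writes(storage_ready, window=3):
    """Toy model of a write queue.

    storage_ready: one bool per tick, True when storage accepts a write.
    window: hypothetical number of ticks a segment stays in the queue.
    Returns (collapsed_writes, failed_writes).
    """
    queue = deque()  # ticks at which the pending segments were produced
    collapsed = failed = 0
    for tick, ready in enumerate(storage_ready):
        queue.append(tick)
        # Segments that fell out of the playback window are lost.
        while queue and tick - queue[0] >= window:
            queue.popleft()
            failed += 1
        if ready:
            # Several pending segments go to disk as one grouped write.
            if len(queue) > 1:
                collapsed += 1
            queue.clear()
    return collapsed, failed


print(simulate_writes([True, False, True, True]))          # short stall
print(simulate_writes([False] * 5 + [True], window=3))     # long stall
```

In this model a short stall produces only a collapsed write, while a stall longer than the window produces failed writes as well, which is the progression described above.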

Use Cases:

Detect failing storage hardware
Identify filesystem issues
Plan disk replacement
Monitor storage reliability

Set up alert: The disk issues alert is critical for servers with DVR. It will warn you when collapsed writes appear, before failed writes begin and video is lost.

Disk Utilization

Monitoring disk operation speed allows you to see potential problems on individual disks.

This section is worth visiting if there are signs of disk write or read errors. Otherwise it is informational: check once a month that there are no anomalies.

Disk I/O Percentage

Should normally stay at or below 80%. Individual spikes are acceptable and not critical, as long as they don't lead to service errors.
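
One simple way to tell individual spikes from a sustained overload is to look at the longest consecutive run of samples above the limit. The sketch below does that for a list of I/O percentages; the function name is hypothetical and the 80% default comes from the guideline above.

```python
def longest_run_above(samples, limit=80.0):
    """Length of the longest consecutive run of samples above limit."""
    longest = current = 0
    for value in samples:
        # Extend the current run, or reset it when the value drops back.
        current = current + 1 if value > limit else 0
        longest = max(longest, current)
    return longest


print(longest_run_above([50, 85, 60, 90, 91, 92, 70]))
```

A run of one or two samples is a spike. How long a run has to be before it counts as a problem depends on your sampling interval.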

Disk Fill Level

Should be stable around the limit. 98% is normal if your storage uses ext4. If you chose btrfs, service failure is possible at 55%, and we won't be able to help you.
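
For a quick spot check outside the dashboard, the fill level of a mounted filesystem can be read with Python's standard `shutil.disk_usage`. This is a minimal sketch; the function name and the example path are assumptions, and the value may differ slightly from what other tools report because of how reserved blocks are counted.

```python
import shutil


def fill_percent(path="/"):
    """Used space on the filesystem containing path, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


print(f"{fill_percent('.'):.1f}% used")
```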

Disk Write Speed

There's no single norm - it differs by orders of magnitude for spinning disks and NVMe. Sharp jumps or plateaus are of interest.

Disk Read Speed

Similar to the previous: there's no single norm. Pay attention to sharp changes, as well as trends over months.

Example of Poor Disk Choice

From the graphs above, you can draw the following conclusions:

The purchased disks are far too large and their capacity is wasted - they cannot be filled with data
Some disks are overloaded with writes. It might make sense to contact support for help finding a strategy for more even disk load distribution
Flussonic RAID succeeds in protecting adjacent disks from overload. Despite one failing disk, the rest handle the workload

Streams and Clients

Monitoring of active streams and connected clients. This is an overview graph; a more detailed picture is available on adjacent dashboards.

Number of Streams

Monitor this during upgrades and make sure the picture doesn't change after the upgrade.

Number of Clients

Monitor this during upgrades alongside load balancer operation, and make sure the number returns to the pre-upgrade level.
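
The upgrade checks for streams and clients can be expressed as two small comparisons. This is a sketch under assumptions: the function names and the 5% tolerance are invented for the example, and the counts would come from the graphs before and after the upgrade.

```python
def streams_unchanged(before, after, tolerance=0.05):
    """True if the stream count stayed within tolerance of the old value."""
    if before == 0:
        return after == 0
    return abs(after - before) / before <= tolerance


def clients_recovered(before, after, tolerance=0.05):
    """True if the client count is back to (or above) the old level."""
    return after >= before * (1.0 - tolerance)


print(streams_unchanged(200, 198), clients_recovered(1000, 700))
```

The two checks differ on purpose: more clients than before is fine, while a stream count that moved in either direction deserves a look.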

3-Month Retrospective

On this server, the number of streams didn't change sharply over 3 months - streams are being migrated smoothly to new servers.

Network Traffic

Monitoring of incoming and outgoing network traffic.

Incoming Traffic to Media Server

Shows how much incoming traffic the media server itself sees. It can differ radically from the system graph if there's capture from ASI, SDI, or loopback.

Incoming System Traffic

Shows how much traffic enters the operating system. It can differ noticeably from the media server graph, and can be the cause of network failures, if there are other traffic consumers besides the media server.
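
To cross-check the system-level numbers by hand on a Linux host, the per-interface byte counters can be read from `/proc/net/dev`. The parser below is a minimal sketch with a hypothetical function name. It returns cumulative counters, so a traffic rate requires two readings and a division by the interval between them.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev text into {interface: (rx_bytes, tx_bytes)}."""
    counters = {}
    # The first two lines are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


sample = (
    "Inter-|   Receive                        |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
    "    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0\n"
    "  eth0: 5000 50 0 0 0 0 0 0 3000 30 0 0 0 0 0 0\n"
)
print(parse_net_dev(sample))
```

On a real server, pass in `open("/proc/net/dev").read()`.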

Outgoing Media Server Traffic

Outgoing System Traffic

GPU Status

Monitoring GPU load and status for transcoding and video analytics.

The GPU utilization graph is a general aggregate. It works as a general reference, but it's worth reviewing the encoder/decoder graphs as well.

GPU cooling is mandatory. If you let the temperature reach 90 °C, the consequences can be very difficult to diagnose.
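
For an ad-hoc temperature check, the sketch below assumes NVIDIA GPUs with the `nvidia-smi` tool installed, which is an assumption of this example rather than something stated on this page. The helper names are hypothetical, and the 90-degree limit is the figure mentioned above.

```python
import subprocess


def parse_temperatures(output):
    """Turn nvidia-smi output (one integer per line) into a list of ints."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def read_gpu_temperatures():
    """Query GPU temperatures in degrees Celsius (requires nvidia-smi)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


def overheated(temperatures, limit=90):
    """Indexes of GPUs whose temperature is at or above limit."""
    return [i for i, t in enumerate(temperatures) if t >= limit]


print(overheated(parse_temperatures("65\n91\n")))
```

In practice you would alert well below the limit, so that there is time to react before the GPU gets there.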