Server Stats Dashboard¶
The Server Stats dashboard provides high-level monitoring of server load metrics across your entire streaming infrastructure.
Its overview section resembles standard server monitoring tools, but the key advantage is that it is already included in the service, so you don't need to install and maintain additional monitoring agents.
Most Overloaded Servers¶
This panel displays servers with the highest resource utilization across key measured parameters.
Displayed Information:
- Servers with maximum load on at least one parameter are raised to the top
- Load distribution over time. Try changing the time range: narrowing it down to hours or minutes may bring servers with recent issues to the top
Use Cases:
- Select the last 2-3 hours in the dashboard, see servers with red indicators, proceed to resolve the issue
- Select the last 7 days in the dashboard, see periodic load spikes up to red levels. Identify the source of periodic growth and if it shouldn't exist, resolve the issue
- Select the last 30 days in the dashboard, see steady load growth, start planning infrastructure expansion
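The 30-day use case above amounts to fitting a trend line and asking when it reaches a level you consider too high. The following is a minimal sketch of that reasoning, assuming you have exported one load sample (in percent) per day; the function name and the 80% threshold are illustrative choices, not values defined by the dashboard.

```python
def days_until_threshold(daily_load, threshold=80.0):
    """Fit a least-squares line to daily load samples (percent) and
    estimate how many days remain until that line reaches `threshold`.

    Returns None when there are fewer than two samples or when the
    trend is flat or falling. The threshold is a hypothetical value.
    """
    n = len(daily_load)
    if n < 2:
        return None
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_load) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_load))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Day index at which the fitted line reaches the threshold.
    crossing = (threshold - intercept) / slope
    # Days remaining, counted from the last sample; 0 if already past it.
    return max(0.0, crossing - (n - 1))
```

A straight-line fit is only a rough planning aid: periodic spikes, like those in the 7-day use case, will distort it, so it is best applied to daily averages or daily peaks.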
Overloaded Service Example
On this service, both CPU and load are in the yellow zone. This can be considered efficient server utilization because the CPU graph shows no signs of plateauing. However, these servers cannot handle any additional load.
CPU and Virtual Machine Load¶
The following two panels provide an overview of central processor and virtual machine load. These are different metrics - you cannot rely on just one.
A common mistake among system administrators is to make decisions without understanding how the virtual machine operates, and to complain about CPU growth without looking at the scheduler.
A CPU load of 80% is elevated but not critical. The health of the streaming server should be assessed by scheduler load, which is the more reliable metric. The virtual machine may legitimately produce high CPU load, and in special cases 100% load on several CPU cores is normal.
Important: you cannot extrapolate CPU load by adding several streams and expect linear growth. The internal mechanisms that allow scaling a streaming server to thousands of simultaneous streams have certain costs, so growth will be non-linear.
This graph and the ones that follow are broken down by server version. This makes it easy to tell whether an upgrade caused real performance degradation or only an apparent one.
Critical Load Example
On these two graphs, CPU is at its limit (this is the service mentioned above) and may be hitting a plateau, which would mean the server is not coping. The scheduler graph, however, shows no plateau - the server is simply at its limit. You'll see later, though, that its disks are not coping.
Memory Utilization¶
Monitors RAM usage on servers.
The graph shows total memory usage - this is the one to focus on.
In normal operation the graph is stable.
The graph doesn't show swap usage for one simple reason: swap should be disabled on a streaming server. It is not needed in any scenario, and it can make the system degrade into a prolonged failure instead of shutting down quickly in an emergency, resulting in dozens of minutes of downtime instead of a few.
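On a Linux host, one way to confirm that swap is really off is to read the SwapTotal field from /proc/meminfo. The sketch below assumes a Linux server; the function names are illustrative and are not part of any Flussonic tooling.

```python
def swap_total_kb(meminfo_text):
    """Return SwapTotal in kB from /proc/meminfo-style text,
    or None if the field is not present."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Line format: "SwapTotal:       2097148 kB"
            return int(line.split()[1])
    return None


def swap_is_disabled(meminfo_text):
    """True only when SwapTotal is present and equal to zero."""
    return swap_total_kb(meminfo_text) == 0


def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory summary (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a server you would call `swap_is_disabled(read_meminfo())`; a result of `False` means either that swap is configured or that the field could not be found.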
Disk Write Errors¶
Tracking disk write errors.
Displayed Metrics:
- Collapsed writes to disk
- Failed write attempts
When storage starts to lag on writes (normal behavior for network storage, which often cannot sustain stable and predictable write speeds for weeks on end), the media server first begins grouping adjacent writes. Each such grouping appears on the Collapsed writes graph and is a warning sign: it shouldn't happen, but it's not a failure yet.
If collapsed writes are ignored for a long time, write failures typically follow, shown as failed writes. The media server cannot keep segments in the write queue forever - it holds them only as long as they are available for live viewing, so if the playback window passes before the write occurs, the video is lost irretrievably.
This is already a serious service failure.
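The escalation described above, from collapsed writes to failed writes, maps naturally onto a two-level alert. A minimal sketch, assuming you can obtain how much each counter grew over the selected time range; the function name, parameter names, and severity labels are hypothetical.

```python
def write_health(collapsed_delta, failed_delta):
    """Classify storage write health from the increase of the two
    counters over the selected interval.

    Parameter names and severity labels are illustrative only.
    """
    if failed_delta > 0:
        # Segments left the live window before being written: video is lost.
        return "critical"
    if collapsed_delta > 0:
        # Storage is lagging and adjacent writes are being grouped.
        return "warning"
    return "ok"
```

Treating any collapsed write as a warning follows the text above: it is not yet a failure, but it is the point at which to investigate the storage.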
Use Cases:
- Detect failing storage hardware
- Identify filesystem issues
- Plan disk replacement
- Monitor storage reliability
Disk Utilization¶
Monitoring disk operation speed allows you to see potential problems on individual disks.
This section is worth visiting when there are signs of disk write or read errors. Otherwise it is informational in nature - check once a month that there are no anomalies.
Disk I/O Percentage
Should normally stay at or below 80%. Individual spikes are acceptable and not critical as long as they don't cause service errors.
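The difference between an acceptable spike and a real problem is duration. A minimal sketch of that distinction, assuming evenly spaced I/O utilization samples in percent; the `min_run` parameter is an assumption introduced here, not a value taken from the dashboard.

```python
def sustained_overload(io_percent, limit=80.0, min_run=5):
    """Return True if at least `min_run` consecutive samples exceed `limit`.

    Isolated spikes shorter than `min_run` samples are ignored.
    `min_run` is a hypothetical tuning knob.
    """
    run = 0
    for value in io_percent:
        # Extend the current run above the limit, or reset it.
        run = run + 1 if value > limit else 0
        if run >= min_run:
            return True
    return False
```

For example, with one sample per minute and `min_run=5`, a disk has to stay above 80% for five minutes in a row before it is flagged.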
Disk Fill Level
Should stay stable near the limit. 98% is normal if your storage uses ext4. If you chose btrfs, service failure is possible at 55%, but we won't be able to help you.
Disk Write Speed
There's no single norm - it differs by orders of magnitude between spinning disks and NVMe. Look for sharp jumps or plateaus.
Disk Read Speed
Similar to the previous: there's no single norm. Pay attention to sharp changes, as well as trends over months.
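Because there is no absolute norm for read or write speed, a relative check is more useful than a fixed threshold. The sketch below flags points where throughput changes by a given factor from one sample to the next; the function name and the default factor of 2 are assumptions chosen for illustration.

```python
def sharp_changes(samples, ratio=2.0):
    """Return indices where a sample differs from the previous one
    by at least `ratio` times, in either direction.

    `ratio` is a hypothetical sensitivity setting.
    """
    jumps = []
    for i in range(1, len(samples)):
        low = min(samples[i - 1], samples[i])
        high = max(samples[i - 1], samples[i])
        # A move from zero to any positive value also counts as a jump.
        if high > 0 and high >= ratio * low:
            jumps.append(i)
    return jumps
```

The same function works for spinning disks and NVMe alike, since it compares each disk only with its own recent history.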
Example of Poor Disk Choice
From the graphs above, you can draw the following conclusions:
- The disks are oversized and their capacity is wasted - they cannot be filled with data
- Some disks are overloaded with writes. It may make sense to contact support for help finding a strategy for more even load distribution across disks
- Flussonic RAID successfully protects adjacent disks from overload. Despite one failing disk, the rest handle the workload
Streams and Clients¶
Monitoring of active streams and connected clients. This is an overview graph; a more detailed picture is available on adjacent dashboards.
Number of Streams
Monitor this during upgrades and make sure the picture doesn't change after the upgrade.
Number of Clients
Monitor this during upgrades together with load balancer operation, and make sure the number returns to its pre-upgrade level.
3-Month Retrospective
On this server, the number of streams did not change sharply over 3 months - streams are being migrated smoothly to new servers.
Network Traffic¶
Monitoring of incoming and outgoing network traffic.
Incoming Traffic to Media Server
Shows how much incoming traffic the media server itself sees. This can sometimes differ radically from the system graph if streams are captured from ASI, SDI, or loopback.
Incoming System Traffic
Shows how much traffic enters the operating system. It can differ noticeably from the media server figure, and can be the cause of network failures, if there are other traffic consumers besides the media server.
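Comparing the two incoming graphs gives a rough estimate of how much system traffic belongs to consumers other than the media server. A minimal sketch, assuming both values are read for the same moment and in the same units; the function name is hypothetical.

```python
def other_traffic_share(system_in, media_in):
    """Estimate the fraction of incoming system traffic that is not
    seen by the media server.

    Returns 0.0 when the media server sees as much as or more than the
    system does (for example with ASI, SDI, or loopback capture).
    """
    if system_in <= 0:
        return 0.0
    return max(0.0, system_in - media_in) / system_in
```

A consistently large share suggests that something else on the host is competing with the media server for network capacity.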
Outgoing Media Server Traffic
Outgoing System Traffic