How We Build a Release in One Minute

How it All Began

Any team working on a growing software project eventually has to adopt more internal infrastructure to manage the code’s complexity and the development team’s size. One part of this is a sort of internal bookkeeping: keeping track of who made what changes, and who owns what code. Another part is the technical side: quickly and regularly building a software project that 50+ developers work on. In this article, I would like to talk about how I and the rest of the Flussonic Media Server team arrived at our current production build toolchain.

The Flussonic Media Server team has been working on the project for many years. Today, most of our development processes are automated. Different teams of developers use different degrees of automation in their projects. There is nothing wrong with using less automation, because every project is different.

Our project started like most others. We had a small development team, and one team member was responsible for building the source code, bundling it into a .deb package with dependencies and config scripts, and putting it in the Debian repository. Building a package from source seems easy at first glance, but as projects grow, the build process accumulates annoying details.

There were several problems with our approach at the beginning. One was that whenever the person responsible for the build went on vacation or took sick leave, development would slow down dramatically because there was nobody else who could do their job.

Another problem was that the build was only run on one computer. Any time the computer was updated, it was impossible to test builds, which slowed development. Building on only one machine also meant that end-users experienced some compatibility issues, because we did not usually test the project on other processor architectures and system configurations. This is a common issue in developing off-the-shelf software, because developers do not necessarily know where their code is running.

As headaches and problems grew, it became clear to us that these processes needed to be automated.

The DevOps philosophy says that developers should take an active part in delivering the results of their work to the end-user. Until developers see the feedback, they cannot assume that their work is done. This requires changes in the development team’s workflow, and a thinning or removal of the barrier between development and operation. To do this, we employed certain tools and processes, which together make up our automated build system.

We develop Flussonic, but the operation is left up to our customers - this means that we can never completely close the gap between development and operation. However, we do everything we can to improve the software with extensive testing, and make it easier to install and correctly operate.

One strategy we use to remedy the issue is Continuous Integration. About 10-20 years ago, updates for off-the-shelf software were released only about once every year and a half; it was believed that there was no need to release updates more often. This approach is now obsolete, and the industry has moved towards more frequent releases. Releasing a big update once every year and a half is a difficult, time-consuming and painful task for both the vendor and the customer, and every such update is in effect a reimplementation of the software. Since then, the spread of the Internet and the growing number of computers connected to it have enabled developers to release software updates more quickly and more smoothly. Our enormous CI test suite is the biggest reason that we can release ‘nightly,’ bleeding-edge builds every day, and ensure that they are no less stable than the public, ‘stable’ releases that are made available to clients.

Obviously, we cannot compile all of these frequent nightly releases manually. We use the open-source GitLab CI pipeline in production, which allows us to run scripts on the server for each commit and each branch, and to build each new version from scratch every time. Using Docker as a build environment for Flussonic has made building a lot easier and has increased developer happiness.

docker run or docker build?

There are two approaches to creating a build system with Docker, or more precisely, two different docker commands. Though they are very similar, and can at some points replace one another, it is important to carefully examine both.

docker run starts a Docker container. When developers say, “We have everything in Docker,” this is usually what they mean. We do not use this command to build anything; in our build system, it is only used to run tests.

There is a common pattern for using docker run to build software: mount a directory with source files into the container and perform the build step in the container itself, delegating compilation to the clean and isolated Docker environment. There is a problem with this approach: if we want to build more than one branch in parallel, several containers have to work with one set of files, and artifacts from one build can corrupt another. The shared directory therefore has to be cleaned between builds, which rules out running them at the same time.

We needed to be able to run several builds and tests at the same time, which is why we decided to only use docker build.

docker build creates an image, which can be launched using docker run. In our build system, this image is only used for running tests, and functions as a sort of isolated, reliable, and stable Makefile in the build. Building in isolation with docker build protects the compilation process from old, unnecessary, or foreign data that can interfere with the build and cause errors.

Moving to this kind of build process from an old, sequential system is not entirely painless. Some of our old sequential build scripts broke under the new system, and had to be rewritten.

How We Made Friends With docker build

Caching Docker steps is the most important feature of the Docker-based build system for our use case. Since we are using docker build, it may seem that we cannot keep intermediate build artifacts, and a new build will have to start over from the very first step. But this is not exactly true: docker build caches every step it can. Each time a step is executed, if the state of the container at the previous step has not changed from the last build, and if the command that will run the current step has not changed, then the result of the current step is taken from the cache instead of being recomputed.
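
The two conditions above can be modeled in a few lines of Python. This is only a simplified sketch of the idea, not Docker's actual implementation (real Docker also checksums the files used by COPY and ADD), and the script names in the sample Dockerfile steps are hypothetical:

```python
import hashlib


def layer_keys(instructions, base="scratch"):
    """Return one cache key per instruction; each key covers the whole chain before it."""
    keys = []
    parent = base
    for instruction in instructions:
        # The key depends on the parent state and on the instruction text,
        # mirroring the two conditions described in the paragraph above.
        parent = hashlib.sha256(f"{parent}\n{instruction}".encode()).hexdigest()
        keys.append(parent)
    return keys


def cached_steps(previous, current):
    """Count how many leading steps of `current` can be reused from `previous`."""
    hits = 0
    for old_key, new_key in zip(layer_keys(previous), layer_keys(current)):
        if old_key != new_key:
            break  # the first miss invalidates every later step as well
        hits += 1
    return hits


# Hypothetical build steps; only the last instruction differs between the two builds.
old = ["FROM debian", "RUN ./build-deps.sh", "RUN ./build-flussonic.sh", "RUN ./package.sh"]
new = ["FROM debian", "RUN ./build-deps.sh", "RUN ./build-flussonic.sh", "RUN ./package.sh --fix"]
print(cached_steps(old, new))  # 3
```

Editing only the last instruction keeps the first three steps cached, while editing the first instruction would invalidate all of them.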

There is a learning curve to tackle in working with docker build. To ensure that the build runs smoothly and quickly, developers must make sure that Docker caches every step it can for faster build times. This means learning how to investigate the logs Docker produces, and how to optimize the Dockerfile.

For example, there may be a complicated shell script that has not been migrated over fully from the old sequential-building process, that is called inside Docker, and that runs for a full 20 minutes. At the end of the script, there are some commands which need adjusting. In this situation, it would be necessary to flush the cache and restart the compilation process to run another build. It would be worth breaking this script into smaller parts to save 20 minutes of compile time with caching - but how small should the parts be? We came to the conclusion that the operations that clearly take us a lot of time and rarely change should be carried out separately, higher up in the Dockerfile.
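
The effect of ordering can be put into numbers with a small Python sketch. The step names and durations below are assumptions chosen to match the 20-minute example, not measurements from our build:

```python
def rebuild_minutes(steps, changed):
    """Minutes spent rebuilding when the step named `changed` is edited.

    `steps` is an ordered list of (name, minutes). Every step before the
    edited one is a cache hit; the edited step and all later ones rerun.
    """
    names = [name for name, _ in steps]
    first_miss = names.index(changed)
    return sum(minutes for _, minutes in steps[first_miss:])


# Hypothetical durations: a slow, rarely edited step and a fast, often edited one.
slow_first = [("compile_dependencies", 20), ("adjust_package", 1)]
slow_last = [("adjust_package", 1), ("compile_dependencies", 20)]

print(rebuild_minutes(slow_first, "adjust_package"))  # 1
print(rebuild_minutes(slow_last, "adjust_package"))   # 21
```

With the slow step placed first, editing the frequently changed step costs one minute; with the order reversed, the same edit forces the full 21-minute rebuild.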

It is important to keep in mind that when we play with the Dockerfile and select the right commands, Docker stores the entire tree of variations. Docker caches not only the last run but all options during the assembly. This means that caching uses an incredible amount of disk space, but it is worth it. However, Docker does not come with a built-in way to intelligently clean the cache and delete only things that are not needed, so a custom solution is required.

In our build process, one mandatory step is to compile all the dependencies. All of our project’s dependencies take up about a gigabyte of disk space, and can be downloaded and compiled in about 4 seconds. Docker steps through the Dockerfile, sees that the initial instructions have not changed, meaning that nothing needs to be recompiled, and moves on to the next step: uploading to the package repository. There is a trick that we came up with here: our uploader knows not to upload a package to the server if it is already present.
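
The uploader trick amounts to an existence check before each upload. The following Python sketch illustrates the idea with an in-memory stand-in for the package server; the class, the function names, and the sample version are hypothetical and do not describe our actual uploader:

```python
class PackageRepository:
    """In-memory stand-in for a package server (hypothetical, for illustration)."""

    def __init__(self):
        self._packages = {}
        self.uploads = 0

    def exists(self, name, version):
        return (name, version) in self._packages

    def store(self, name, version, payload):
        self._packages[(name, version)] = payload
        self.uploads += 1


def upload_if_missing(repo, name, version, payload):
    """Upload a package only when the repository does not already hold it."""
    if repo.exists(name, version):
        return False  # already present: nothing to transfer
    repo.store(name, version, payload)
    return True


repo = PackageRepository()
print(upload_if_missing(repo, "python", "3.9.1", b"deb-bytes"))  # True
print(upload_if_missing(repo, "python", "3.9.1", b"deb-bytes"))  # False
print(repo.uploads)  # 1
```

Because the second call is a no-op, the upload step can run on every build without transferring unchanged packages again.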

When we create a new branch, all of its dependencies are compiled or taken from the cache, which also happens in about 4 seconds with our in-house package loading system. All you need to do is to query whether a package exists in the new branch and all of the dependencies will be automatically populated in the newly created folder with the new branch.

In the past, the process of recompiling dependencies took us about a week. This was not because the build took a week to perform, but rather because no developer wanted to go through the tedious process often. Some dependencies, such as Python, took a long time to compile and test, so we did not want to rebuild them each time a new commit was pushed. A major tenet of the DevOps philosophy says that any procedure that is codified but not executed regularly should be considered broken and unreliable. As a result, rebuilding our version of Python is now a routine task: the result immediately passes through all the tests and is only made available to customers if everything goes well.

Branch Hygiene

Our build process uses a multi-stage Docker setup. We found that copying data from one image to another directly saves a lot of time as compared to transferring data through the context. Dividing a long build process into smaller steps can also help parallelize it: for example, it may be more convenient to both run tests and assemble packages at the same time.

Working with branches and multiple images, we encountered the problem of correctly specifying parent images. In a Dockerfile, a parent image is usually specified with the instruction:

FROM flussonic-source

However, we realized that it is better not to hardcode the name of the parent image in the Dockerfile, but instead to let the build invocation supply it. When compiling, our process requires the full name of the parent image for the dependency, including the name of the branch, to be specified in the build arguments (set through the --build-arg flag, like so):

docker build --build-arg SOURCE_IMAGE=flussonic-src:${CI_COMMIT_REF_SLUG} .

The FROM instruction in the Dockerfile accepts this build argument:

ARG SOURCE_IMAGE
FROM ${SOURCE_IMAGE}
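
A CI script could assemble this invocation per branch, for example as in the Python sketch below. The helper name and the flussonic-build output tag are assumptions for illustration; only the SOURCE_IMAGE build argument and the flussonic-src image name come from the commands above:

```python
import os
import shlex


def docker_build_argv(branch_slug, context="."):
    """Build the argv for a branch-specific `docker build` invocation."""
    source_image = f"flussonic-src:{branch_slug}"
    return [
        "docker", "build",
        "--build-arg", f"SOURCE_IMAGE={source_image}",
        # Hypothetical output tag; the article does not name the resulting image.
        "--tag", f"flussonic-build:{branch_slug}",
        context,
    ]


# GitLab CI sets CI_COMMIT_REF_SLUG; fall back to a sample value elsewhere.
slug = os.environ.get("CI_COMMIT_REF_SLUG", "feature-x")
print(shlex.join(docker_build_argv(slug)))
```

The returned list can be passed to subprocess.run(argv, check=True) as is. Building the command as a list rather than a string avoids quoting problems with unusual branch names.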

When the parent image name is hardcoded in the Dockerfile, every branch shares the same parent image, and Docker generates the same sort of mess that results from two processes trying to build in the same directory. So, our configuration passes the branch name through each step of the build process. This makes Docker generate a lot of images, but that is okay: they are all cached, and the differences between them are small. However, they do add up eventually, and these images are not cleared automatically. To clean up after old branches, we use GitLab Environments.
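
Whatever mechanism triggers the cleanup, the selection step is simple: find the images whose branch no longer exists. Here is a minimal Python sketch, assuming images are tagged as repository:branch-slug as in the build command above; the function name and the sample data are hypothetical:

```python
def stale_image_tags(image_tags, live_branch_slugs, repository="flussonic-src"):
    """Return tags of `repository` whose branch slug is no longer live."""
    stale = []
    for tag in image_tags:
        name, _, slug = tag.partition(":")
        # Ignore images from other repositories and untagged names.
        if name == repository and slug and slug not in live_branch_slugs:
            stale.append(tag)
    return stale


# Sample data for illustration only.
images = ["flussonic-src:master", "flussonic-src:feature-x", "flussonic-src:old-fix", "debian:bullseye"]
live = {"master", "feature-x"}
print(stale_image_tags(images, live))  # ['flussonic-src:old-fix']
```

Each returned tag could then be removed with docker image rm.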

Testing

We can safely compile branches and experiment on them without fear of breaking something in the master branch. The reason we do not worry is that we run a large suite of tests. This is the point in our build process where we use docker run. All of our tests are run in containers that do not leave behind traces or artifacts. We run tests in Docker because, in addition to isolating containers on the disk, Docker also isolates them on the network. If we write a test outside a container that has to listen on port 8080, then we cannot run two such tests at the same time. However, Docker enables us to do this easily: two containers can use the same port and never interfere with each other. This frees up developer time that would otherwise be spent selecting free ports for tests.
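
The host-side conflict is easy to reproduce. The Python sketch below shows only the problem, not the Docker setup: two listeners in the same network namespace cannot bind the same port. It uses an OS-assigned port instead of 8080 so that it does not depend on what else is running on the machine:

```python
import socket


def second_bind_fails(host="127.0.0.1"):
    """Return True if a second listener cannot bind a port that is already taken."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0 lets the OS pick a free port
        first.listen()
        port = first.getsockname()[1]
        try:
            second.bind((host, port))  # same port, same network namespace
        except OSError:
            return True
        return False
    finally:
        first.close()
        second.close()


print(second_bind_fails())
```

Inside separate containers, each test gets its own network namespace, so both can listen on the same port number without this error.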

After building, we run the new version through a series of tests. The most basic ones are acceptance tests, which take around 5-7 seconds to run and check the basic functionality of the new build - that Flussonic can transcode, respond to requests, and check authorization. This small set of tests is deeply integrated and validates the largest possible number of subsystems in one go. If the developer who committed the new version broke something, they will not have to wait for the larger test suite to finish running to know about it.

After this, the package is created and the more time-consuming tests are run. These are highly parallelized and used to take about 20-40 minutes; recently, we lowered the testing time to 10 minutes, which has greatly improved the speed of iteration. For some projects, a full run may take a day, which makes it impossible to wait for test results online. These external, high-level tests are what make continuous integration and continuous delivery possible: in each new commit, we check that everything is assembled, installed, started, and works as it is supposed to.

Future Plans

Currently, we conduct all tests on Intel processors, and we are planning to create the same robust external testing infrastructure on ARM64 systems.

We want to rewrite as many tests as possible to be blackbox tests, that is, to test Flussonic without access to the source code. Simply put, the tests should be written outside Flussonic in Python, not inside Flussonic in Erlang. These kinds of tests will be very reliable and long-lived, because the protocols Flussonic works with are very stable, and will not change for 10-15 years.

We are also going to try launching Flussonic builds in Kubernetes through GitLab. This GitLab feature looks like a very promising solution for manual testing: from a given branch in the repository, you could run a separate cluster with its own set of hostnames. At the end of the test, the branch is deleted and no longer takes up system resources.

This is how, slowly and gradually, we came to our current automated build system. Today, builds that used to take weeks take seconds. It is important to understand that DevOps is an ongoing process - you can always do better. Our development team is always trying new approaches and experimenting with new technologies to engineer a better build system.