Today is my last day at work for this year. I spent the last quarter working in the Red Hat Installer team, on a temporary rotation. They needed some help with their testing workflows and CI, it was a good chance to reduce “bus factor 1” activities in my home team (Cockpit), and for me personally it was a great opportunity to make new friends and learn new stuff. So, win-win-win!
I got a very warm welcome: everyone was very helpful in showing me around, open to changes, and generous with feedback about my proposals and implementation. So I can wholeheartedly recommend such a rotation to everyone. Thanks to the team for a wonderful quarter!
CI Design principles
Tests and automated maintenance procedures get defined in terms of containers and GitHub workflows. There must not be magic infrastructure! Infrastructure should be like an Orc: very strong and sturdy, but dumb. It should not do anything different from what a developer would do on their own machine, but just do it much more often.
Containers provide a well-defined and reproducible environment. They are supported by GitHub, GitLab, Travis, CircleCI, Kubernetes, OpenShift, i.e. pretty much every cloud and CI system out there. There are lots of tools, infrastructure, and automation available around them. Likewise, a developer can easily run them locally with podman, docker, LXC/lxd, or systemd-nspawn without assuming much about or endangering the host system.
GitHub Workflows are YAML/shell scripts that run on any GitHub event, such as opening/updating a pull request, opening or commenting on an issue, a scheduled time, or a manual run. They run in a clean environment, which ensures that the workflow contains every required step and dependency. Workflows are kept in the upstream repository, and thus co-evolve with the project, are accessible, and can be run from PRs to test workflow changes.
While their syntax is unique to GitHub, the general concept is quite widespread these days: GitLab pipelines, Semaphore 2.0 pipelines, and in a narrower sense (focused on tests and releases) Travis jobs work along the same lines. So if it ever becomes necessary to move away from GitHub, there will surely be some work to port the workflows to some new syntax, but that’s not particularly difficult.
For example, the kickstart-tests container-autoupdate.yml workflow refreshes the quay.io/rhinstaller/kstest-runner container image every week. A more complex case is anaconda’s validate.yml workflow which runs tests on pull requests; it uses scenario matrices, conditional steps, and passing information between steps, so understanding it may take a while, but it is still very explicit and IMHO readable. The central bit is make -f Makefile.am container-ci, which is exactly the command that a developer runs on their local machine; the rest is more or less setup, logging, artifacts, and credentials.
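To make the “scenario matrix” idea a bit more concrete, here is a minimal Python sketch of how such a matrix fans out into individual jobs. GitHub performs this expansion itself from the workflow YAML; the expand_matrix helper and the axis names and values used below are hypothetical and not taken from validate.yml.

```python
from itertools import product


def expand_matrix(matrix):
    """Expand {axis: [values]} into one dict per combination of values."""
    axes = sorted(matrix)  # fixed axis order keeps the result deterministic
    combos = product(*(matrix[axis] for axis in axes))
    return [dict(zip(axes, combo)) for combo in combos]


if __name__ == "__main__":
    # Hypothetical axes, for illustration only
    scenarios = {"release": ["rawhide", "eln"], "kind": ["unit", "rpm"]}
    for job in expand_matrix(scenarios):
        # Each job would run the same make target in its own container
        print(job)
```

Every combination becomes its own job, and each job ends up calling the same single make target, which is what keeps the CI run and the local run equivalent.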
With the above design decisions and containerization, the requirements on the infrastructure are relatively modest: It must be powerful enough to sustain the required workload (which is mostly a question of cost), and able to run containers (but these days, that’s the case everywhere). That makes it easy to migrate workloads between different providers (“hybrid cloud”), especially for implementing on-demand fallbacks on outages.
The best infrastructure is of course that which we don’t have to maintain ourselves. GitHub currently generously offers unlimited resources for Free Software projects, so a big “thank you” to Microsoft at this point! (Gosh, if you had told my younger self from 20 or 15 years ago that, I would have laughed in your face!) Thus we are using that wherever possible, like for running unit tests for Fedora or refreshing container images.
The main limitation with GitHub’s VM instances is that they don’t offer /dev/kvm, thus you can’t start your own VMs. That is the bread and butter of installer integration tests, so for kickstart-tests we have to use something else. The only public CI provider known to me that offers that is Travis. We quickly ran into their limit of 10,000 free CI minutes every month, so we upgraded to a paid plan, and it is now chugging along nicely.
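Whether a given runner can start VMs at all is easy to probe before launching VM-based tests. The following is a minimal sketch, assuming that “usable” means the device node exists and is readable and writable by the current user; the kvm_usable helper is hypothetical and not part of kickstart-tests.

```python
import os


def kvm_usable(device="/dev/kvm"):
    """Return True if the KVM device exists and is readable and writable."""
    return os.path.exists(device) and os.access(device, os.R_OK | os.W_OK)


if __name__ == "__main__":
    if kvm_usable():
        print("KVM available: VM-based integration tests can run here")
    else:
        print("no usable /dev/kvm: skip VM-based tests or pick another runner")
```

Inside a container the device additionally has to be passed through from the host, so a check like this would need to run in the same environment as the tests themselves.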
That leaves tests for RHEL products/branches, which need to happen on Red Hat internal infrastructure. For that I chose to bring up some self-hosted GitHub runners in our internal “upshift” OpenStack cloud. They are very simple to set up: essentially “bring up a big enough instance”, then install podman and the GitHub action runner; the latter listens for GitHub job requests, runs the workflow, and reports back the results and artifacts. With these we can keep the workflows that run on public and private infrastructure very uniform. Unfortunately the Ansible scripts for that are not in a public repository right now, but they could be published easily.
I also provided a script to start such a runner on any system in a container, which makes it very easy (but still safe) to write and debug such workflows.
In my intro mail I said:
As a developer, it must be easy, fun, and safe to run, write, and debug tests on a local machine. If it’s not, developers just won’t do it, and as a result, software will keep being buggy.
That means that after cloning the git repository there needs to be one documented and safe command to run the tests in the standard scenario. This is the case now for both the Anaconda (installer) project itself, as well as the integration test suite kickstart-tests.
The latter was quite a challenge – it took weeks of work to change the tests to not require root privileges, loop devices, and other difficult assumptions any more, so that they can run safely on developer and CI machines in containers. Also, a lot of them have failed for a long time, so it took quite some time to sift through them and fix/mark/skip them accordingly. After that, providing a convenient developer workflow and deploying the tests on CI was reasonably straightforward with the above design.
At the end of this quarter, each and every pull request to anaconda and kickstart-tests is gated by tests, rhinstaller members can request kickstart-tests in Anaconda PRs, and the nightly runs of all kickstart-tests in various scenarios are now reasonably reliable and normally passing.
The main purpose of these CI improvements is to be able to land changes with confidence. But this of course only works with a certain discipline: the nightly and PR tests must be kept green. Regressions need to be investigated immediately, and reported/marked/skipped accordingly (broken windows theory), otherwise the tests quickly lose their value.
Also, right now the unit tests only run for Fedora Rawhide, Fedora ELN, and RHEL 8; and the kickstart-tests only for Rawhide and RHEL 8. These should quickly be expanded to cover all supported OSes, mostly CentOS Stream and RHEL 9.
Finally, log and result/artifact browsing for GitHub workflows is not very convenient, and not easily accessible for data extraction/analysis. Programmatically it is easy to download and process the artifacts, but this does not currently happen.
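As a rough idea of what such processing could look like, here is a minimal Python sketch that lists the non-expired artifacts of a repository through GitHub’s REST API (the “list artifacts for a repository” endpoint). The helper names are hypothetical, the token handling is an assumption, and nothing like this exists in the project today.

```python
import json
import urllib.request

API = "https://api.github.com"


def artifacts_url(owner, repo, per_page=30):
    """Build the URL of the 'list artifacts for a repository' endpoint."""
    return f"{API}/repos/{owner}/{repo}/actions/artifacts?per_page={per_page}"


def live_artifacts(payload):
    """Return (name, download URL) pairs for artifacts that have not expired."""
    return [
        (artifact["name"], artifact["archive_download_url"])
        for artifact in payload.get("artifacts", [])
        if not artifact.get("expired", False)
    ]


def fetch_artifacts(owner, repo, token):
    """Query the API; 'token' is assumed to be a personal access token."""
    request = urllib.request.Request(
        artifacts_url(owner, repo),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return live_artifacts(json.load(response))
```

Each download URL points to a zip archive, so a follow-up step could fetch and unpack the archives and feed the logs into whatever analysis is wanted.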