Cockpit CI demands
Testing Cockpit is not an easy task – each pull request gets tested by over 300 browser integration test cases on a dozen operating systems. Each per-OS test suite starts hundreds of virtual machines, and many tests exercise these VMs quite hard: provoking crashes, rebooting, attaching storage or network devices, or changing boot loader arguments.
With these requirements we absolutely depend on a working /dev/kvm in the test environment, and a performant host to run all these tests in a reasonable time.
Unfortunately, the /dev/kvm requirement is a killer for most public cloud services. Many of them, like GitHub Actions, regular AWS EC2 instances, Azure, CircleCI, or Testing Farm, don’t offer it at all. So we currently run our main workload on a bunch of under-maintained RHEL 7 servers, and some of it on CentOS CI – the latter is great, it runs OpenShift on bare metal, and we have a “special agreement” with them to expose the kvm device into the containers. Pretty much everyone else insists on shoving VMs in between the nodes and Kubernetes, which has traditionally ruined the show.
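The requirement itself is trivial to probe – a check along these lines (a generic sketch, not our actual CI code) settles the question for any candidate environment:

test -c /dev/kvm && echo "KVM available" || echo "no KVM - tests cannot run"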
We can fall back to EC2 by spinning up a “bare metal” instance when our main machines go down; but for these you really have to pay through the nose (about a hundred dollars per day for a third of our usual capacity), and they don’t scale well either.
Hence, I was quite excited when I read that Google Cloud officially supports nested virtualization, at least to a certain degree. So I decided to spend my Day of Learning on investigating that.
Google Cloud Run
I first looked at Cloud Run – structurally, this is exactly what we want: Trigger the run of a few quay.io/cockpit/tasks containers depending on our queue length, and not be bothered by the management of the nodes which run them. My first impression was really good: The documentation and the gcloud command line client are exceedingly well done and helpful. The web UI is a bit crowded, but it’s easy enough to find what you need.
So I had the “hello world” example running in just a few minutes. Building and running our own cockpit/tasks container was also straightforward.
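For reference, the deployment boils down to something like this – a sketch rather than my exact commands: Cloud Run only pulls images from Google’s own registries, so the quay.io image has to be mirrored there first, and my-project stands in for the real project ID:

podman pull quay.io/cockpit/tasks
podman tag quay.io/cockpit/tasks gcr.io/my-project/cockpit-tasks
podman push gcr.io/my-project/cockpit-tasks
gcloud run deploy cockpit-tasks --image=gcr.io/my-project/cockpit-tasks --region=us-central1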
However, there is no /dev/kvm, there is no knob to enable it, and there are zero Google search hits about how to enable it – so it seems it just can’t be done.
Google Cloud Platform
It seems one really needs to manage the underlying instances manually. GCP offers the create-with-container operation as a nice middle ground:
gcloud compute instances create-with-container kvmtest --container-image=gcr.io/google-containers/busybox \
--zone us-central1-a --enable-nested-virtualization --min-cpu-platform="Intel Haswell" \
--container-arg=ls --container-arg='-l' --container-arg=/dev/kvm
That has the magic options to supposedly enable /dev/kvm, but the VM still does not have it (and of course the container does not either). I think the reason is simply that their “container optimized OS” does not have the kvm_intel kernel module. 😢
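That diagnosis is easy to double-check from the outside; assuming the instance name and zone from above, a quick probe looks like this, and on the container optimized OS both parts should come back empty:

gcloud compute ssh kvmtest --zone us-central1-a --command 'lsmod | grep kvm; ls -l /dev/kvm'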
However, with a standard OS it works:
gcloud compute instances create kvmtest --zone us-central1-a --provisioning-model=SPOT \
--enable-nested-virtualization --min-cpu-platform="Intel Haswell" \
--image-project fedora-coreos-cloud --image-family=fedora-coreos-stable
Voilà, /dev/kvm! The default machine type is just rather underpowered for our CI tasks, though. I spent a good amount of time iterating through all the --machine-types to figure out the two which actually work (n2-standard-* and c2-standard-*), and noticing that what they sell as “8 CPUs” is really just 4. An n2-standard-8 has 4 CPUs and 32 GiB RAM, which is enough to run a single OS validation with 8 parallel tests.
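For the record, the candidate shapes can also be enumerated up front rather than guessed one by one, e.g.:

gcloud compute machine-types list --zones=us-central1-a --filter="name ~ '[nc]2-standard'"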
Benchmarking
Then I checked out cockpit and ran a few initial benchmarks, comparing them with a local run on my ThinkPad X1 (Intel i7). In both cases I ran four tests in parallel.
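For context, the setup is just a checkout of the upstream repository – test/image-prepare lives there; the exact parallel-test invocation below is from memory, so treat it as a sketch:

git clone https://github.com/cockpit-project/cockpit
cd cockpit
time test/image-prepare
time test/common/run-tests --jobs 4 TestAccounts   # flags from memory, may need adjusting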
test/image-prepare is by and large a make dist and rpmbuild, i.e. compiler-heavy. It takes 3min 30s on my ThinkPad, and 6min 50s on the GCloud instance. Running all ~12 TestAccounts integration checks is similar: 3min 14s local, 6min 36s on GCloud. They run a different workload (Chromium outside on the host, cockpit and OS operations in the VM), but the “2×” slowdown was very similar.
Finally I ran the whole suite. Our real-iron machines run 5 tests in parallel, and shred through all of them in 35 to 45 minutes. On the GCloud instance I ran 8 parallel tests, and it took more than two hours; worse, I got 15 test failures, mostly due to timeouts and repeated crashes. While it ran, I checked top and virsh list every now and then, and the VM felt really sluggish – sometimes virsh hung for a minute, and there were always dozens of zombie processes around which did not get cleaned up for a long time.
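The textbook way to quantify that kind of sluggishness from inside a VM is CPU steal time – the last column (“st”) of vmstat shows how much CPU time the hypervisor withheld from the guest (a generic diagnostic, not something I recorded during this run):

vmstat 5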
I re-did that benchmark for all three working machine types (default with --custom-{cpu,memory} and {c,n}2-standard); there were some minor timing differences (which could also be just noise), but the general feeling of sluggishness was the same.
Conclusion
There is really no replacement for running on bare metal for our kind of workload.
I’m sure that by reducing parallel tests dramatically, they could eventually succeed; but that would mean scaling up the VMs, and thus both resource/energy usage and cost, beyond what’s reasonable. I was hoping that nested KVM had become better in the last 5 years (the last time I seriously tried it), but it seems not. 😢
Appendix: Integration with Ansible
While the tests were running, I looked into automating the setup of such an instance. Our cockpituous repo has Ansible playbooks for all our other machines, and there is a module for GCP as well. Its documentation leaves something to be desired though, and there are some outright bugs, so it took quite a long time to get something working. There are still a bunch of FIXMEs, but as we are not going to use this after all, I just want to keep it here in case I ever need to run something else on GCP.
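For completeness, the service account and credentials file mentioned in the playbook’s header comments below can be created roughly like this (the account name is a placeholder, and the exact role may need tweaking):

gcloud iam service-accounts create ansible-launcher
gcloud projects add-iam-policy-binding my-project \
    --member=serviceAccount:ansible-launcher@my-project.iam.gserviceaccount.com \
    --role=roles/compute.instanceAdmin.v1
gcloud iam service-accounts keys create gcp-credentials.json \
    --iam-account=ansible-launcher@my-project.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=$PWD/gcp-credentials.json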
The main playbook ansible/gcp/launch-tasks.yml:
# Requirements:
# ansible-galaxy collection install google.cloud
# dnf install python3-google-auth
# create service account, store credentials in json file, point to it in $GOOGLE_APPLICATION_CREDENTIALS
# Run "gcloud compute ssh <instancename>" once to set up ~/.ssh/google_compute_engine
---
- name: Create tasks runner GCP instance
  hosts: localhost
  gather_facts: false
  vars_files:
    - gcp_defaults.yml
  tasks:
    # https://docs.ansible.com/ansible/latest/collections/google/cloud/gcp_compute_instance_module.html
    # FIXME: spot instance (not supported by Ansible module)
    - name: Create instance
      google.cloud.gcp_compute_instance:
        auth_kind: "{{ gcp_auth_kind }}"
        service_account_file: "{{ lookup('env', 'GOOGLE_APPLICATION_CREDENTIALS') }}"
        zone: "{{ gcp_zone }}"
        project: "{{ gcp_project }}"
        machine_type: c2-standard-8
        # FIXME: parameterize for n_instances, or auto-generate one at random
        name: cockpit-tasks
        disks:
          - auto_delete: true
            boot: true
            initialize_params:
              disk_size_gb: 20
              #image_family: fedora-coreos-stable  # FIXME: does not work, need to search for image name dynamically
              source_image: "projects/fedora-coreos-cloud/global/images/fedora-coreos-35-20220213-3-0-gcp-x86-64"
        # FIXME: add persistent image cache disk
        network_interfaces:
          - access_configs:
              - name: External NAT
                type: ONE_TO_ONE_NAT
      register: gcp

    - name: Add new instance to host group
      add_host:
        hostname: "{{ gcp.networkInterfaces[0].accessConfigs[0].natIP }}"
        groupname: launched
        ansible_user: "core"
        ansible_become: yes
        ansible_ssh_private_key_file: "{{ lookup('env', 'HOME') }}/.ssh/google_compute_engine"
        ansible_ssh_common_args: "-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o CheckHostIP=no"

# from here on it's identical to ansible/aws/launch-tasks.yml
And the referenced ansible/gcp/gcp_defaults.yml:
gcp_zone: us-central1-a
# FIXME: "My First Project", change to production one
gcp_project: organic-airship-343105
gcp_auth_kind: serviceaccount
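With both files in place, launching is then a plain ansible-playbook run, and teardown goes through gcloud (assuming the requirements from the playbook header are met):

ansible-playbook ansible/gcp/launch-tasks.yml
gcloud compute instances delete cockpit-tasks --zone us-central1-a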