Cockpit CI demands
Testing Cockpit is not an easy task – each pull request gets tested by over 300 browser integration test cases on a dozen operating systems. Each per-OS test suite starts hundreds of virtual machines, and many tests exercise these VMs quite hard: provoking crashes, rebooting, attaching storage or network devices, or changing boot loader arguments.
With these requirements we absolutely depend on a working /dev/kvm in the test environment, and a performant host to run all these tests in a reasonable time.
Unfortunately, the /dev/kvm requirement is a killer for most public cloud services. Many of them, like GitHub Actions, regular AWS EC2 instances, Azure, CircleCI, or Testing Farm, don’t offer it at all. So we currently run our main workload on a bunch of under-maintained RHEL 7 servers, and some of it on CentOS CI – the latter is great, it runs OpenShift on bare metal, and we have a “special agreement” with them to expose the kvm device into the containers. Pretty much everyone else insists on shoving VMs in between the nodes and Kubernetes, which has traditionally ruined the show.
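The requirement itself is trivial to probe – a check along these lines (a generic sketch, not our actual CI code) settles the question for any candidate environment:

test -c /dev/kvm && echo "KVM available" || echo "no KVM - tests cannot run"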
We can fall back to EC2 by spinning up a “bare metal” instance when our main machines go down; but for these you really have to pay through the nose (about a hundred dollars per day for a third of our usual capacity), and they don’t scale well either.
Hence, I was quite excited when I read that Google Cloud officially supports nested virtualization, at least to a certain degree. So I decided to spend my Day of Learning on investigating that.
Google Cloud Run
I first looked at Cloud Run – structurally, this is exactly what we want: Trigger the run of a few quay.io/cockpit/tasks containers depending on our queue length, and not be bothered by the management of the nodes which run them. My first impression was really good: The documentation and the gcloud command line client are exceedingly well done and helpful. The web UI is a bit crowded, but it’s easy enough to find what you need.
So I had the “hello world” example running in just a few minutes. Building and running our own cockpit/tasks container was also straightforward.
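For reference, the deployment boils down to something like this – a sketch rather than my exact commands: Cloud Run only pulls images from Google’s own registries, so the quay.io image has to be mirrored there first, and my-project stands in for the real project ID:

podman pull quay.io/cockpit/tasks
podman tag quay.io/cockpit/tasks gcr.io/my-project/cockpit-tasks
podman push gcr.io/my-project/cockpit-tasks
gcloud run deploy cockpit-tasks --image=gcr.io/my-project/cockpit-tasks --region=us-central1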
However, there is no /dev/kvm, there is no knob to enable it, and there are zero Google search hits about how to enable it – so it seems it just can’t be done.
Google Cloud Platform
It seems one really needs to manage the underlying instances manually. GCP offers the create-with-container operation as a nice middle ground:
gcloud compute instances create-with-container kvmtest --container-image=gcr.io/google-containers/busybox \
--zone us-central1-a --enable-nested-virtualization --min-cpu-platform="Intel Haswell" \
--container-arg=ls --container-arg='-l' --container-arg=/dev/kvm
That has the magic options to supposedly enable /dev/kvm, but the VM still does not have it (and of course the container does not either). I think the reason is simply that their “container optimized OS” does not have the kvm_intel kernel module. 😢
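That diagnosis is easy to double-check from the outside; assuming the instance name and zone from above, a quick probe looks like this, and on the container optimized OS both parts should come back empty:

gcloud compute ssh kvmtest --zone us-central1-a --command 'lsmod | grep kvm; ls -l /dev/kvm'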
However, with a standard OS it works:
gcloud compute instances create kvmtest --zone us-central1-a --provisioning-model=SPOT \
--enable-nested-virtualization --min-cpu-platform="Intel Haswell" \
--image-project fedora-coreos-cloud --image-family=fedora-coreos-stable
Voilà, /dev/kvm! The default machine type is just rather underpowered for our CI tasks, though. I spent a good amount of time iterating through all the --machine-types to figure out the two which actually work (n2-standard-* and c2-standard-*), and noticing that what they sell as “8 CPUs” is really just 4. An n2-standard-8 has 4 CPUs and 32 GiB RAM, which is enough to run a single OS validation with 8 parallel tests.
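For the record, the candidate shapes can also be enumerated up front rather than guessed one by one, e.g.:

gcloud compute machine-types list --zones=us-central1-a --filter="name ~ '[nc]2-standard'"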
Benchmarking
Then I checked out cockpit and ran a few initial benchmarks, comparing them with a local run on my ThinkPad X1 (Intel i7). In both cases I ran four tests in parallel.
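For context, the setup is just a checkout of the upstream repository – test/image-prepare lives there; the exact parallel-test invocation below is from memory, so treat it as a sketch:

git clone https://github.com/cockpit-project/cockpit
cd cockpit
time test/image-prepare
time test/common/run-tests --jobs 4 TestAccounts   # flags from memory, may need adjusting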
test/image-prepare is by and large a make dist and rpmbuild, i.e. compiler-heavy. It takes 3min 30s on my ThinkPad, and 6min 50s on the GCloud instance. Running all ~12 TestAccounts integration checks is similar: 3min 14s local, 6min 36s on GCloud. They run a different workload (Chromium outside on the host, cockpit and OS operations in the VM), but the “2×” slowdown was very similar.
Finally I ran the whole suite. Our real-iron machines run 5 tests in parallel, and shred through all of them in 35 to 45 minutes. On the GCloud instance I ran 8 parallel tests, and it took more than two hours; worse, I got 15 test failures, mostly due to timeouts and repeated crashes. While it ran, I checked top and virsh list every now and then, and the VM felt really sluggish – sometimes virsh hung for a minute, and there were always dozens of zombie processes around which did not get cleaned up for a long time.
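The textbook way to quantify that kind of sluggishness from inside a VM is CPU steal time – the last column (“st”) of vmstat shows how much CPU time the hypervisor withheld from the guest (a generic diagnostic, not something I recorded during this run):

vmstat 5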
I re-did that benchmark for all three working machine types (default with --custom-{cpu,memory} and {c,n}2-standard); there were some minor timing differences (which could also be just noise), but the general feeling of sluggishness was the same.
Conclusion
There is really no replacement for running on bare metal for our kind of workload.
I’m sure that by reducing parallel tests dramatically, they could eventually succeed; but that would mean scaling up the VMs, and thus both resource/energy usage and cost, beyond what’s reasonable. I was hoping that nested KVM had become better in the last 5 years (the last time I seriously tried it), but it seems not. 😢
Appendix: Integration with Ansible
While the tests were running, I looked into automating the setup of such an instance. Our cockpituous repo has Ansible playbooks for all our other machines, and there is a module for GCP as well. Its documentation leaves something to be desired though, and there are some outright bugs, so it took quite a long time to get something working. There are still a bunch of FIXMEs, but as we are not going to use this after all, I just want to keep it here in case I ever need to run something else on GCP.
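For completeness, the service account and credentials file mentioned in the playbook’s header comments below can be created roughly like this (the account name is a placeholder, and the exact role may need tweaking):

gcloud iam service-accounts create ansible-launcher
gcloud projects add-iam-policy-binding my-project \
    --member=serviceAccount:ansible-launcher@my-project.iam.gserviceaccount.com \
    --role=roles/compute.instanceAdmin.v1
gcloud iam service-accounts keys create gcp-credentials.json \
    --iam-account=ansible-launcher@my-project.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=$PWD/gcp-credentials.json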
The main playbook ansible/gcp/launch-tasks.yml:
# Requirements:
# ansible-galaxy collection install google.cloud
# dnf install python3-google-auth
# create service account, store credentials in json file, point to it in $GOOGLE_APPLICATION_CREDENTIALS
# Run "gcloud compute ssh <instancename>" once to set up ~/.ssh/google_compute_engine
---
- name: Create tasks runner GCP instance
  hosts: localhost
  gather_facts: false
  vars_files:
    - gcp_defaults.yml
  tasks:
    # https://docs.ansible.com/ansible/latest/collections/google/cloud/gcp_compute_instance_module.html
    # FIXME: spot instance (not supported by Ansible module)
    - name: Create instance
      google.cloud.gcp_compute_instance:
        auth_kind: "{{ gcp_auth_kind }}"
        service_account_file: "{{ lookup('env', 'GOOGLE_APPLICATION_CREDENTIALS') }}"
        zone: "{{ gcp_zone }}"
        project: "{{ gcp_project }}"
        machine_type: c2-standard-8
        # FIXME: parameterize for n_instances, or auto-generate one at random
        name: cockpit-tasks
        disks:
          - auto_delete: true
            boot: true
            initialize_params:
              disk_size_gb: 20
              #image_family: fedora-coreos-stable  # FIXME: does not work, need to search for image name dynamically
              source_image: "projects/fedora-coreos-cloud/global/images/fedora-coreos-35-20220213-3-0-gcp-x86-64"
        # FIXME: add persistent image cache disk
        network_interfaces:
          - access_configs:
              - name: External NAT
                type: ONE_TO_ONE_NAT
      register: gcp

    - name: Add new instance to host group
      add_host:
        hostname: "{{ gcp.networkInterfaces[0].accessConfigs[0].natIP }}"
        groupname: launched
        ansible_user: "core"
        ansible_become: yes
        ansible_ssh_private_key_file: "{{ lookup('env', 'HOME') }}/.ssh/google_compute_engine"
        ansible_ssh_common_args: "-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o CheckHostIP=no"

# from here on it's identical to ansible/aws/launch-tasks.yml
And the referenced ansible/gcp/gcp_defaults.yml:
gcp_zone: us-central1-a
# FIXME: "My First Project", change to production one
gcp_project: organic-airship-343105
gcp_auth_kind: serviceaccount
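With both files in place, launching is then a plain ansible-playbook run, and teardown goes through gcloud (assuming the requirements from the playbook header are met):

ansible-playbook ansible/gcp/launch-tasks.yml
gcloud compute instances delete cockpit-tasks --zone us-central1-a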