Revisiting Google Cloud Performance for KVM-based CI

Summary from 2022

Back then, I evaluated Google Cloud Platform for running Cockpit’s integration tests. Nested virtualization on GCE was far too slow, crashy, and unreliable for our workload: tests that ran in 35-45 minutes on bare metal (my laptop) took over 2 hours and produced 15 failures, timeouts, and crashes. The nested KVM simply wasn’t fast enough.

On today’s Day of Learning, I gave this another shot, and was pleasantly surprised.

Testing the Landscape

I started by checking AWS EC2 again. Sadly, there is still no support for nested virtualization, not even unofficially. It only works on bare-metal instances, which are too expensive and inelastic for CI needs.

Next, I checked whether Google’s higher-level services had finally gained /dev/kvm support. I tried Cloud Run, Batch, and GKE, but none of them expose the KVM device to containers. This remains disappointing, as it would be ideal not to have to worry about instance management at all.

Nested virtualization is only supported on the lower-level Compute Engine, which means managing the VM ourselves. The create-with-container command is deprecated, so I went straight to regular instances with Fedora CoreOS.

GCE Compute Performance

I ran into lots of failure paths and had to experiment with machine types and CPU platforms, fight libvirt issues, and so on. But this works nicely:

gcloud compute instances create kvmtest-c4 --zone=us-central1-a \
    --machine-type=c4-highcpu-8 \
    --provisioning-model=SPOT \
    --enable-nested-virtualization \
    --boot-disk-size=20GB \
    --image-project=fedora-coreos-cloud \
    --image-family=fedora-coreos-stable
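Before running a test suite, it is worth confirming that the nested virtualization flag actually took effect inside the guest. A minimal sketch (instance and zone names match the create command above; the exact probe and its messages are my own choice, not anything gcloud prescribes):

```shell
# SSH into the fresh instance and check whether the KVM device is
# present and writable, i.e. whether nested virtualization is usable.
gcloud compute ssh kvmtest-c4 --zone=us-central1-a --command='
    if [ -c /dev/kvm ] && [ -w /dev/kvm ]; then
        echo "nested KVM: usable"
    else
        echo "nested KVM: unavailable"
    fi'
```

The same inner probe works in any environment, which is also how one can quickly test whether a container platform exposes /dev/kvm.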

Running our standard cockpit-files test suite:

My laptop, a 2020 ThinkPad X1 (8 hyperthread CPUs, 16 GiB RAM):

  • make prepare-check: 1m5s
  • make check (4 VMs in parallel): 5m18s

GCE n2-standard-8 (4 CPUs, 32 GiB RAM):

  • make prepare-check: 1m22s
  • make check (2 VMs in parallel): 6m31s

GCE c4-highcpu-8 (8 CPUs, 16 GiB RAM):

  • make check (4 VMs in parallel): 2m52s

All of these ran successfully without any of the issues I had in 2022.

With the main cockpit repo:

Laptop:

  • test/image-prepare: 4m20s
  • test/common/run-tests --no-retry-fail --test-dir test/verify TestAccounts: 3m47s

GCE n2-standard-8:

  • test/image-prepare: 4m36s
  • test/common/run-tests --no-retry-fail --test-dir test/verify TestAccounts: 3m18s

(I didn’t test on the c4-highcpu instance)

So even with the slower and underpowered n2-standard-8 type, nested virtualization is now actually slightly faster than my laptop for the integration tests.

Cost Analysis

I used the GCE cost calculator to estimate what Cockpit’s CI would cost if we ran it on GCE. A real deployment needs a persistent 100 GB cache disk, but that’s negligible (around 20 USD/month). Assuming around 20 instances running on average (we need about 50 on work days, but far fewer on weekends and during European nights), we arrive at around 2200 USD/month.
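As a sanity check on that figure, here is the back-of-envelope arithmetic. The per-instance-hour spot price is a hypothetical value chosen to roughly match the calculator’s result, not a quoted GCE price:

```shell
instances=20          # average number of instances running
hours_per_month=730   # ~24 * 365 / 12
price_per_hour=0.15   # USD per instance-hour; assumed spot rate, not a quoted price
disk=20               # USD/month for the persistent 100 GB cache disk

# Total: instances * hours * hourly price, plus the cache disk.
awk -v i="$instances" -v h="$hours_per_month" -v p="$price_per_hour" -v d="$disk" \
    'BEGIN { printf "%.0f USD/month\n", i * h * p + d }'
# → prints "2210 USD/month"
```

This lands right around the 2200 USD/month from the calculator, so the estimate is internally consistent.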

Conclusion

Nested virtualization on Google Compute Engine has matured dramatically since 2022. This makes GCE a genuinely viable option for VM-heavy CI workloads.