---
title: Tame your CI with Error Budgets
subtitle:
author: Martin Pitt <>
email: mpitt@redhat.com
date: DevConf.CZ 2022 `\\ \vspace{2em} \includegraphics[width=80px]{./martinpitt.png}`{=latex}
theme: Singapore
header-includes:
  - \setbeameroption{show notes}
  - \hypersetup{colorlinks=true}
...

# Applying Error Budgets to a software project

Internal CI service: provider = customer

:::notes
- Martin Pitt, lead Cockpit team at Red Hat
- Thanks Stef for the intro
- when I first heard about error budgets, they did not seem to apply: Cockpit is a software product, not a service
- we do use web services internally, machines and OpenShift to run our tests; SLOs do apply
- internal service, we are our own provider and customer; tight feedback loop, no blame game
:::

# Path to the Dark Side

Unstable tests…

plus API timeouts, noisy neighbors, random service crashes…

\qquad ![retry button](./retry-button.png){width=100} \qquad \qquad ![exploding head](./exploding-head.png){width=100}\

\vspace{2em}

\centerline{\Large FEAR!}

:::notes
- Had long phases of slowly deteriorating tests or unstable infra, got used to hitting the retry button until stuff passed
- Frustrating, hard to land stuff, afraid to touch code with known-unstable tests
- Hides real-world problems: many bugs are in the tests themselves, but in a lot of cases they show bugs in the product or dependencies
- No systematic prevention of introducing new unstable tests
- Occasionally did a “clean up some mess” sprint; hard to know what to look at first
:::

# "Always pass" is unrealistic

Integration test complexity for each PR push:

- 10 OS distros/versions
- 300 tests for each OS, in VMs/browser
- exercise > 100 OS APIs

:::notes
- We had to realize that "always pass" is neither an attainable nor a good goal,
  *even* if you ignore flaky infrastructure
- The numbers here give you some idea of how many moving parts are involved
  in testing a PR
- Add to that unreliable timing due to noisy cloud neighbors and OSes doing
  stuff in the background
:::

. . .

Failure classes:

- Our own product/tests
- Operating System bugs
- Infrastructure

:::notes
- Have to distinguish between bugs in our product and tests (under our control, fix them), bugs in the OS (report, track, skip), and failures of our infra (retry justified and unavoidable)
:::


# How to get back to happiness

- Define Error Budget "good enough" from top-level experience
- Translate to SLI/SLO
- Avoid introducing *new* unstable tests
- Gracefully handle *old* unstable tests and inevitable noise

:::notes
- Become systematic and objective: define goals for what keeps us happy, and a budget for how much failure we are
  ready to tolerate
- Translate these into SLI/SLO, drill down into specifics
- Implement measuring and evaluation of SLIs
- most important aspect here: define a strategy for how to deal with test failures
  sensibly
:::

# What keeps us happy

[Goals and SLOs on our wiki](https://github.com/cockpit-project/cockpit/wiki/DevelopmentPrinciples#our-testsci-error-budget)

- PRs get validated in reasonable time
- Meaningful test failures
- Don't fear touching code

:::notes
- Met with the team to discuss what keeps up our velocity and motivation
- PRs get test results reliably, and get validated in a reasonable time; includes queue + test run time
- Test failures are relevant and meaningful. Humans don’t waste time interpreting unstable test results to decide whether a failure is "unrelated" or "relevant"
- We are not afraid of touching code
- Written down on our public wiki page, as a commitment
- Formulated SLOs to define what exactly we mean by these goals
:::

# Service Level Objectives

One test reliability SLO:

> A merged PR became fully green with a 75% chance at the first attempt, and with a 95% chance after one retry

One infra reliability SLO:

> 95% of test runs spend no more than 5 minutes in the queue until they get assigned to a runner

:::notes
- also on that wiki page are six SLOs; they define measurable properties with an
  objective that implements aspects of our goals
- one example for test reliability, one for infrastructure reliability
:::

# Implementation of SLIs

GitHub `/statuses` API after submitting a PR:

```json
0: {
  "state": "pending",
  "description": "Not yet tested",
  "context": "debian-stable",
  "created_at": "2022-01-10T11:08:05Z"
}
```

:::notes
- Fortunately, almost all of the required data can be derived from the GitHub statuses API
- This is the initial status when submitting a PR
:::


----

Picked up by a worker:

```json
1: {
  "state": "pending",
  "description": "Testing in progress [4-ci-srv-05]",
  "target_url": "https://logs.cockpit-project.org/...",
  "context": "debian-stable",
  "created_at": "2022-01-10T11:08:30Z"
}
```

:::notes
- once a bot picks up the pending test request, it changes the status to "in progress"
- the time delta in created_at gives you the time the request spent in the queue, i.e. the indicator for the second SLO I mentioned
:::

----

Failure:

```json
2: {
  "state": "failure",
  "description": "Tests failed with code 3",
  "target_url": "https://logs...",
  "context": "debian-stable",
  "created_at": "2022-01-10T11:47:54Z"
}
```

[store-tests script](https://github.com/cockpit-project/bots/blob/main/store-tests)

[export in Prometheus format](https://logs-https-frontdoor.apps.ocp.ci.centos.org/prometheus)

:::notes
- once a test finishes, the state changes to success or failure
- the statuses API remembers the *whole* history; if a failure goes back to
  "in progress" and eventually "success", it was a retry
- we have a store-tests script which reads and interprets this history for a
  merged PR and puts it into an SQLite database; link is on the slide
- we regularly run SQL queries to calculate the current SLIs, export them in
  Prometheus text format, and let a Prometheus instance pick them up to store
  the whole history
:::
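
----

A minimal sketch of how such a status history can be turned into the two indicators. This is illustrative only: the function names and parsing details here are made up for the slide, the real logic lives in the linked store-tests script.

```python
from datetime import datetime, timezone

# Sketch only: assumes `statuses` is the status history of one PR head commit,
# already filtered to a single context (e.g. "debian-stable").

def parse(ts: str) -> datetime:
    # GitHub timestamps look like "2022-01-10T11:08:05Z"
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def queue_seconds(statuses: list) -> float:
    # queue-time SLI: delta between "Not yet tested" and the first "Testing in progress"
    history = sorted(statuses, key=lambda s: s["created_at"])  # ISO timestamps sort lexically
    pending = next(s for s in history if s["description"] == "Not yet tested")
    started = next(s for s in history if s["description"].startswith("Testing in progress"))
    return (parse(started["created_at"]) - parse(pending["created_at"])).total_seconds()

def attempts(statuses: list) -> int:
    # retry SLI: every "Testing in progress" entry in the history is one attempt
    return sum(s["description"].startswith("Testing in progress") for s in statuses)
```

:::notes
- not the actual bots code, just a sketch of the idea: the full per-context status history is enough to compute both the queue time and the number of retries
:::
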

# SLOs in Grafana

[Grafana SLOs](https://grafana-frontdoor.apps.ocp.ci.centos.org/d/ci/cockpit-ci?orgId=1)\

![Grafana SLOs screenshot](./grafana-SLOs.png)\

:::notes
- We have a Grafana instance which graphs all SLIs/SLOs; you can move around in
  time and investigate problem spots more closely
- Link on the slide
- Red bars show the SLO, i.e. where the indicator exceeds the expectation and
  starts to eat into the error budget
- Interesting real-time data, but not a sufficient view of how much of our error budget we used up in the last month
:::

# Error Budget in Grafana

[Grafana Error Budget](https://grafana-frontdoor.apps.ocp.ci.centos.org/d/budget/cockpit-ci-error-budget)

![Screenshot of Grafana PR retry budget](./grafana-retry-merge-budget.png)\

:::notes
- For that, we have another set of graphs which shows the error budget usage of
  the last 30 days; again, link on the slide
- This is the budget for our first SLO about merging a PR with retried tests
- We are still good as per our own goal, but will most likely exhaust the
  budget in the next few days, so we need to take action soon
:::

----

![Screenshot of Grafana queue time budget](./grafana-queue-time-budget.png)\

:::notes
- The budget for the other mentioned SLO is the queue time; it's fine normally,
  but it exploded when the Westford data center went down. It hosts our
  main workload and is the only place which can run Red Hat internal tests.
- Normally we manually spin up a fallback in EC2, but this happened right at the
  start of the EOY holidays, so nobody cared much; the pending PRs were mostly
  just automated housekeeping and not urgent
:::

# Test reliability

3x auto-retry unaffected tests: ![carrot](./carrott.png){height=20}\

\vfill

:::notes
- the goal “don’t retry PRs too often” is an emergent result of the hundreds of individual tests that run on each PR
- As explained before, we can't expect 100% success due to random noise, so we introduce the concept of an "affected test": if a PR changes the code which a test covers, or the test itself, that test is affected
- Auto-retry unaffected tests up to 3x; that's the carrot, and it made our lives dramatically better
:::

. . .

3x pass affected tests: ![stick](./stick.png){height=10}\

\vfill

:::notes
- not sufficient on its own: new flaky tests still get introduced, overall quality deteriorates, and soon enough not even 3 retries will be enough
- Stick: affected tests must pass 3x; this prevents introducing broken tests
:::

. . .

Track tests which fail in more than 10% of cases:

\qquad \quad ![Screenshot of Grafana unstable tests](./grafana-unstable-tests.png){height=30%}\

:::notes
- track tests which fail too often; there is some base failure rate of a few percent, but that random noise should be distributed evenly across all tests
- The ones which fail more than 10% of the time are the ones breaking PRs even with auto-retry; we need to investigate/fix these
- the next slide summarizes the carrot/stick retry policy as a small sketch
:::
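
----

The carrot/stick rule from the previous slide as a small sketch (names and structure are illustrative, not the actual bots implementation):

```python
MAX_ATTEMPTS = 3  # auto-retry budget for tests the PR does not touch

def should_retry(test: str, attempt: int, affected: set) -> bool:
    # affected tests (the PR changed the covered code or the test itself)
    # get no free retries; they have to pass 3x instead (the "stick")
    if test in affected:
        return False
    # unaffected failures are assumed to be noise and are retried
    # automatically, up to MAX_ATTEMPTS attempts (the "carrot")
    return attempt < MAX_ATTEMPTS
```

:::notes
- just a summary of the retry policy in one place; the real decision is made by the bots
:::
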

# Next steps

- Automatic notifications
- Regular SLO review/adjustment
- Automatic infra fallback?
- Error budget team mode

:::notes
- Pretty happy with this overall; in the latest team poll, everyone said that they no longer feel blocked by or scared of PRs/tests, and productivity and turnaround time are fine
- main missing thing: add notification/escalation from Grafana once error budgets are shrinking and get too close to the limit, or even exceed it
- Regularly review and adjust the SLOs to our current feeling of happiness; goals might need to get tighter, or possibly also relaxed; if an SLO is too strict and violated, but nobody cares, don't spend time on it
- Maybe find and set up an automatic fallback if our main data center fails; it's a single Ansible playbook, and the fallback is expensive, so it is OK to leave this under human control; it will ruin the stats, but that is not actually a pain point
- More formal process for going into error budget fixing mode
:::

# Q & A

Contact:

- `#cockpit` on libera.chat
- [https://cockpit-project.org](https://cockpit-project.org)

:::notes
- Home page leads to mailing lists, documentation