Add photo

diff --git a/cockpit-error-budget.md b/cockpit-error-budget.md

index 3189ec6fc6666e138f2c6c14939a1cb10ab125ce..9b306e1c4da50b7abfa3bc1996bcdb8f8e828180 100644
--- a/cockpit-error-budget.md
+++ b/cockpit-error-budget.md
@@ -3,10 +3,11 @@ title: Tame your CI with Error Budgets
  subtitle:
  author: Martin Pitt <<mpitt@redhat.com>>
  email: mpitt@redhat.com
-date: DevConv.CZ 2022
+date: DevConf.CZ 2022 `\\ \vspace{2em} \includegraphics[width=80px]{./martinpitt.png}`{=latex}
  theme: Singapore
  header-includes:
   - \setbeameroption{show notes}
+ - \hypersetup{colorlinks=true}
  ...
  
  # Applying Error Budgets to a software project
@@ -17,12 +18,268 @@ Internal CI service: provider = customer
  
  :::notes
  - Martin Pitt, lead Cockpit team at Red Hat
-- at first sight, error budgets did not apply: cockpit is a product, not a service; thought about this
-- we do use web services internally, machines and OpenShift to run our CI
-- abstracting, running our tests and their results is a kind of service
+- Thanks Stef for intro
+- when I first heard of them, error budgets did not seem to apply: Cockpit is a software product, not a service
+- but we do use web services internally (machines and OpenShift) to run our tests, and SLOs do apply to those
  - internal service, we are our own provider and customer; tight feedback loop, no blame game
  :::
  
+# Path to the Dark Side
+
+Unstable tests…
+
+plus API timeouts, noisy neighbors, random service crashes…
+
+\qquad ![retry button](./retry-button.png){width=100} \qquad \qquad ![exploding head](./exploding-head.png){width=100}\
+
+\vspace{2em}
+
+\centerline{\Large FEAR!}
+
+:::notes
+- Had long phases of slowly deteriorating tests or unstable infra, got used to hitting the retry button until stuff passed
+- Frustrating, hard to land stuff, afraid to touch code with known-unstable tests
+- Hides real-world problems: many failures are bugs in the tests themselves, but in a lot of cases they reveal bugs in the product or its dependencies
+- No systematic prevention of introducing new unstable tests
+- Occasionally did a “clean up some mess” sprint; hard to know what to look at first
+:::
+
+# "Always pass" is unrealistic
+
+Integration test complexity for each PR push:
+
+- 10 OS distros/versions
+- 300 tests for each OS, in VMs/browser
+- exercise > 100 OS APIs
+
+:::notes
+- We had to realize that "always pass" is neither an attainable nor a good goal,
+    *even* if you ignore flaky infrastructure
+- The numbers here give you some idea about how many moving parts are involved
+  for testing a PR
+- Add to that unreliable timing due to noisy cloud neighbors, OSes doing
+  stuff in the background
+:::
+
+. . .
+
+Failure classes:
+
+- Our own product/tests
+- Operating System bugs
+- Infrastructure
+
+:::notes
+- Have to distinguish between bugs in our product and tests (under our control: fix), bugs in the OS (report, track, skip), and failures of our infra (retry is justified and unavoidable)
+:::
+
+
+# How to get back to happiness
+
+- Define Error Budget "good enough" from top-level experience
+- Translate to SLI/SLO
+- Avoid introducing *new* unstable tests
+- Gracefully handle *old* unstable tests and inevitable noise
+
+:::notes
+- Become systematic and objective: Define goals for what keeps us happy, and a budget for how much failure we are
+  ready to tolerate
+- Translate these into SLI/SLO, drill down into specifics
+- Implement measuring and evaluation of SLIs
+- Most important aspect here: Define a strategy for how to deal with test failures
+  sensibly
+:::
+
+# What keeps us happy
+
+[Goals and SLOs on our wiki](https://github.com/cockpit-project/cockpit/wiki/DevelopmentPrinciples#our-testsci-error-budget)
+
+- PRs get validated in reasonable time
+- Meaningful test failures
+- Don't fear touching code
+
+:::notes
+- Met with the team to discuss what keeps up our velocity and motivation
+- PRs get test results reliably, and get validated in a reasonable time; includes queue + test run time
+- Test failures are relevant and meaningful. Humans don’t waste time on interpreting unstable test results to check if "unrelated" or "relevant"
+- We are not afraid of touching code
+- Written down on our public wiki page, for some commitment
+- Formulated SLOs to define what exactly we mean by these goals
+:::
+
+# Service Level Objectives
+
+One test reliability SLO:
+
+> A merged PR became fully green with a 75% chance at the first attempt, and with a 95% chance after one retry
+
+One infra reliability SLO:
+
+> 95% of test runs spend no more than 5 minutes in the queue until they get assigned to a runner
+
+:::notes
+- also on that wiki page are six SLOs; they define measurable properties with an
+  objective that implements aspects of our goals
+- one example for test reliability, one for infrastructure reliability
+:::
+
+# Implementation of SLIs
+
+GitHub `/statuses` API after submitting a PR:
+
+```json
+0: {
+  "state": "pending",
+  "description": "Not yet tested",
+  "context": "debian-stable",
+  "created_at": "2022-01-10T11:08:05Z"
+}
+```
+
+:::notes
+- Fortunately, almost all of the required data can be derived from the GitHub statuses API
+- This is the initial status when submitting a PR
+:::
+
+
+----
+
+Picked up by a worker:
+
+```json
+1: {
+  "state": "pending",
+  "description": "Testing in progress [4-ci-srv-05]",
+  "target_url": "https://logs.cockpit-project.org/...",
+  "context": "debian-stable",
+  "created_at": "2022-01-10T11:08:30Z"
+}
+```
+
+:::notes
+- once a bot picks up the pending test request, it will change it to "in progress"
+- the delta between the two created_at timestamps is the time the request spent in the queue; that feeds the queue time SLO I mentioned
+:::
+
+----
+
+Failure:
+
+```json
+2: {
+  "state": "failure",
+  "description": "Tests failed with code 3",
+  "target_url": "https://logs...",
+  "context": "debian-stable",
+  "created_at": "2022-01-10T11:47:54Z"
+}
+```
+
+[store-tests script](https://github.com/cockpit-project/bots/blob/main/store-tests)
+
+[export in Prometheus format](https://logs-https-frontdoor.apps.ocp.ci.centos.org/prometheus)
+
+:::notes
+- once test finishes, state changes to success or failure
+- statuses API remembers the *whole* history; if a failure goes back to
+    "in progress" and eventually "success", it was a retry
+- we have a store-tests script which reads and interprets this history for a
+    merged PR and puts it into an SQLite database; link is on the slide
+- regularly do SQL queries to calculate the current SLIs, export them in
+    Prometheus text format, and let a Prometheus instance pick them up to store
+    the whole history
+:::
+
+# SLOs in Grafana
+
+[Grafana SLOs](https://grafana-frontdoor.apps.ocp.ci.centos.org/d/ci/cockpit-ci?orgId=1)\
+
+![Grafana SLOs screenshot](./grafana-SLOs.png)\
+
+:::notes
+- We have a Grafana instance which graphs all SLIs/SLOs; you can move around in
+    time and investigate problem spots more closely
+- Link on the slide
+- Red bars show the SLO, i.e. where the indicator exceeds the expectation and
+    starts to eat into error budget
+- Interesting real-time data, but not sufficient to see how much of our error budget we used up in the last month
+:::
+
+# Error Budget in Grafana
+
+[Grafana Error Budget](https://grafana-frontdoor.apps.ocp.ci.centos.org/d/budget/cockpit-ci-error-budget)
+
+![Screenshot of Grafana PR retry budget](./grafana-retry-merge-budget.png)\
+
+:::notes
+- For that, we have another set of graphs which shows the error budget usage of
+    the last 30 days; again, link on the slide
+- This is the budget for our first SLO about merging a PR with retried tests
+- We are still good as per our own goal, but will most likely exhaust the
+  budget in the next days, so need to take action soon
+:::
+
+----
+
+![Screenshot of Grafana queue time budget](./grafana-queue-time-budget.png)\
+
+:::notes
+- The budget for the other mentioned SLO covers the queue time; it's fine normally,
+    but it exploded when the Westford data center went down. That data center hosts our
+    main workload, and is the only place which can run Red Hat internal tests.
+- Normally we manually spin up a fallback in EC2, but it happened right at the
+  start of the EOY holidays, so nobody cared much: the pending PRs were mostly
+  automated housekeeping which was not urgent
+:::
+
+# Test reliability
+
+3x auto-retry unaffected tests: ![carrot](./carrott.png){height=20}\
+
+\vfill
+
+:::notes
+- goal “don’t retry PRs too often” is an emergent result of the hundreds of individual tests that run on each PR
+- As explained before, we can't expect 100% success due to random noise, so we introduced the concept of an "affected test": if a PR changes the code which a test covers, or the test itself, that test is affected
+- Auto-retry unaffected tests up to 3x; that's the carrot, and it made our lives dramatically better
+:::
+
+. . .
+
+3x pass affected tests: ![stick](./stick.png){height=10}\
+
+\vfill
+
+:::notes
+- That alone is not sufficient: new flaky tests get introduced, overall quality deteriorates, and soon enough not even 3 retries will be enough
+- Stick: affected tests must pass 3x; this prevents introducing broken tests
+:::
+
+. . .
+
+Track tests which fail in more than 10% of cases:
+
+\qquad \quad ![Screenshot of Grafana unstable tests](./grafana-unstable-tests.png){height=30%}\
+
+:::notes
+- track tests which fail too often; there is some base failure rate of a few %, but that random noise should be distributed evenly across all tests
+- The ones which fail more than 10% of the time are the ones breaking PRs even with auto-retry; need to investigate/fix these
+:::
+
+# Next steps
+
+- Automatic notifications
+- Regular SLO review/adjustment
+- Automatic infra fallback?
+- Error budget team mode
+
+:::notes
+- Pretty happy with this overall; in the latest poll of the team, everyone said that they no longer feel blocked by or scared of PRs/tests, and that productivity and turnaround time are fine
+- main missing thing: Add notification/escalation from Grafana once an error budget gets too close to its limit, or even exceeds it
+- Regularly review and adjust the SLOs to our current feeling of happiness; goals might need to get tighter, or possibly also relaxed; if an SLO is too strict, violated, but nobody cares, don't spend time on it
+- Maybe find and set up an automatic fallback if our main data center fails; spinning it up is a single Ansible playbook, and the fallback is expensive, so it is ok to leave this under human control; an outage will ruin stats, but is not actually a pain point
+- More formal process for going into error budget fixing mode
+:::
  
  
  # Q & A
@@ -30,7 +287,7 @@ Internal CI service: provider = customer
  Contact:
  
  - `#cockpit` on libera.chat
-- https://cockpit-project.org
+- [https://cockpit-project.org](https://cockpit-project.org)
  
  :::notes
  - Home page leads to mailing lists, documentation
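
Review note on the SLI slides in this patch: the notes say that the queue time is the delta between `created_at` timestamps, and that a failure which goes back to "in progress" counts as a retry. The following is a minimal Python sketch of that interpretation, not the actual `store-tests` implementation. It assumes the status history of one context is available as a list of dicts shaped like the JSON on the slides. The function names `queue_seconds` and `count_retries` are hypothetical, as are the last two statuses in the sample data (the retry and the "Tests passed" description).

```python
from datetime import datetime, timezone


def parse_time(stamp):
    """Parse a UTC timestamp in the format shown on the slides."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def chronological(history):
    """Return the statuses oldest first, whatever order they arrived in."""
    return sorted(history, key=lambda status: parse_time(status["created_at"]))


def in_progress(status):
    """True for a 'picked up by a worker' status as shown on the slides."""
    return (status["state"] == "pending"
            and status["description"].startswith("Testing in progress"))


def queue_seconds(history):
    """Seconds between the first test request and the first pickup."""
    requested = None
    for status in chronological(history):
        if status["state"] != "pending":
            continue
        if in_progress(status):
            if requested is not None:
                return (parse_time(status["created_at"]) - requested).total_seconds()
        elif requested is None:
            requested = parse_time(status["created_at"])
    return None  # never picked up (or never requested)


def count_retries(history):
    """Count how often a failure went back to 'Testing in progress'."""
    retries = 0
    failed = False
    for status in chronological(history):
        if status["state"] == "failure":
            failed = True
        elif failed and in_progress(status):
            retries += 1
            failed = False
    return retries


# First three entries are taken from the slides (target_url omitted);
# the last two are hypothetical and only illustrate a retry.
history = [
    {"state": "pending", "description": "Not yet tested",
     "context": "debian-stable", "created_at": "2022-01-10T11:08:05Z"},
    {"state": "pending", "description": "Testing in progress [4-ci-srv-05]",
     "context": "debian-stable", "created_at": "2022-01-10T11:08:30Z"},
    {"state": "failure", "description": "Tests failed with code 3",
     "context": "debian-stable", "created_at": "2022-01-10T11:47:54Z"},
    {"state": "pending", "description": "Testing in progress [4-ci-srv-05]",
     "context": "debian-stable", "created_at": "2022-01-10T11:50:00Z"},
    {"state": "success", "description": "Tests passed",
     "context": "debian-stable", "created_at": "2022-01-10T12:30:00Z"},
]

print(queue_seconds(history))   # 25.0
print(count_retries(history))   # 1
```

Sorting by `created_at` keeps the sketch independent of the order in which the statuses are delivered. Aggregating these per-context values over merged PRs would then yield the SLIs that the notes describe as SQL queries.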