title: Tame your CI with Error Budgets
author: Martin Pitt <<mpitt@redhat.com>>
email: mpitt@redhat.com
- \setbeameroption{show notes}
- \hypersetup{colorlinks=true}
# Applying Error Budgets to a software project
Product, not a service: `apt/dnf install cockpit`
Internal CI service: provider = customer
- Martin Pitt, lead of the Cockpit team at Red Hat
- Thanks Stef for the intro
- when I first heard about error budgets, they did not seem to apply: Cockpit is a software product, not a service
- but we do use web services internally, machines and OpenShift to run our tests; SLOs do apply to those
- internal service, we are our own provider and customer; tight feedback loop, no blame game
# Path to the Dark Side
plus API timeouts, noisy neighbors, random service crashes…
\centerline{\Large FEAR!}
- Had long phases of slowly deteriorating tests or unstable infra, got used to hitting the retry button until stuff passed
- Frustrating, hard to land stuff, afraid to touch code with known-unstable tests
- Retrying hides real-world problems: many failures are bugs in the tests themselves, but in a lot of cases they point to bugs in the product or its dependencies
- No systematic prevention of introducing new unstable tests
- Occasionally did a “clean up some mess” sprint; hard to know what to look at first
47 # "Always pass" is unrealistic
49 Integration test complexity for each PR push:
51 - 10 OS distros/versions
52 - 300 tests for each OS, in VMs/browser
53 - exercise > 100 OS APIs
- We had to realize that "always pass" is neither an attainable nor a good goal,
*even* if you ignore flaky infrastructure
- The numbers here give you some idea of how many moving parts are involved
- Add to that unreliable timing due to noisy cloud neighbors and OSes doing
stuff in the background
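To make this concrete, a back-of-the-envelope estimate; the 99.9% per-test pass rate is an assumed, illustrative number, not a measured one:

$$ 0.999^{10 \times 300} = 0.999^{3000} \approx e^{-3} \approx 5\% $$

Even with individually very reliable tests, 3000 test runs per push make a fully green run on the first attempt unlikely.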
- Our own product/tests
- Operating System bugs
- Have to distinguish between bugs in our product and tests (our control, fix), bugs in the OS (report, track, skip), and failures of our infra (retry justified and unavoidable)
# How to get back to happiness
- Define Error Budget "good enough" from top-level experience
- Translate to SLI/SLO
- Avoid introducing *new* unstable tests
- Gracefully handle *old* unstable tests and inevitable noise
- Become systematic and objective: define goals for what keeps us happy, and a budget for how much failure we are willing to accept
- Translate these into SLI/SLO, drill down into specifics
- Implement measuring and evaluation of SLIs
- most important aspect here: define a strategy for how to deal with test failures
[Goals and SLOs on our wiki](https://github.com/cockpit-project/cockpit/wiki/DevelopmentPrinciples#our-testsci-error-budget)
- PRs get validated in reasonable time
- Meaningful test failures
- Don't fear touching code
- Meeting with the team to discuss what keeps up our velocity and motivation
- PRs get test results reliably, and get validated in a reasonable time; this includes queue plus test run time
- Test failures are relevant and meaningful. Humans don’t waste time interpreting unstable test results to check whether they are "unrelated" or "relevant"
- We are not afraid of touching code
- Written down on our public wiki page, for some commitment
- Formulated SLOs to define what exactly we mean by these goals
# Service Level Objectives
One test reliability SLO:
> A merged PR became fully green with a 75% chance at the first attempt, and with a 95% chance after one retry
One infra reliability SLO:
> 95% of test runs spend no more than 5 minutes in the queue until they get assigned to a runner
- also on that wiki page are six SLOs; they define measurable properties with an
objective that implements aspects of our goals
- one example for test reliability, one for infrastructure reliability
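To illustrate the first SLO, a minimal sketch, not the actual bots code; the input format is an assumption: a list where each entry is how many attempts a merged PR needed to become fully green.

```python
def retry_slo_met(attempts_per_pr):
    """Check the 75%/95% objective from per-PR attempt counts (1 = green on first try)."""
    total = len(attempts_per_pr)
    first_try = sum(1 for a in attempts_per_pr if a == 1)
    within_one_retry = sum(1 for a in attempts_per_pr if a <= 2)
    return first_try / total >= 0.75 and within_one_retry / total >= 0.95
```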
# Implementation of SLIs
GitHub `/statuses` API for submitting PR:
"description": "Not yet tested",
"context": "debian-stable",
"created_at": "2022-01-10T11:08:05Z"
- Fortunately, almost all of the required data can be derived from the GitHub statuses API
- This is the initial status when submitting a PR
Picked up by a worker:
"description": "Testing in progress [4-ci-srv-05]",
"target_url": "https://logs.cockpit-project.org/...",
"context": "debian-stable",
"created_at": "2022-01-10T11:08:30Z"
- once a bot picks up the pending test request, it changes the status to "in progress"
- the time delta between the created_at fields gives you the time the run spent in the queue; that is the SLI for the second SLO I mentioned
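A minimal sketch of how that delta could be computed from the public statuses API (plain `requests`, pagination ignored; not the actual bots implementation):

```python
from datetime import datetime

import requests  # assumption: querying the public GitHub API directly


def queue_time(repo, sha, context, token):
    """Seconds between 'Not yet tested' and 'Testing in progress' for one context."""
    r = requests.get(
        f"https://api.github.com/repos/{repo}/commits/{sha}/statuses",
        headers={"Authorization": f"token {token}"},
    )
    r.raise_for_status()
    # the API returns the whole status history, newest first
    events = sorted((s for s in r.json() if s["context"] == context),
                    key=lambda s: s["created_at"])

    def first(prefix):
        created = next(s["created_at"] for s in events
                       if s["description"].startswith(prefix))
        return datetime.fromisoformat(created.replace("Z", "+00:00"))

    return (first("Testing in progress") - first("Not yet tested")).total_seconds()
```

For example, `queue_time("cockpit-project/cockpit", "<sha>", "debian-stable", token)` would return the queue wait for the Debian run shown above.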
171 "description": "Tests failed with code 3",
172 "target_url": "https://logs...",
173 "context": "debian-stable",
174 "created_at": "2022-01-10T11:47:54Z"
178 [store-tests script](https://github.com/cockpit-project/bots/blob/main/store-tests)
180 [export in Prometheus format](https://logs-https-frontdoor.apps.ocp.ci.centos.org/prometheus)
- once a test finishes, the state changes to success or failure
- the statuses API remembers the *whole* history; if a failure goes back to
"in progress" and eventually "success", it was a retry
- we have a store-tests script which reads and interprets this history for a
merged PR and puts it into an SQLite database; link is on the slide
- regularly do SQL queries to calculate the current SLIs, export them in
Prometheus text format, and let a Prometheus instance pick them up to store
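Roughly what that evaluation step could look like; the table name and columns here are assumptions for illustration, the real schema lives in the store-tests script linked above:

```python
import sqlite3


def export_first_try_ratio(db_path="test-results.db"):
    """Emit one SLI in Prometheus text exposition format."""
    db = sqlite3.connect(db_path)
    # assumed schema: runs(pr INTEGER, retries INTEGER, merged INTEGER)
    total, first_try = db.execute(
        "SELECT COUNT(*), SUM(retries = 0) FROM runs WHERE merged = 1").fetchone()
    db.close()
    return ("# TYPE ci_merged_pr_green_first_try_ratio gauge\n"
            f"ci_merged_pr_green_first_try_ratio {first_try / total:.3f}\n")
```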
[Grafana SLOs](https://grafana-frontdoor.apps.ocp.ci.centos.org/d/ci/cockpit-ci?orgId=1)
- We have a Grafana instance which graphs all SLIs/SLOs; you can move around in
time and investigate problem spots more closely
- Red bars show the SLO, i.e. where the indicator exceeds the expectation and
starts to eat into the error budget
- Interesting real-time data, but not a sufficient view of how much of our error budget we used up in the last month
# Error Budget in Grafana
[Grafana Error Budget](https://grafana-frontdoor.apps.ocp.ci.centos.org/d/budget/cockpit-ci-error-budget)
- For that, we have another set of graphs which shows the error budget usage of
the last 30 days; again, the link is on the slide
- This is the budget for our first SLO, about merging a PR with retried tests
- We are still good as per our own goal, but will most likely exhaust the
budget in the next few days, so we need to take action soon
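The arithmetic behind such a budget panel is simple; a sketch with hypothetical numbers, where the 25% allowance follows from the 75% first-attempt SLO:

```python
def budget_used(merged_prs, prs_needing_retry, allowed_fraction=0.25):
    """Fraction of the 30-day error budget consumed; > 1.0 means the budget is blown."""
    return (prs_needing_retry / merged_prs) / allowed_fraction


# hypothetical example: 35 of 120 merged PRs needed a retry
# -> 0.29 / 0.25 ~= 1.17, i.e. the budget is already exhausted
print(budget_used(120, 35))
```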
- The budget for the other mentioned SLO is the queue time; it's fine normally,
but it exploded when the Westford data center went down. This hosts our
main workload, and is the only place which can run Red Hat internal tests.
- Normally we manually spin up a fallback in EC2, but this happened right at the
start of the EOY holidays, so nobody cared much -- the pending PRs were mostly
just automated housekeeping and not urgent
3x auto-retry unaffected tests:
- the goal “don’t retry PRs too often” is an emergent result of the hundreds of individual tests that run on each PR
- As explained before, we can't expect 100% success due to random noise, so we introduced the concept of an "affected test": if a PR changes the code which a test covers, or the test itself, that test is affected
- Auto-retry unaffected tests up to 3x; that's the carrot, and it made our lives dramatically better
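A minimal sketch of that policy; the coverage mapping and file layout are assumptions for illustration, not the real bots logic:

```python
MAX_RETRIES = 3


def is_affected(test_name, changed_files, coverage_map):
    """A test is affected if the PR touches the test itself or code it covers."""
    # coverage_map is an assumed dict: test name -> set of source files it covers
    touches_test = any(test_name in path for path in changed_files
                       if path.startswith("test/"))
    touches_covered_code = bool(coverage_map.get(test_name, set()) & set(changed_files))
    return touches_test or touches_covered_code


def should_retry(test_name, failures_so_far, changed_files, coverage_map):
    """Only unaffected tests get the benefit of the doubt, at most three times."""
    return (not is_affected(test_name, changed_files, coverage_map)
            and failures_so_far < MAX_RETRIES)
```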
3x pass affected tests:
- retrying alone is not sufficient: if we keep introducing new flaky tests, overall quality deteriorates, and soon enough not even 3 retries will be enough
- The stick: affected tests must pass 3x; this prevents introducing broken tests
Track tests which fail in more than 10% of cases:
- track tests which fail too often; there is some base failure rate of a few percent, but that random noise should distribute evenly across all tests
- The ones which fail in more than 10% of cases are the ones breaking PRs even with auto-retry; we need to investigate/fix these
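A minimal sketch of that bookkeeping, assuming a hypothetical list of (test name, passed) result tuples rather than the real database:

```python
from collections import Counter


def flaky_tests(results, threshold=0.10):
    """Return test names whose failure rate exceeds the threshold."""
    runs, failures = Counter(), Counter()
    for name, passed in results:
        runs[name] += 1
        if not passed:
            failures[name] += 1
    return sorted(name for name in runs if failures[name] / runs[name] > threshold)
```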
- Automatic notifications
- Regular SLO review/adjustment
- Automatic infra fallback?
- Error budget team mode
- Pretty happy with this overall; in the latest team poll, everyone said that they no longer feel blocked by or scared of PRs/tests, and productivity and turnaround time are fine
- main missing thing: add notification/escalation from Grafana once error budgets are shrinking and get too close to the limit, or even exceed it
- Regularly review and adjust the SLOs to our current feeling of happiness; goals might need to get tighter, or possibly also relaxed; if an SLO is too strict and violated but nobody cares, don't spend time on it
- Maybe find and set up an automatic fallback if our main data center fails; it's a single Ansible playbook, but the fallback is expensive, so it's okay to leave this under human control; an outage ruins the stats, but is not actually a pain point
- A more formal process for going into error budget fixing mode
- `#cockpit` on libera.chat
- [https://cockpit-project.org](https://cockpit-project.org)
- Home page leads to mailing lists, documentation
- thanks for your attention; Q+A