title: Tame your CI with Error Budgets
author: Martin Pitt <<mpitt@redhat.com>>
email: mpitt@redhat.com
- \setbeameroption{show notes}
- \hypersetup{colorlinks=true}
# Applying Error Budgets to a software project
Product, not a service: `apt/dnf install cockpit`
Internal CI service: provider = customer
- Martin Pitt, lead of the Cockpit team at Red Hat
- Thanks Stef for the intro
- when I first heard about error budgets, they did not seem to apply: Cockpit is a software product, not a service
- but we do use web services internally, machines and OpenShift to run our tests; SLOs do apply to those
- internal service, we are our own provider and customer; tight feedback loop, no blame game
# Path to the Dark Side
plus API timeouts, noisy neighbors, random service crashes…
\centerline{\Large FEAR!}
- Had long phases of slowly deteriorating tests or unstable infra, got used to hitting the retry button until stuff passed
- Frustrating, hard to land stuff, afraid to touch code with known-unstable tests
- Retrying hides real-world problems: many failures are bugs in the tests themselves, but in a lot of cases they point to bugs in the product or its dependencies
- No systematic prevention of introducing new unstable tests
- Occasionally did a “clean up some mess” sprint; hard to know what to look at first
47 # "Always pass" is unrealistic
49 Integration test complexity for each PR push:
51 - 10 OS distros/versions
52 - 300 tests for each OS, in VMs/browser
53 - exercise > 100 OS APIs
- We had to realize that "always pass" is neither an attainable nor a good goal,
*even* if you ignore flaky infrastructure
- The numbers here give you some idea of how many moving parts are involved
- Add to that unreliable timing due to noisy cloud neighbors and OSes doing
stuff in the background
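To make this concrete, a back-of-the-envelope estimate; the 99.9% per-test pass rate is an assumed, illustrative number, not a measured one:

$$ 0.999^{10 \times 300} = 0.999^{3000} \approx e^{-3} \approx 5\% $$

Even with individually very reliable tests, 3000 test runs per push make a fully green run on the first attempt unlikely.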
- Our own product/tests
- Operating System bugs
- Have to distinguish between bugs in our product and tests (our control, fix), bugs in the OS (report, track, skip), and failures of our infra (retry justified and unavoidable)
# How to get back to happiness
- Define Error Budget "good enough" from top-level experience
- Translate to SLI/SLO
- Avoid introducing *new* unstable tests
- Gracefully handle *old* unstable tests and inevitable noise
- Become systematic and objective: define goals for what keeps us happy, and a budget for how much failure we are willing to accept
- Translate these into SLI/SLO, drill down into specifics
- Implement measuring and evaluation of SLIs
- most important aspect here: define a strategy for how to deal with test failures
[Goals and SLOs on our wiki](https://github.com/cockpit-project/cockpit/wiki/DevelopmentPrinciples#our-testsci-error-budget)
- PRs get validated in reasonable time
- Meaningful test failures
- Don't fear touching code
- Meeting with the team to discuss what keeps up our velocity and motivation
- PRs get test results reliably, and get validated in a reasonable time; this includes queue plus test run time
- Test failures are relevant and meaningful. Humans don’t waste time interpreting unstable test results to check whether they are "unrelated" or "relevant"
- We are not afraid of touching code
- Written down on our public wiki page, for some commitment
- Formulated SLOs to define what exactly we mean by these goals
# Service Level Objectives
One test reliability SLO:
> A merged PR became fully green with a 75% chance at the first attempt, and with a 95% chance after one retry
One infra reliability SLO:
> 95% of test runs spend no more than 5 minutes in the queue until they get assigned to a runner
- also on that wiki page are six SLOs; they define measurable properties with an
objective that implements aspects of our goals
- one example for test reliability, one for infrastructure reliability
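To illustrate the first SLO, a minimal sketch, not the actual bots code; the input format is an assumption: a list where each entry is how many attempts a merged PR needed to become fully green.

```python
def retry_slo_met(attempts_per_pr):
    """Check the 75%/95% objective from per-PR attempt counts (1 = green on first try)."""
    total = len(attempts_per_pr)
    first_try = sum(1 for a in attempts_per_pr if a == 1)
    within_one_retry = sum(1 for a in attempts_per_pr if a <= 2)
    return first_try / total >= 0.75 and within_one_retry / total >= 0.95
```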
# Implementation of SLIs
GitHub `/statuses` API for submitting PR:
"description": "Not yet tested",
"context": "debian-stable",
"created_at": "2022-01-10T11:08:05Z"
- Fortunately, almost all of the required data can be derived from the GitHub statuses API
- This is the initial status when submitting a PR
Picked up by a worker:
"description": "Testing in progress [4-ci-srv-05]",
"target_url": "https://logs.cockpit-project.org/...",
"context": "debian-stable",
"created_at": "2022-01-10T11:08:30Z"
- once a bot picks up the pending test request, it changes the status to "in progress"
- the time delta between the created_at fields gives you the time the run spent in the queue; that is the SLI for the second SLO I mentioned
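A minimal sketch of how that delta could be computed from the public statuses API (plain `requests`, pagination ignored; not the actual bots implementation):

```python
from datetime import datetime

import requests  # assumption: querying the public GitHub API directly


def queue_time(repo, sha, context, token):
    """Seconds between 'Not yet tested' and 'Testing in progress' for one context."""
    r = requests.get(
        f"https://api.github.com/repos/{repo}/commits/{sha}/statuses",
        headers={"Authorization": f"token {token}"},
    )
    r.raise_for_status()
    # the API returns the whole status history, newest first
    events = sorted((s for s in r.json() if s["context"] == context),
                    key=lambda s: s["created_at"])

    def first(prefix):
        created = next(s["created_at"] for s in events
                       if s["description"].startswith(prefix))
        return datetime.fromisoformat(created.replace("Z", "+00:00"))

    return (first("Testing in progress") - first("Not yet tested")).total_seconds()
```

For example, `queue_time("cockpit-project/cockpit", "<sha>", "debian-stable", token)` would return the queue wait for the Debian run shown above.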
171 "description": "Tests failed with code 3",
172 "target_url": "https://logs...",
173 "context": "debian-stable",
174 "created_at": "2022-01-10T11:47:54Z"
178 [store-tests script](https://github.com/cockpit-project/bots/blob/main/store-tests)
180 [export in Prometheus format](https://logs-https-frontdoor.apps.ocp.ci.centos.org/prometheus)
- once a test finishes, the state changes to success or failure
- the statuses API remembers the *whole* history; if a failure goes back to
"in progress" and eventually "success", it was a retry
- we have a store-tests script which reads and interprets this history for a
merged PR and puts it into an SQLite database; link is on the slide
- regularly do SQL queries to calculate the current SLIs, export them in
Prometheus text format, and let a Prometheus instance pick them up to store
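Roughly what that evaluation step could look like; the table name and columns here are assumptions for illustration, the real schema lives in the store-tests script linked above:

```python
import sqlite3


def export_first_try_ratio(db_path="test-results.db"):
    """Emit one SLI in Prometheus text exposition format."""
    db = sqlite3.connect(db_path)
    # assumed schema: runs(pr INTEGER, retries INTEGER, merged INTEGER)
    total, first_try = db.execute(
        "SELECT COUNT(*), SUM(retries = 0) FROM runs WHERE merged = 1").fetchone()
    db.close()
    return ("# TYPE ci_merged_pr_green_first_try_ratio gauge\n"
            f"ci_merged_pr_green_first_try_ratio {first_try / total:.3f}\n")
```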
[Grafana SLOs](https://grafana-frontdoor.apps.ocp.ci.centos.org/d/ci/cockpit-ci?orgId=1)
- We have a Grafana instance which graphs all SLIs/SLOs; you can move around in
time and investigate problem spots more closely
- Red bars show the SLO, i.e. where the indicator exceeds the expectation and
starts to eat into the error budget
- Interesting real-time data, but not a sufficient view of how much of our error budget we used up in the last month
# Error Budget in Grafana
[Grafana Error Budget](https://grafana-frontdoor.apps.ocp.ci.centos.org/d/budget/cockpit-ci-error-budget)
- For that, we have another set of graphs which shows the error budget usage of
the last 30 days; again, the link is on the slide
- This is the budget for our first SLO, about merging a PR with retried tests
- We are still good as per our own goal, but will most likely exhaust the
budget in the next few days, so we need to take action soon
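The arithmetic behind such a budget panel is simple; a sketch with hypothetical numbers, where the 25% allowance follows from the 75% first-attempt SLO:

```python
def budget_used(merged_prs, prs_needing_retry, allowed_fraction=0.25):
    """Fraction of the 30-day error budget consumed; > 1.0 means the budget is blown."""
    return (prs_needing_retry / merged_prs) / allowed_fraction


# hypothetical example: 35 of 120 merged PRs needed a retry
# -> 0.29 / 0.25 ~= 1.17, i.e. the budget is already exhausted
print(budget_used(120, 35))
```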
- The budget for the other mentioned SLO is the queue time; it's fine normally,
but it exploded when the Westford data center went down. This hosts our
main workload, and is the only place which can run Red Hat internal tests.
- Normally we manually spin up a fallback in EC2, but this happened right at the
start of the EOY holidays, so nobody cared much -- the pending PRs were mostly
just automated housekeeping and not urgent
3x auto-retry unaffected tests:
- the goal “don’t retry PRs too often” is an emergent result of the hundreds of individual tests that run on each PR
- As explained before, we can't expect 100% success due to random noise, so we introduced the concept of an "affected test": if a PR changes the code which a test covers, or the test itself, that test is affected
- Auto-retry unaffected tests up to 3x; that's the carrot, and it made our lives dramatically better
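A minimal sketch of that policy; the coverage mapping and file layout are assumptions for illustration, not the real bots logic:

```python
MAX_RETRIES = 3


def is_affected(test_name, changed_files, coverage_map):
    """A test is affected if the PR touches the test itself or code it covers."""
    # coverage_map is an assumed dict: test name -> set of source files it covers
    touches_test = any(test_name in path for path in changed_files
                       if path.startswith("test/"))
    touches_covered_code = bool(coverage_map.get(test_name, set()) & set(changed_files))
    return touches_test or touches_covered_code


def should_retry(test_name, failures_so_far, changed_files, coverage_map):
    """Only unaffected tests get the benefit of the doubt, at most three times."""
    return (not is_affected(test_name, changed_files, coverage_map)
            and failures_so_far < MAX_RETRIES)
```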
3x pass affected tests:
- retrying alone is not sufficient: if we keep introducing new flaky tests, overall quality deteriorates, and soon enough not even 3 retries will be enough
- The stick: affected tests must pass 3x; this prevents introducing broken tests
Track tests which fail in more than 10% of cases:
- track tests which fail too often; there is some base failure rate of a few percent, but that random noise should distribute evenly across all tests
- The ones which fail in more than 10% of cases are the ones breaking PRs even with auto-retry; we need to investigate/fix these
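A minimal sketch of that bookkeeping, assuming a hypothetical list of (test name, passed) result tuples rather than the real database:

```python
from collections import Counter


def flaky_tests(results, threshold=0.10):
    """Return test names whose failure rate exceeds the threshold."""
    runs, failures = Counter(), Counter()
    for name, passed in results:
        runs[name] += 1
        if not passed:
            failures[name] += 1
    return sorted(name for name in runs if failures[name] / runs[name] > threshold)
```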
- Automatic notifications
- Regular SLO review/adjustment
- Automatic infra fallback?
- Error budget team mode
- Pretty happy with this overall; in the latest team poll, everyone said that they no longer feel blocked by or scared of PRs/tests, and productivity and turnaround time are fine
- main missing thing: add notification/escalation from Grafana once error budgets are shrinking and get too close to the limit, or even exceed it
- Regularly review and adjust the SLOs to our current feeling of happiness; goals might need to get tighter, or possibly also relaxed; if an SLO is too strict and violated but nobody cares, don't spend time on it
- Maybe find and set up an automatic fallback if our main data center fails; it's a single Ansible playbook, but the fallback is expensive, so it's okay to leave this under human control; an outage ruins the stats, but is not actually a pain point
- A more formal process for going into error budget fixing mode
- `#cockpit` on libera.chat
- [https://cockpit-project.org](https://cockpit-project.org)
- Home page leads to mailing lists, documentation
- thanks for your attention; Q+A