Goal
In the Cockpit team we spend a lot of our time on PR reviews. That’s time well spent – we all learn from each other, it keeps the code quality high and ourselves honest. But there is most certainly room for optimization: there are always silly or boring flaws like typos, inconsistent formatting, or inefficient algorithms; and humans have selective and subjective sight, i.e. they often miss things.
This sounds like a perfect opportunity to get some computerized aid from AI. Our hope wasn’t to replace human reviews, but that AI would spare us some typing for the boring/obvious flaws and find different kinds of errors than humans.
In mid-March the Cockpit team started to evaluate automated AI PR reviews of our main projects cockpit, cockpit-podman, and friends with sourcery.ai, a company/project which specializes in just that. Around the same time, GitHub Copilot PR reviews became available as well, so it was the perfect opportunity to compare these.
First synthetic test
For an initial test drive, I created a test PR with four bugs. They may not be completely obvious to a novice developer, but a seasoned one should have spotted them: an off-by-one in an iteration, dropping an element property, direct React state modification (in case you don’t use React: that’s badbadbad, you need to do a setState()-like call), and a wrong invocation of an event handler. In case you wonder, I only retitled the PR to “(buggy, AI review test)” after the fact – the initial version just said “Various fixes”, to avoid giving away any possible clue.
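To make the test concrete, here is a sketch of the four planted bug patterns. This is a hypothetical reconstruction from the descriptions above – the identifiers (items, render, handleClick, entries) are made up, not the actual code from the test PR:

```jsx
// Hypothetical reconstructions of the four planted bug patterns
// (illustrative only – not the actual code from the test PR)

// 1. Off-by-one in an iteration: <= runs one index past the end
for (let i = 0; i <= items.length; i++) // should be: i < items.length
    render(items[i]);

// 2. Dropping an element property while copying an object
const copy = { id: item.id, name: item.name }; // silently loses item.size

// 3. Direct React state modification: React never notices the change,
//    so the component does not re-render; use a setState()-like call
this.state.entries.push(entry);                                 // wrong
this.setState(prev => ({ entries: [...prev.entries, entry] })); // right

// 4. Wrong invocation of an event handler: handleClick() runs once at
//    render time instead of being passed as the click handler
<button onClick={handleClick()}>Apply</button> // should be onClick={handleClick}
```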
The “classic” (by now) static source code analysis only found the wrong invocation of the event handler. GitHub’s Copilot found three of the issues (missing the dropped property), and sourcery.ai still spotted two. Both responses were clear, with suggestions and explanations.
I was really impressed with that! So we decided to turn on sourcery.ai for good on all our projects and evaluate it in practice.
Design/CSS related PRs
Recently, the Cockpit team has mostly worked on the PatternFly 5 → 6 migration, where neither AI had any useful feedback. It’s completely understandable that it does not comment on pixels or alignment issues, as that’s (1) not something an algorithm without any visual input can even grasp, and (2) often a matter of human taste. However, in principle it could know about some things: valid vs. invalid CSS properties, using logical instead of physical properties, using consistent units or variables instead of hardcoded values, and so on.
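As a sketch of the kind of mechanical CSS feedback a review bot could give in principle – the properties are standard CSS, but the selector and the --spacer-md variable are made-up placeholders, not actual PatternFly tokens:

```css
/* Physical property with a hardcoded value – both are flaggable */
.sidebar {
    margin-left: 16px;
}

/* Logical property plus a shared variable: also works in RTL layouts */
.sidebar {
    margin-inline-start: var(--spacer-md); /* hypothetical spacing token */
}
```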
In practice, we got nothing useful at all. Just some “I reviewed this and have nothing to say, other than perhaps some random nitpick suggestions”, like in #21805. This PR was even worse: it gave two recommendations, the first one actively wrong and harmful (you should never squash unrelated changes), the second one unfounded (the commit message and code comments already described the changes).
Quotes from our two design-oriented developers: Freya: “I’ve not seen much usefulness from it”; Garrett: “In other words, I still haven’t found Sourcery useful”.
Logic related PRs
Given how good static code analyzers like Coverity or shellcheck already are, I had much higher expectations here, especially after my trial PR above. But I went through all PRs of the last four weeks, and did not find anything truly useful or surprising.
Most of them were just noise, like in this storage btrfs usage fix. That one is a good representative of about half of the PRs: the AI has nothing to say, but still creates a comment, which triggers a notification and takes time to read. Unfortunately there is no config knob to turn that off. Quoting Jelle: “the notifications are kinda pollution”, and I agree.
There is also slightly more harmful noise, like in this ruff warning fix: it demonstrates that the AI really doesn’t understand what’s going on, i.e. that dropping a stale static linter ignore is a good thing, not a bad one.
A similar case is this linux-system-roles PR, this time from Copilot. The -- suggestion is right, but to understand it you can’t just look at the patch: you need to “see” the rest of the code (script, documentation). However, it’s excusable in this case, as Copilot marked its advice as “low confidence”. So it’s not a big criticism, but it nicely demonstrates how this machinery actually works, and the narrow limits of its conclusions.
But this review thread on that sudo roles PR was even worse: it exudes a confidence that a junior developer might easily believe, while it is nonsense and actively bad advice, making the code much more complicated while not addressing the alleged problem at all (which isn’t even a problem in the first place).
Marius summarizes it quite well: “As always, I am impressed by how well AI understands the question, and by how reasonable the replies look. But I think it can’t really contribute much to our PRs. They are too good already. It comes with refactoring suggestions, which look correct and reasonable, but I am not sure we want another ‘opinion’”.
Conclusion
About half of the AI reviews were noise, and a quarter bikeshedding. The remaining quarter consisted of about 50% useful little hints and 50% outright wrong comments. Last week we reviewed all our experiences in the team and decided to switch off sourcery.ai again. Instead, we will explicitly ask for a Copilot review on PRs where a human deems it potentially useful.
This outcome matches my personal experience with GitHub Copilot in vim over the past 1.5 years – it’s a poisoned gift. Most often it just figured out the correct sequence of ), ], and } to close, or generated debug print statements – for that “typing helper” work it was actually quite nice. But for anything more nontrivial, I found that validating the code and fixing its numerous big and subtle errors took me more time than it saved.
So given the very mediocre output, this is not a time saver – on the contrary, it may even cost us more time to check the bad reviews and wade through the noise. Given how ridiculously expensive and damaging AI is to the world (I’ll expand on that in a future blog post), it is simply not a good choice for our team at this state of the art.
I really hope other people and teams have more success with this – AI is a hype either way, but surely it can’t be that bad for everyone?