
Are bugs and incidents inevitable with AI coding agents?

What specific kind of bugs is AI more likely to generate? Do some categories of bugs show up more often? How severe are they? How is this impacting production environments?


January 28, 2026


(Image credit: Alexandra Francis)

Companies are looking to harness agentic code generators to get software built faster. But for every story of increased developer productivity or greater code base understanding, there’s a story about creating more bugs and the increased likelihood of production outages.

Here at CodeRabbit, we wanted to know if the problems people have been seeing are real and, if so, how bad they are. We’ve seen data and studies about this same question, but many of them are just qualitative surveys sharing vibes about vibe coding. This doesn’t show us a path to a solution, only a perception.

We wanted something a little more actionable with actual data. What specific kind of bugs is AI more likely to generate? Do some categories of bugs show up more often? How severe are they? How is this impacting production environments?

In this article, we’ll talk about the research we did, what it means for you as a developer, and how you can mitigate the mistakes that LLMs make.

What our research says

To find answers to our questions, we scanned 470 open-access GitHub repos to create our State of AI vs. Human Code Generation Report. We looked for signals that indicated pull requests were AI co-authored or human created, like commit messages or agentic IDE files.

What we found is that there are some bugs that humans create more often and some that AI creates more often. For example, humans create more typos and difficult-to-test code than AI. But overall, AI created 1.7 times as many bugs as humans. Code generation tools promise speed but get tripped up by the errors they introduce. It’s not just little bugs: AI created 1.3-1.7 times more critical and major issues.

The biggest issues lay in logic and correctness. AI-created PRs had 75% more of these errors, adding up to 194 instances per hundred PRs. This category includes logic mistakes, dependency and configuration errors, and errors in control flow. Errors like these are the easiest to overlook in a code review, because they can look like reasonable code unless you walk through the logic yourself.
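To make the category concrete, here is a hypothetical example (not drawn from the report's data) of the kind of logic error that reads as perfectly reasonable code until you trace it: a pagination helper that uses plain integer division and silently drops the final partial page.

```python
def pages_needed(total_items, page_size):
    """Return how many pages are needed to show all items."""
    # Looks plausible, but integer division drops the last partial
    # page: 101 items at 50 per page yields 2 pages, not 3.
    return total_items // page_size


def pages_needed_fixed(total_items, page_size):
    """Correct version using ceiling division."""
    # Negating twice turns floor division into ceiling division,
    # so a trailing partial page still counts as a full page.
    return -(-total_items // page_size)
```

Nothing about the first version flags itself in a diff; only walking through a boundary case (a total that isn't a multiple of the page size) exposes the bug.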

Logic and correctness issues can cause serious problems in production: the kinds of outages that you have to report to shareholders. We found that 2025 had more outages and other incidents than prior years, even beyond what we’ve heard about in the news. While we can’t tie all those outages to AI on a one-to-one basis, this was the year that AI coding went mainstream.

We also found a number of other issues that, while they may not disable your app, were alarming:

  • Security issues: AI included bugs like improper password handling and insecure object references at a 1.5-2x greater rate than human coders.
  • Performance issues: We didn’t see a lot of these, but those that we found were heavily AI-created. Excessive I/O operations were ~8x higher in AI code.
  • Concurrency and dependency correctness: AI was twice as likely to make these mistakes, which include misuse of concurrency primitives, incorrect ordering, and dependency flow errors.
  • Error handling: AI-generated PRs were almost twice as likely to skip error and exception handling, omitting null-pointer checks, early returns, and proactive defensive coding practices.
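The error-handling gap is the easiest of these to picture. Below is a hypothetical illustration (the function names and config-loading scenario are ours, not from the report): a happy-path loader of the kind agents often emit, next to a version with the defensive checks that tend to go missing.

```python
import json


def load_config_fragile(path):
    # Happy path only: a missing file or malformed JSON crashes the
    # caller with no context and no fallback.
    with open(path) as f:
        return json.load(f)


def load_config(path, default=None):
    # Defensive version: handle the failure modes explicitly and
    # return early with a usable fallback or a clear error.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        # Missing config is an expected condition; fall back to defaults.
        return {} if default is None else default
    except json.JSONDecodeError as exc:
        # Bad JSON is a real bug; surface it with the file name attached.
        raise ValueError(f"{path} is not valid JSON: {exc}") from exc
```

The two functions diff by only a few lines, which is exactly why the fragile version sails through review inside a large PR.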

The single biggest difference between AI and human code was readability: AI code had 3x as many readability issues as human code, including 2.66x more formatting problems and 2x more naming inconsistencies. While these aren’t the issues that will take your software offline, they will make it harder to debug the issues that can.
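A naming problem rarely breaks anything on its own; it breaks the reviewer. As a made-up before/after (behaviorally identical, so any difference is purely readability):

```python
# Terse names hide the intent; a reviewer has to reverse-engineer it.
def prc(d, r):
    return sum(x["amt"] for x in d if x["st"] == "ok") * (1 - r)


# Same computation, but the names state what is being calculated.
def total_after_discount(orders, discount_rate):
    completed_total = sum(
        order["amt"] for order in orders if order["st"] == "ok"
    )
    return completed_total * (1 - discount_rate)
```

Multiply the first style across a 500-line diff and the logic errors described above have plenty of places to hide.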

Why errors happen with coding agents

Major errors happen largely because these coding agents are primarily trained on next-token prediction over large swaths of training data. That training data includes large numbers of open-source or otherwise insecure code repositories, but it doesn’t include your code base. That is, any given LLM you use is going to lack the context necessary to write the correct code.

When you try to provide that context as a system prompt or agents.md file, that may work depending on the LLM or agentic harness you’re using. But eventually, the AI tool will need to compact the context or use a sliding window strategy to manage it efficiently. At the end of the day, though, you’re dropping information. If you have a task list where the agent is supposed to create code, review it, and check it off when it's done, eventually it forgets. It starts forgetting more and more along the way until the point where you have to stop it and start over.
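A minimal sketch of why that forgetting happens, assuming a simple sliding-window strategy (the function and its character-count token proxy are illustrative; real harnesses use tokenizers and smarter compaction): the window keeps the most recent messages that fit a budget, so the oldest context, often the original task list, is the first thing dropped.

```python
def sliding_window_context(messages, token_budget, count_tokens=len):
    """Keep only the most recent messages that fit in token_budget."""
    # Walk backwards from the newest message, accumulating cost until
    # the budget is exhausted. Everything older is silently discarded.
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > token_budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

With a tight budget, the agent's original instructions fall out of the window first, which is exactly the "it starts forgetting" failure mode described above.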

We’re past the days of code completion and cut and pasting from chat windows. People are using AI agents and running them autonomously now, sometimes for very long periods of time. Any mistakes—hallucinations, errors in context, even slight missteps—compound over the running time of the agent. By the end, those mistakes are baked into the code.

Agentic coding tools make generating code incredibly easy. To a certain degree, it's fun to be able to magically drop 500 lines of code in a minute. You’ve got five windows going, five different things being implemented at the same time. No idea what any of them are building, but they're all being built right now.

Eventually, though, someone will need to make sure that code works, to ensure that only quality code hits the production servers.

Why AI code is so hard to review

There’s a joke that if you want a lot of comments, make a PR with 10 lines of code; if you want it approved immediately, submit 500 lines. This is the law of triviality: small changes get more attention than big ones. With agentic code generators, it becomes very easy to produce these very large commits with massive diffs.

Massive commits combined with hard-to-read code make it very easy for serious logic and correctness errors to slip through. This is where the readability problem compounds. AI creates more surrounding harness code and scattered inline comments; there's simply a lot more to read. Unless someone (preferably multiple someones) is combing through every single line of these commits, you could be creating tech debt at a scale not previously imagined.

Think of a code base over the lifetime of a company. Early-stage companies have a mentality of moving fast and getting their software out there, but maintainability, complexity, and readability issues compound over time. That debt may not cause the outage, but it will make the outage harder to fix. Eventually, the tech debt has to be paid off. Either the company dies or somebody has to rewrite everything, because nobody can follow what any of the code is doing.

What you can do to stop errors

People want to use agentic coding tools and get the productivity gains. But it’s important to use them in a way that mitigates some of the potential downstream effects and prevents AI-generated errors from affecting your uptime. At every stage in the process, there are things you can do to make the end result better.

Pre-plan

[...]

