The Perverse Incentives of Vibe Coding
by Fred Benenson, May 2025

And indeed, the highly complex tasks I’ve handed to them have largely resulted in failure: implementing a minimax algorithm in a novel card game, crafting thoughtful animations in CSS, completely refactoring a codebase. The LLMs routinely get lost in the sauce when it comes to thinking through the high-level principles required to solve difficult computer-science problems.

In the example above, my human-implemented version of minimax from 2018 totals 400 lines of code, whereas Claude Code’s version comes in at 627 lines. The LLM version also requires almost a dozen other library files. Granted, this version is in TypeScript and has a ton of extra bells and whistles, some of which I explicitly asked for, but the real problem is: it doesn’t actually work. Furthermore, debugging it with the LLM means sending all of that bloated code back and forth to the API every time I want to reason about the problem holistically.
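
For contrast, the core of minimax genuinely is small. Here’s a minimal TypeScript sketch with alpha-beta pruning, written against a hypothetical GameState interface (illustrative only; this is not my 2018 implementation and not the card game’s actual rules):

```typescript
// Minimal minimax with alpha-beta pruning, sketched against a
// hypothetical GameState interface. Illustrative only.
type Move = string;

interface GameState {
  isTerminal(): boolean;
  evaluate(): number;            // score from the maximizing player's point of view
  legalMoves(): Move[];
  apply(move: Move): GameState;  // returns the successor state
}

function minimax(
  state: GameState,
  depth: number,
  maximizing: boolean,
  alpha = -Infinity,
  beta = Infinity
): number {
  if (depth === 0 || state.isTerminal()) return state.evaluate();

  let best = maximizing ? -Infinity : Infinity;
  for (const move of state.legalMoves()) {
    const score = minimax(state.apply(move), depth - 1, !maximizing, alpha, beta);
    if (maximizing) {
      best = Math.max(best, score);
      alpha = Math.max(alpha, best);
    } else {
      best = Math.min(best, score);
      beta = Math.min(beta, best);
    }
    if (beta <= alpha) break; // prune branches that can't change the result
  }
  return best;
}
```

The point isn’t that this covers every feature I asked for; it’s that the algorithmic heart fits in a few dozen lines, and everything beyond that should have to earn its place.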

In an effort to impress the user and over-deliver, LLMs end up creating a rat’s nest of ultra-defensive code littered with debugging statements, neurotic comments and barely-useful helper functions. If you’ve ever worked in a highly functional production codebase, this is enough to drive you insane.

I think everyone who spends any time vibe coding eventually discovers something like this and realizes that it’s much more worthwhile to work from a plan composed of discrete tasks you could explain to a junior-level developer than to hand off a feature-level project as you would to a staff engineer.

There’s also the likelihood that the vast majority of code that LLMs have been trained on tends to be inelegant and overly verbose. Lord knows there’s a lot of AbstractJavaFinalSerializedFactory code out there.

But I’m beginning to think the problem runs deeper, and it has to do with the economics of AI assistance.

Many AI coding assistants, including Claude Code, charge based on token count — essentially the amount of text processed and generated. This creates what economists would call a “perverse incentive” — an incentive that produces behavior contrary to what’s actually desired.

Let’s break down how this works (there’s a rough cost sketch after the list):

  1. The AI generates verbose, procedural code for a given task
  2. This code becomes part of the context when you ask for further changes or additions (this is key)
  3. The AI now has to read (and you pay for) this verbose code in every subsequent interaction
  4. More tokens processed = more revenue for the company behind the AI
  5. The LLM developers have no incentive to “fix” the verbose code problem, because doing so would meaningfully impact their bottom line
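
To make the compounding concrete, here’s a rough sketch of how re-sent context inflates cost over a session. The numbers are made up for illustration; real pricing and context handling vary by provider:

```typescript
// Rough illustration of how verbose output compounds token costs.
// All numbers are hypothetical; real pricing varies by model and provider.
const PRICE_PER_INPUT_TOKEN = 3 / 1_000_000; // e.g. $3 per million input tokens

function conversationCost(tokensPerResponse: number, turns: number): number {
  let context = 0; // tokens of prior output re-sent as context each turn
  let cost = 0;
  for (let i = 0; i < turns; i++) {
    cost += context * PRICE_PER_INPUT_TOKEN; // you pay to re-read earlier output
    context += tokensPerResponse;            // each response grows the context
  }
  return cost;
}

// A model that answers in 2,000 tokens vs. one that answers in 5,000:
console.log(conversationCost(2_000, 20).toFixed(2)); // ~$1.14 over 20 turns
console.log(conversationCost(5_000, 20).toFixed(2)); // ~$2.85 over 20 turns
```

The exact figures don’t matter; what matters is that every extra token the model emits gets billed again on every subsequent turn.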

As Upton Sinclair famously noted: “It is difficult to get a man to understand something when his salary depends on his not understanding it.” Similarly, it might be difficult for AI companies to prioritize code conciseness when their revenue depends on token count.

This pattern points to a more general concern in AI development: the alignment between how systems are monetized and how well they serve user needs. When charging by token count, there’s naturally less incentive to optimize for elegant, minimal solutions.

Even “all you can eat” subscription plans (e.g. Claude’s “Max” subscription) don’t fully resolve this tension, as they typically come with usage caps or other limitations that maintain the underlying incentive structure.

The perverse incentives in AI code generation point to a more fundamental issue that extends beyond coding assistants. When she was reading a draft of this, Louise pointed out some recent research from Giskard AI’s Phare benchmark that reveals a troubling pattern that mirrors our coding dilemma: demanding shorter responses jeopardizes the accuracy of the answers.

According to their findings, instructions emphasizing conciseness (like “answer this question briefly”) significantly degraded factual reliability across most models tested — in some cases causing a 20% drop in hallucination resistance. When forced to be concise, models face an impossible choice between fabricating short but inaccurate answers or appearing unhelpful by rejecting the question entirely. The data shows models consistently prioritize brevity over accuracy when given these constraints.

There’s clearly something going on where the more verbose the LLM is, the better it does. This actually makes sense given the discovery that chain-of-thought reasoning improves accuracy, but this issue has begun to feel like a real tradeoff when it comes to these almost-magical systems.

We see this exact tension in code generation every day. When we optimize for conciseness and ask for problems to be solved in fewer steps, we often sacrifice quality. The difference is that in coding, the sacrifice manifests as over-engineered verbosity: the model produces more tokens to cover all possible edge cases rather than thinking deeply about the elegant core solution or the root cause of the problem. In both cases, economic incentives (token optimization) work against quality outcomes (factual accuracy or elegant code).

Just as Phare’s research suggests that seemingly innocent prompts like “be concise” can sabotage a model’s ability to debunk misinformation, our experience shows that standard prompting approaches can yield bloated, inefficient code. In both domains, the fundamental misalignment between token economics and quality outputs creates a persistent tension that users must actively manage.

While we wait for AI companies to better align their incentives with our need for elegant code, I’ve developed several strategies to counteract verbose code generation:

I harass the LLM to write a detailed plan before generating any code. This forces the model to think through the architecture and approach, rather than diving straight into implementation details. Often, I find that a well-articulated plan leads to more concise code, as the model has already resolved the logical structure of the solution before writing a single line.

I’ve implemented a strict “ask before generating” protocol in my workflow. My personal CLAUDE.md file explicitly instructs Claude to request permission before writing any code. Infuriatingly, Claude Code regularly ignores this, likely because its massive system prompt talks so much about writing code that it overrides my preferences. Enforcing this boundary and repeatedly belaboring it (“remember, don’t write any code”) helps prevent the automatic generation of unwanted, verbose solutions.
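
For reference, the kind of instruction I mean looks roughly like this (a hypothetical excerpt, not my actual file):

```markdown
<!-- Hypothetical CLAUDE.md excerpt: an "ask before generating" protocol -->
## Workflow rules

- Do NOT write or modify any code until I explicitly say "go ahead".
- Before implementing anything, present a short plan: the files you would
  touch, the approach, and a rough estimate of how many lines will change.
- Prefer the smallest change that solves the problem: no defensive
  scaffolding, no debug prints, no speculative helper functions.
```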

Version control becomes essential when working with AI-generated code. I frequently commit a checkpoint in git whenever I arrive at an “ok, it works as intended” moment. Creating experimental branches is also very helpful. Most importantly, I’m ready to throw out branches entirely when fixing them would require more work than starting from scratch. This willingness to abandon sunk costs is surprisingly important: it helps me work through problems and figure out the AI’s hangups while preventing the accumulation of band-aid solutions on top of fundamentally flawed approaches.

Sometimes the simplest solution works best: using a smaller, cheaper model often results in more direct solutions. These models tend to generate less verbose code simply because they have limited context windows and processing capacity. While they might not handle extremely complex problems as well, for many day-to-day coding tasks, their constraints can actually produce more elegant solutions. For example, Claude 3.5 Haiku is currently 26% the price of Claude 3.7 ($0.80 per million input tokens vs. $3). Also, Claude 3.7 seems to overengineer more frequently than Claude 3.5.

What might a better approach look like?

  • LLM coding agents could be evaluated and incentivized based on code quality metrics rather than just token counts. The challenge here is that this kind of metric is quite subjective.
  • Companies could offer pricing models that reward efficiency rather than verbosity (I have no idea how this would work, this was Claude’s dumb idea)
  • LLM training should incorporate feedback mechanisms that specifically promote concise, elegant solutions via RLHF (e.g. showing developers multiple versions of the same code and having them pick the optimal one; perhaps this is already happening)
  • Companies realize that overly verbose code generation is not good for their bottom line (e.g. Sam Altman admitted that users saying “please” and “thank you” to ChatGPT is costing them millions of dollars)

This isn’t just about getting better AI — it’s about making sure that the economic incentives driving AI development align with what we actually value as developers: clean, maintainable, elegant code that solves problems at their root.

Until then, don’t forget: brevity is the soul of wit, and machines have no soul.

Thanks to Louise Macfadyen, Justin Kazmark and Bethany Crystal for reading and suggesting edits to a draft of this.

—

PS: Yes, I used Claude to help write this post critiquing AI verbosity. There’s a delicious irony here: these systems will happily help you articulate why they might be ripping you off. Their willingness to steelman arguments against their own economic interests shows that the perverse incentives aren’t embedded in the models themselves, but in the business decisions surrounding them. In other words, don’t blame the AI — blame the humans optimizing the revenue models. The machines are just doing what they’re told, even when that includes explaining how they’re being told to do too much.
