The Pain-Point List: What Developers Actually Complain About in AI Coding

2026-07-03

Early in 2025, Andrej Karpathy coined "vibe coding": code by feel and forget the code even exists. A year later, even he says that era is ending and we're entering "agentic engineering." Spec-Driven Development (SDD) became the new keyword, and a wave of tools appeared almost overnight: GitHub Spec Kit, Amazon Kiro, Tessl, OpenSpec, BMAD, and more.

We build SpexCode, so of course we're on the spec side. But before telling our own story, we wanted to do something more honest first: go into the community without a prior, and see what people actually complain about.

So we spent a while reading a dozen high-traffic Hacker News threads (thousands of comments) and the issue/discussion trackers of a dozen open-source SDD projects (60+ distinct complaints). The list below quotes people verbatim wherever possible, and every source is linked. Our own opinion is at the very end — you can skip it and read only what the community said.

How to read this list

For each pain point we rate severity along three axes:

Breadth — how many independent people, across how many threads, reproduce it;
Intensity — how much emotion, how much real loss;
Solved? — has any tool actually fixed it.

These combine into three tiers: 🔴 P0 fatal and universally unsolved / 🟠 P1 serious / 🟡 P2 moderate or localized.

One caveat up front: the community can't even agree on what "vibe coding" means (Karpathy's "forget the code," or any LLM-assisted coding?). A lot of the heat is definitional. Below we keep only substantive, reproducible complaints about the actual practice.

1. Vibe coding's bill comes due on day 90

The most common one: so-called "throwaway prototype code" never gets thrown away. It ships to production and never gets rewritten.

"'throwaway prototyping code' gets shipped straight to production and never updated, because it's addressing real needs right now and there's no time to fix it... vibe coding is just making the problem much worse." — HN user @Analemma_, Vibe Debugging: Enterprises' Up and Coming Nightmare

"There's nothing more permanent than a throwaway prototype." — HN user @Sharlin, same thread

And the debt compounds as the same flaw gets copied across the codebase:

"3 months into the vibe-coded project you discover there is a concurrency issue. But it is now all over the place in hundreds of variations. How do you even fix it?" — HN user @kikimora, The problem with "vibe coding"

The community calls this "the 90-day reckoning": around month three, once the codebase crosses a certain size, adding features starts breaking existing ones, and maintaining the code begins to cost more than writing it by hand would have.

Severity: 🟠 P1. Extremely common. One honest caveat: the specific figures floating around ("12% test coverage," "300% maintenance-cost growth," "40% of projects cancelled by 2028") mostly trace back to vendor marketing blogs with no primary source, so don't cite them. The one rigorously sourced data point is METR's randomized controlled trial: 16 experienced open-source developers, 246 real tasks. Developers expected AI to make them 24% faster and still felt 20% faster afterward, while they were actually 19% slower (arXiv 2507.09089, metr.org, July 2025). What it shows isn't just that "tech debt is expensive," but something more fundamental: even "faster" can be an illusion.

2. Generation got cheap; review became the bottleneck

Probably the single most consistent structural observation across every thread: AI made writing cheap, so the cost shifted entirely onto review — and review doesn't scale.

"Organizations now generate 10x the amount of code, because everyone can do it. But we have exactly the same number of reviewers." — HN user @whatever1, "Vibe code hell" has replaced "tutorial hell"

"the code-review process is, shall we say, not fun... there is way more back-and-forth... I have a feeling these claims of being more productive don't account for the entire development cycle." — HN user @unzadunza, Vibe Debugging

A harder layer: review presupposes comprehension, and you can't review what you don't understand.

"if you don't understand the scope fully, then you can't possibly validate what the AI is spitting out, you can only hope that it has not fucked up." — HN user @louthy (40-year engineer), Vibe coding creates fatigue?

And AI output looks exactly like good engineering — which is what makes it dangerous:

"code that agents write looks plausible and impressive while it's being written and presented to you. It even looks good in pull requests (as both you and the agent are well trained in what a 'good' pull request looks like)." — from After two years of vibecoding, I'm back to writing by hand (quoted by @simonw)

"code bases that seem very well architected and engineered. But once you start using the code for a week you notice it's full of bugs and the consistent architecture is just make believe... this problem is getting way worse with generated code that is optimized to read like it's very well engineered." — HN user @blablabla123, The problem with "vibe coding"

Severity: 🔴 P0. Enormous breadth, and unsolved — because it's a direct consequence of the fundamental asymmetry that generation is cheap and verification is expensive.
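The asymmetry can be made concrete with a toy backlog model. This is a minimal sketch under made-up numbers: the team size, the 50-changes-a-week review capacity, and the function name `review_backlog` are all hypothetical, and only the "10x the code, same reviewers" ratio comes from the quote above.

```python
def review_backlog(weeks: int, generated_per_week: float,
                   reviewed_per_week: float) -> list[float]:
    """Return the unreviewed-change backlog at the end of each week.

    The backlog never drops below zero: reviewers cannot review
    changes that have not been written yet.
    """
    backlog = 0.0
    history = []
    for _ in range(weeks):
        backlog = max(backlog + generated_per_week - reviewed_per_week, 0.0)
        history.append(backlog)
    return history


# Hypothetical team whose reviewers can handle 50 changes a week.
before = review_backlog(weeks=4, generated_per_week=50, reviewed_per_week=50)
after = review_backlog(weeks=4, generated_per_week=500, reviewed_per_week=50)

print(before)  # [0.0, 0.0, 0.0, 0.0]
print(after)   # [450.0, 900.0, 1350.0, 1800.0]
```

In this model the queue grows without bound as long as generation outpaces review capacity, which is the sense in which review "doesn't scale." Real teams respond by reviewing less carefully or merging unreviewed code, neither of which the model captures.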

3. Loss of intent: nobody knows why the code is the way it is

The most painful moment in debugging: a bug appears, and no one can explain why it was written that way.

"when an inevitable bug bubbles up... 'Hey, you changed the way transactionality was handled here, and that's made a really weird race condition happen. Why did you change it?'... 'I don't know, the AI did it'. This makes chasing things down exponentially harder." — HN user @frio, Vibe coding creates fatigue?

"LLMs dont care about the story, they just care about the current state of the code... it doesn't see the iterations and hacks and refactors and reverts... In the end I am redesigning my library from scratch with minimal AI input." — HN user @tomaytotomato, Vibe coding kills open source

Original intent is a precious asset, and it can't be fully seen in the code. This one hits close to home — it's the flip side of the "hidden knowledge and collapse" idea in our Second Day Problem post.

Severity: 🟠 P1. Widespread, and worsens with project complexity.

4. So everyone turned to Spec-Driven — but the bill arrived

Those pain points are exactly what SDD tools claim to fix: write intent as a spec, make the spec the source of truth, use it to constrain the agent. Sounds right. But in these tools' issue trackers, the community reports a fresh set of bills.

Waterfall with extra steps / "the illusion of work." The most-repeated framing:

"SpecKit creates the illusion of work, generating a bunch of text... Overengineering, because the LLM, instead of executing a concise command, is busy analyzing kilobytes of text and generating new texts, completely losing the essence of the work." — GitHub github/spec-kit Issue #75 (the widely-shared Discussion #1784)

"We've reinvented Big Design Up Front. We just replaced Word documents with Markdown and project managers with LLMs." — Nils Kjellman, Your Spec Driven Workflow Is Just Waterfall With Extra Steps

Too much ceremony, double the cost. People benchmark against just-prompt-the-agent, and SDD comes up short:

"I was seeing tokens get up to 20k or more... before writing a single file... But then I run a command like this directly against claude... I get the first task completed with about 2k tokens." — claude-code-spec-workflow Discussion #41

"OpenSpec produced 50% more code having 50% more cyclomatic complexity... OpenSpec took twice as long and cost three times as much." — openspec Discussion #1159

Kiro's cost complaints are louder still (AWS later admitted a billing bug):

"The allocated monthly limits were completely consumed within 15 minutes." — kiro Issue #2171

Context saturation — the great irony. The specs meant to provide context instead flood the window:

"spec-kit commands consume approximately 18.6k tokens" — ~93% of a Cursor default window — "This creates a significant 'context tax' that reduces the available working space for actual development tasks." — spec-kit Issue #1401

"when you put kilobytes of text into context it can randomly fail. More input data = more unpredictable results... Then I reduced CLAUDE.md by three times and surprise! It greatly improved stability." — spec-kit Issue #75

Severity: 🟠 P1. Common among SDD-tool users, and it bites hardest on small/medium changes.
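The token figures quoted in this section are easy to sanity-check. The sketch below assumes a 20,000-token working window, which is our inference from the quoted "~93%" and not a number stated in the issue. The function names are hypothetical, and the 20k and 2k inputs to the ratio are the rough figures from the claude-code-spec-workflow quote.

```python
def context_tax(overhead_tokens: int, window_tokens: int) -> float:
    """Fraction of the context window consumed before any real work starts."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return overhead_tokens / window_tokens


def remaining_tokens(overhead_tokens: int, window_tokens: int) -> int:
    """Tokens left for the actual task once the overhead is loaded."""
    return max(window_tokens - overhead_tokens, 0)


def overhead_ratio(with_spec_tokens: int, direct_tokens: int) -> float:
    """How many times more tokens the spec workflow used than a direct prompt."""
    return with_spec_tokens / direct_tokens


# 18.6k tokens of spec-kit commands in an assumed 20k window.
print(f"{context_tax(18_600, 20_000):.0%}")   # 93%
print(remaining_tokens(18_600, 20_000))        # 1400

# Roughly 20k tokens via the spec workflow vs. about 2k for a direct prompt.
print(overhead_ratio(20_000, 2_000))           # 10.0
```

Under that assumed window, only about 1,400 tokens remain for the task itself, which is what the issue means by a "context tax."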

5. The one that stings most: spec and code drift apart anyway

If you remember only one thing, make it this. SDD's entire pitch is "the spec is the living source of truth" — and this is exactly where users report failure most universally: nothing reconciles the spec with code after an out-of-band change. We found no user anywhere reporting a spec that self-updates from code. Two maintainers openly concede it isn't built.

"In an ideal world any drift between code and spec should be 'reconciled'. Ideally this happens automatically without you having to worry about... up till then this process has to be manual unfortunately." — openspec maintainer, Discussion #169

"quick fixes happen... at 11pm, and nobody remembers to sync the specs after. By the time you open the next session, the specs are stale and the agent is working from outdated context." — same thread

"The generated output after the spec is implemented quickly becomes out of date... In a perfect world there would be some background job that can detect when a feature deviates from the original spec and suggests changes to bring it up to date." — spec-kit Issue #620

Kiro's three spec files don't self-update either; users resort to manual git diff:

"requirements and design artifacts don't auto-update." — François Dexemple, Brilliant, Broken, and Frustrating: My Deep Dive into Amazon's Kiro AI IDE

The category-level verdict is blunt:

"keeping specs in sync with the code creates a maintenance tax that grows with system complexity... every update becomes documentation debt disguised as engineering discipline." — Isoform, The Limits of Spec-Driven Development

Severity: 🔴 P0, and openly unsolved. Reproduced across OpenSpec (≥8 issues), Spec Kit (≥6), BMAD, and Kiro (retrofitting git-ref detection post-launch), with maintainers conceding it's manual. This is the category's open gap.
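The "background job" wished for in spec-kit Issue #620 can be sketched in a few lines. This is an illustrative heuristic, not something any of the tools above ships: it counts commits that touched code after the last commit that touched the spec. The paths, the `@@` marker, and all function names are hypothetical; the `git log` flags are standard.

```python
import subprocess

MARKER = "@@"  # prefix we ask git to print before each commit hash


def parse_log(log_text: str) -> list[tuple[str, list[str]]]:
    """Parse `git log --name-only --pretty=format:@@%H` output.

    Returns (commit_hash, changed_paths) pairs in the order git printed them.
    Assumes no tracked path itself starts with the marker.
    """
    commits: list[tuple[str, list[str]]] = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(MARKER):
            commits.append((line[len(MARKER):], []))
        elif commits:
            commits[-1][1].append(line)
    return commits


def code_commits_since_spec(commits: list[tuple[str, list[str]]],
                            spec_path: str, code_prefix: str) -> int:
    """Count code-touching commits made after the last spec-touching commit.

    `commits` must be ordered oldest first. A commit that changes both the
    spec and the code is treated as in sync and resets the counter.
    """
    drift = 0
    for _commit_hash, paths in commits:
        if spec_path in paths:
            drift = 0
        elif any(p.startswith(code_prefix) for p in paths):
            drift += 1
    return drift


def read_git_log(repo_dir: str = ".") -> str:
    """Ask git for the full history, oldest commit first."""
    result = subprocess.run(
        ["git", "log", "--reverse", "--name-only",
         f"--pretty=format:{MARKER}%H"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hand-written sample in the same shape as the git output.
sample = """@@a1
specs/login/spec.md
src/login.py

@@b2
src/login.py

@@c3
src/login.py
README.md
"""
print(code_commits_since_spec(parse_log(sample), "specs/login/spec.md", "src/"))  # 2
```

A non-zero count only says the spec may be stale. It cannot tell whether the 11pm fix actually changed behavior the spec describes, which is why the maintainers quoted above still call reconciliation manual.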

6. Prose can't hold an agent: the community is asking for mechanism

The second-loudest theme: steps hard-coded in a workflow get treated as suggestions — agents skip them, deviate from the spec, mark placeholders as "done."

"Claude exhibits a critical pattern of stopping workflow execution after Step 8 (quality checks), treating technical functionality as 'complete'... No commit, no PR, no cleanup... Work exists only in feature branch ('zombie work')." — agent-os Issue #36

"Commands like /create-spec, /execute-task treated as suggestions... Claude Code frequently ignores mandatory workflow instructions." — agent-os Issue #33

What happened next is the interesting part: users started explicitly requesting deterministic, un-bypassable enforcement.

(request) "hooks for deterministic workflow enforcement." — agent-os Issue #37

"make it structurally impossible to mark a task complete without logging, regardless of what the AI agent's prompt says." — spec-workflow-mcp Issue #199

In other words, the community reached its own conclusion: prose (prompts, spec documents) can't constrain an agent; you need mechanism.
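The difference between prose and mechanism can be shown in miniature. The sketch below is a hypothetical illustration of the request in spec-workflow-mcp Issue #199, not that project's code: `Task`, `GateError`, and the method names are invented for the example. The point is that completion is a method with a precondition rather than an instruction the agent is asked to follow.

```python
from dataclasses import dataclass, field


class GateError(Exception):
    """Raised when a task is closed without the required evidence."""


@dataclass
class Task:
    name: str
    _log: list[str] = field(default_factory=list)
    _done: bool = False

    @property
    def done(self) -> bool:
        """Read-only view of the completion flag (no public setter)."""
        return self._done

    def log(self, entry: str) -> None:
        """Record what was actually done; blank entries are rejected."""
        if not entry.strip():
            raise ValueError("log entry must not be empty")
        self._log.append(entry)

    def complete(self) -> None:
        """Mark the task done, but only if at least one log entry exists."""
        if not self._log:
            raise GateError(f"cannot complete {self.name!r}: nothing was logged")
        self._done = True


task = Task("add login endpoint")
try:
    task.complete()          # refused: nothing has been logged yet
except GateError as err:
    print(err)

task.log("implemented handler, ran tests, opened PR")
task.complete()
print(task.done)             # True
```

Inside a single Python process a determined caller can still poke at `_done`, so this only shows the shape of the idea. Enforcement that an agent cannot talk its way around has to live outside the agent's control, for example in a hook or a server that owns the task state.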

Severity: 🔴 P0, because it is a directional consensus the community arrived at through its own pain, not a vendor's marketing.

Severity summary

Pain point	Severity	Breadth	Solved?
Spec ↔ code drift	🔴 P0	Category-wide	Openly unsolved
Prose can't hold the agent (skip / deviate / false-done)	🔴 P0	Category-wide	Unsolved; users asking for mechanism
Review-burden inversion (generation cheap, verify is bottleneck)	🔴 P0	Very common	Unsolved
"Looks like good engineering" slop	🟠 P1	Very common	Unsolved
Context saturation (specs flood the window)	🟠 P1	SDD tools broadly	Unsolved
Prototype-to-prod / 90-day tech-debt wall	🟠 P1	Very common	Unsolved
Cost / token blowup (vs. plain-agent baseline)	🟠 P1	SDD tools broadly	Unsolved
Loss of intent (nobody knows why)	🟠 P1	Common	Unsolved
Ceremony overkill = waterfall / illusion of work	🟠 P1	SDD tools broadly	Unsolved
Non-determinism (same spec, twice, different result)	🟠 P1	Common	Inherently hard
Cognitive fatigue / never-finished vigilance burnout	🟠 P1	Strongly reproduced	Unsolved
Tests as false safety (LLM games its own tests)	🟠 P1	Common	Unsolved
Over-engineering / ignores existing codebase	🟡 P2	Common	Unsolved
Skill atrophy & broken junior pipeline	🟡 P2	Societal	—
OSS maintainers drowning in AI slop PRs	🟡 P2	OSS-local but already happening	Unsolved

What we read into this list

The observation ends here. What follows is our reading.

The three P0s interlock: review doesn't scale because the agent's code has shed its intent and can drift from the spec at any moment; and the reason nothing holds the agent is that everyone keeps using prose (prompts, ever-longer spec docs) to constrain something that prose does not bind.

The striking part is that the community has already shouted the shape of the answer — "mechanism, not more prose" (agent-os asking for hook enforcement, spec-workflow asking for "structurally impossible" gates).

That's the bet SpexCode makes:

Against drift — we don't treat the spec as a document you must remember to write back. A node's "version" is the count of content commits to its spec.md; drift is derived live from git, no stored hashes, no reliance on discipline. Spec and code are forced into the same commit by the same gates (pre-commit, main-guard, commit-before-declare).
Against "prose can't hold the agent" — in SpexCode the contract is data (a config node), auto-discovered by the harness as an always-on mechanism, not a launch-time --append-system-prompt. The "deterministic enforcement" people ask for elsewhere is the default here.
Against loss of intent — original intent and design decisions live next to the code and can be read; the recent/history version timeline, plus each commit's Session: attribution, reconstruct "why it changed" straight from git.
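To make the first two ideas in this list tangible, here is an illustrative sketch. It is not SpexCode's actual implementation. The directory layout (each node is a folder holding a `spec.md` next to its code), the function names, and the reading of "content commit" as "any commit that touched the file" are all assumptions made for the example. The git commands themselves are standard.

```python
import subprocess
from pathlib import PurePosixPath

SPEC_NAME = "spec.md"  # file name from the post; the folder layout is assumed


def spec_version(spec_path: str, repo_dir: str = ".") -> int:
    """Count the commits reachable from HEAD that touched `spec_path`.

    Nothing is stored: the number is recomputed from git history each time.
    """
    result = subprocess.run(
        ["git", "rev-list", "--count", "HEAD", "--", spec_path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def nodes_missing_spec(staged_paths: list[str], node_dirs: list[str]) -> list[str]:
    """Return node directories with staged changes but no staged spec.md.

    A pre-commit gate could refuse the commit when this list is non-empty.
    """
    staged = [PurePosixPath(p) for p in staged_paths]
    missing = []
    for node in node_dirs:
        node_path = PurePosixPath(node)
        in_node = [p for p in staged if node_path in p.parents]
        if in_node and node_path / SPEC_NAME not in in_node:
            missing.append(node)
    return missing


# Hypothetical staged file list, as `git diff --cached --name-only` would print it.
staged = [
    "nodes/auth/handler.py",
    "nodes/billing/spec.md",
    "nodes/billing/charge.py",
]
print(nodes_missing_spec(staged, ["nodes/auth", "nodes/billing"]))  # ['nodes/auth']
```

The design choice worth noticing is that both functions read state from git rather than from a side file, so there is nothing extra for a person or an agent to remember to update.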

We don't claim to have solved drift — it's an industry-wide hard problem. We've simply chosen to take the project's most extreme belief, "git is the database," all the way: let spec and code be forced into sync by one mechanism in git, rather than by anyone's discipline. That's the same thing we said in The Second Day Problem — bending the agent's flat capability curve upward.

Sources: Hacker News threads — Back to writing by hand, Vibe coding kills open source, Fatigue, Vibe code hell, Vibe Debugging, Waterfall Strikes Back, Are you still using SDD?; GitHub projects — spec-kit (#75/#1784, #464, #620, #1401), Kiro (#2171), OpenSpec (#169, #1159), agent-os (#33/#36/#37), spec-workflow-mcp (#199), claude-code-spec-workflow (#41); longer pieces — Nils Kjellman, Isoform, François Dexemple (Kiro review), Red Hat; the productivity figure is from METR's randomized controlled trial (arXiv 2507.09089). Quotes are verbatim; where a Chinese translation exists, the original governs.