AI is leverage. Not judgment.
LLMs are useful enough to change the workflow. They are not reliable enough to remove ownership. The practical line: use them aggressively when verification is cheap, and slow down when the model cannot carry the cost of being wrong.
If checking is cheaper than creating, use AI.
If checking is harder than doing it yourself, you are not saving time. You are moving risk into review.
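The rule reduces to a cost comparison. A minimal sketch, assuming you can roughly estimate the minutes involved; `should_use_ai`, its parameters, and every number below are hypothetical illustrations, not measurements from any source.

```python
def should_use_ai(create_min: float, review_min: float,
                  p_wrong: float = 0.0, rework_min: float = 0.0) -> bool:
    """Return True when checking AI output is expected to cost less than doing the work yourself.

    All inputs are rough human estimates (assumptions), in minutes:
      create_min  - time to do the task by hand
      review_min  - time to verify the model's output
      p_wrong     - chance the output fails review and needs rework
      rework_min  - time to fix it when it does
    """
    if not 0.0 <= p_wrong <= 1.0:
        raise ValueError("p_wrong must be between 0 and 1")
    expected_ai_cost = review_min + p_wrong * rework_min
    return expected_ai_cost < create_min


# Boilerplate: slow to type, quick to check.
print(should_use_ai(create_min=30, review_min=5, p_wrong=0.1, rework_min=10))   # True
# Subtle change in a mature repo: review costs more than writing it.
print(should_use_ai(create_min=40, review_min=60, p_wrong=0.3, rework_min=40))  # False
```

The estimates will be wrong in detail. The point is to make the review cost an explicit input instead of assuming it is zero.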
Utility rises faster than accountability.
SWE-bench Verified performance moved from roughly 60% to near 100% in one year. (Stanford AI Index 2026)
OSWorld task success jumped, but agents still fail about 1 in 3 structured attempts. (Stanford AI Index 2026)
More developers said they distrust AI output accuracy than trust it. (Stack Overflow 2025)
In one RCT, AI tools slowed experienced open-source developers on mature repos. (Becker et al., 2025)
The jagged frontier.
The mistake is treating AI capability like one smooth curve. It is not. The same system can look elite in a coding benchmark, miss a simple visual task, produce an excellent first draft, and then invent a citation with total confidence.
That does not make the technology fake. It means the workflow has to be designed around uneven reliability.
Illustrative map, not a benchmark. The point is shape: model strength drops when the task moves from pattern work into messy context and human responsibility.
Where it still breaks.
Not anti-AI. Anti-blind-delegation.
Uncertainty.
The model can sound settled when the world is not. OpenAI's own hallucination note frames the issue as a persistent uncertainty problem, not something solved by a smarter next model alone.
Long horizon.
Agents can be excellent at bounded steps and still weak at work that takes many coordinated moves. METR's time-horizon work is useful because it measures reliability against the length of tasks, not just whether a benchmark answer looks right.
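Simple compounding shows why strong steps can still add up to a weak task. This is a sketch under a loud assumption: every step succeeds independently with the same probability, which real work does not guarantee. It is not METR's methodology, and the 0.95 per-step figure is purely illustrative.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that n_steps independent steps all succeed (assumes equal, independent steps)."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be between 0 and 1")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return p_step ** n_steps


for n in (1, 5, 20, 50):
    print(f"{n:>2} steps: {chain_success(0.95, n):.2f}")
# Prints 0.95, 0.77, 0.36, 0.08 for 1, 5, 20, and 50 steps.
```

A model that is right 95% of the time per step finishes a 20-step task cleanly only about a third of the time under these assumptions.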
Tacit context.
Most real work is not a clean prompt. It depends on old decisions, company politics, hidden constraints, customer history, and knowing which rule matters this time.
Accountability.
An LLM can draft the memo, test, spreadsheet, or summary. It cannot carry the legal, financial, medical, security, or reputational consequence when the output is wrong.
Taste.
Models average patterns. They can produce polished options, but they do not know what you are willing to stand behind unless a human supplies taste, strategy, and refusal.
Security.
Tool access turns mistakes into actions. The risk is not only bad text; it is a model reading files, changing code, leaking context, following prompt injection, or creating review debt.
A developer who refuses AI is not backwards.
Sometimes the right tool is still grep, docs, tests, and a quiet hour. That is especially true in mature systems where the hard part is not generating code, but respecting the existing architecture, edge cases, security posture, and the human promises buried in old decisions.
A 2025 RCT of experienced open-source developers found that AI tools slowed participants down on mature repos, even though the developers expected the opposite. That result will not apply to every team, model, or workflow. But it validates the instinct: if the model creates more review surface than useful work, opting out is rational.
The mature posture is not all-in or all-out. It is knowing when the output is cheap raw material and when it becomes expensive uncertainty.
Use aggressively: boilerplate, first drafts, code search, summaries, transforms, test scaffolds, option generation.
Use, then verify: financial models, SQL, production code, customer-facing copy, research briefs, legal-adjacent language.
Slow down: high-stakes medical, legal, lending, security, employment, safety, privacy, or anything you cannot personally verify.
The operating model.
Treat the model like a very fast junior analyst with broad memory, inconsistent judgment, and no skin in the game.
Give it bounded work: summarize this file, compare these options, draft tests for this function, find the mismatch in this table.
Ask for sources, line numbers, calculations, assumptions, and what would change the answer.
Open the file. Run the code. Click the link. Check the math. If the output cannot survive reality, it is just autocomplete with confidence.
The final call belongs to the operator. AI can accelerate the work, but the person ships the consequence.
The pro-tech position is not trust. It is instrumentation.
AI belongs in the stack. But the stack needs tests, logs, source checks, version control, privacy boundaries, human review, and a clear rule for when the model is allowed to act. The more powerful LLMs get, the more this matters.
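A "clear rule for when the model is allowed to act" can be as small as a lookup checked before every tool call. A minimal sketch: the action names, the three outcomes, and the `decide` helper are all hypothetical and not tied to any agent framework.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"              # the model may act on its own
    NEEDS_HUMAN = "needs_human"  # a person must approve first
    DENY = "deny"                # never performed by the model


# Hypothetical policy: read-only actions run freely, writes need review.
POLICY = {
    "read_file": Decision.ALLOW,
    "run_tests": Decision.ALLOW,
    "edit_code": Decision.NEEDS_HUMAN,
    "send_email": Decision.NEEDS_HUMAN,
    "deploy": Decision.DENY,
}


def decide(action: str, audit_log: list[str]) -> Decision:
    """Look up an action, fail closed on anything unlisted, and record the decision."""
    decision = POLICY.get(action, Decision.DENY)
    audit_log.append(f"{action} -> {decision.value}")
    return decision


log: list[str] = []
for requested in ("read_file", "edit_code", "drop_database"):
    print(requested, decide(requested, log).value)
```

Defaulting to deny means a newly added tool does nothing until someone writes a rule for it, and the log leaves a trail a reviewer can read afterwards.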
Sources.
OpenAI: hallucination note. Hallucinations remain a challenge; uncertainty and benchmark incentives matter.
Stanford HAI: 2026 AI Index Report. Current capability gains, jagged frontier examples, SWE-bench and OSWorld context.
METR: Task-completion time horizons. Why task length, messiness, and low-context work change the automation claim.
Stack Overflow: 2025 Developer Survey. Developer trust, frustration with almost-right output, and AI-agent adoption.
Becker, Rush, Barnes, Rein: AI and experienced developer productivity. RCT finding that early-2025 AI tools slowed experienced developers on mature projects.
NIST AI 600-1: Generative AI Profile. Risk categories including confabulation, privacy, human-AI configuration, and information integrity.