AI is leverage. Not judgment.

LLMs are useful enough to change the workflow. They are not reliable enough to remove ownership. The practical line: use them aggressively when verification is cheap, and slow down when being wrong is expensive, because the model cannot carry that cost.

Operator rule

If checking is cheaper than creating, use AI.

If checking is harder than doing it yourself, you are not saving time. You are moving risk into review.
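One way to make the rule concrete is to compare rough time costs. The sketch below is illustrative only: the function name, the minute estimates, and the fallback assumption (a failed check means redoing the work by hand) are not taken from any of the sources cited here.

```python
def worth_delegating(create_minutes: float,
                     check_minutes: float,
                     failure_rate: float,
                     prompt_minutes: float = 0.0) -> bool:
    """Return True if the expected cost of using AI beats doing it yourself.

    Assumption: when the check fails, you fall back to doing the work by
    hand, so you pay create_minutes on top of prompting and checking.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    expected_ai_cost = (prompt_minutes + check_minutes
                        + failure_rate * create_minutes)
    return expected_ai_cost < create_minutes


# Boilerplate: 30 min by hand, 5 min to check, fails 1 time in 5.
print(worth_delegating(30, 5, 0.2))   # expected AI cost 11 min vs 30 min

# Subtle change: 30 min by hand, 40 min to review properly.
print(worth_delegating(30, 40, 0.2))  # expected AI cost 46 min vs 30 min
```

The second case is the one the rule warns about: the review alone costs more than the work, so the time saving is an illusion.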

[Chart labels: use, draft, check, verify, own, decide. Caption: utility rises faster than accountability.]

Coding benchmarks: near 100%

SWE-bench Verified performance moved from roughly 60% to near 100% in one year. (Stanford AI Index 2026)

Real computer tasks: ~66%

OSWorld task success jumped, but agents still fail about 1 in 3 structured attempts. (Stanford AI Index 2026)

Developer trust gap: 46%

46% of developers said they distrust the accuracy of AI output, more than the share who said they trust it. (Stack Overflow 2025)

Experienced repo work: +19% task time

In one RCT, AI tools slowed experienced open-source developers on mature repos: tasks took about 19% longer. (Becker et al., 2025)

The jagged frontier.

The mistake is treating AI capability like one smooth curve. It is not. The same system can look elite in a coding benchmark, miss a simple visual task, produce an excellent first draft, and then invent a citation with total confidence.

That does not make the technology fake. It means the workflow has to be designed around uneven reliability.

Pattern completion: 94%

Bounded coding tasks: 84%

Tool-using agents: 66%

High-context judgment: 38%

Accountable ownership: 0%

Illustrative map, not a benchmark. The point is shape: model strength drops when the task moves from pattern work into messy context and human responsibility.

Where it still breaks.

Not anti-AI. Anti-blind-delegation.

Uncertainty.

The model can sound settled when the world is not. OpenAI's own hallucination note frames the issue as a persistent uncertainty problem, not something solved by a smarter next model alone.

Long horizon.

Agents can be excellent at bounded steps and still weak at work that takes many coordinated moves. METR's time-horizon work is useful because it measures reliability against the length of tasks, not just whether a benchmark answer looks right.
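A back-of-the-envelope way to see why length matters is to compound a per-step success rate. This is a simplified model, not METR's methodology: it assumes every step succeeds independently with the same probability, and the 95% figure is a made-up input.

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


def steps_until_below(per_step_success: float, target: float = 0.5) -> int:
    """Smallest step count at which chain success falls below target."""
    if not 0.0 < per_step_success < 1.0:
        raise ValueError("per_step_success must be strictly between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    steps = 0
    success = 1.0
    while success >= target:
        steps += 1
        success *= per_step_success
    return steps


# Hypothetical agent that gets each step right 95% of the time.
for n in (1, 10, 20):
    print(n, round(chain_success(0.95, n), 3))

print(steps_until_below(0.95))  # steps before the whole task is worse than a coin flip
```

Under these assumptions, an agent that looks strong on any single step is below even odds by the fourteenth step. Real tasks are messier, since errors can be caught and retried, but the shape of the curve is the point.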

Tacit context.

Most real work is not a clean prompt. It depends on old decisions, company politics, hidden constraints, customer history, and knowing which rule matters this time.

Accountability.

An LLM can draft the memo, test, spreadsheet, or summary. It cannot carry the legal, financial, medical, security, or reputational consequence when the output is wrong.

Taste.

Models average patterns. They can produce polished options, but they do not know what you are willing to stand behind unless a human supplies taste, strategy, and refusal.

Security.

Tool access turns mistakes into actions. The risk is not only bad text; it is a model reading files, changing code, leaking context, following prompt injection, or creating review debt.

A developer who refuses AI is not backwards.

Sometimes the right tool is still grep, docs, tests, and a quiet hour. That is especially true in mature systems where the hard part is not generating code, but respecting the existing architecture, edge cases, security posture, and the human promises buried in old decisions.

A 2025 RCT of experienced open-source developers found that AI tools slowed participants down on mature repos, even though the developers expected the opposite. That result will not apply to every team, model, or workflow. But it supports the instinct: if the model creates more review surface than useful work, opting out is rational.

The mature posture is not all-in or all-out. It is knowing when the output is cheap raw material and when it becomes expensive uncertainty.

Use hard

Boilerplate, first drafts, code search, summaries, transforms, test scaffolds, option generation.

Use with a leash

Financial models, SQL, production code, customer-facing copy, research briefs, legal-adjacent language.

Slow down

High-stakes medical, legal, lending, security, employment, safety, privacy, or anything you cannot personally verify.

The operating model.

Treat the model like a very fast junior analyst with broad memory, inconsistent judgment, and no skin in the game.

1. Narrow the task

Give it bounded work: summarize this file, compare these options, draft tests for this function, find the mismatch in this table.

2. Force receipts

Ask for sources, line numbers, calculations, assumptions, and what would change the answer.

3. Verify outside the chat

Open the file. Run the code. Click the link. Check the math. If the output cannot survive reality, it is just autocomplete with confidence.

4. Own the decision

The final call belongs to the operator. AI can accelerate the work, but the person ships the consequence.
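Steps 2 and 3 can be partly mechanized. The sketch below assumes you have asked the model to return each claim as a line number plus the exact text it is quoting; the `Receipt` type and `check_receipts` function are hypothetical names for illustration, not part of any tool mentioned here.

```python
from typing import NamedTuple


class Receipt(NamedTuple):
    line_number: int  # 1-indexed line the model says it is quoting
    quote: str        # text the model claims appears on that line


def check_receipts(source_text: str, receipts: list[Receipt]) -> list[Receipt]:
    """Return the receipts that do NOT match the source file."""
    lines = source_text.splitlines()
    failed = []
    for receipt in receipts:
        quote = receipt.quote.strip()
        in_range = 1 <= receipt.line_number <= len(lines)
        # An empty quote proves nothing, so treat it as a failure.
        if not quote or not in_range or quote not in lines[receipt.line_number - 1]:
            failed.append(receipt)
    return failed


source = "def total(xs):\n    return sum(xs)\n"
claims = [
    Receipt(2, "return sum(xs)"),    # real line, real text
    Receipt(1, "return sum(xs)"),    # real text, wrong line
    Receipt(5, "raise ValueError"),  # line does not exist
]
for bad in check_receipts(source, claims):
    print("unverified:", bad)
```

A check like this only confirms that the quoted text exists where the model says it does. Whether the quote supports the conclusion is still the operator's call.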

The pro-tech position is not trust. It is instrumentation.

AI belongs in the stack. But the stack needs tests, logs, source checks, version control, privacy boundaries, human review, and a clear rule for when the model is allowed to act. The more powerful LLMs get, the more this matters.
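A "clear rule for when the model is allowed to act" can be as small as a lookup table with a default of no. The sketch below is one possible shape, assuming a tool-using agent whose actions have names; the action names, the three-way decision, and the `gate` function are all hypothetical.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"              # model may act on its own
    NEEDS_HUMAN = "needs_human"  # model proposes, a person approves
    DENY = "deny"                # model may not do this at all


# Hypothetical action names; a real list depends on your tools.
POLICY = {
    "read_file": Decision.ALLOW,
    "run_tests": Decision.ALLOW,
    "edit_code": Decision.NEEDS_HUMAN,
    "send_email": Decision.NEEDS_HUMAN,
    "deploy": Decision.DENY,
}


def gate(action: str, log: list[str]) -> Decision:
    """Look up an action, deny anything unlisted, and record the decision."""
    decision = POLICY.get(action, Decision.DENY)
    log.append(f"{action} -> {decision.value}")
    return decision


audit_log: list[str] = []
for requested in ("read_file", "edit_code", "drop_database"):
    gate(requested, audit_log)
print("\n".join(audit_log))
```

The two design choices that matter are the default (anything not listed is denied) and the log (every request is recorded, including the refused ones).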

Sources.

OpenAI: Why language models hallucinate

Hallucinations remain a challenge; uncertainty and benchmark incentives matter.

Stanford HAI: 2026 AI Index Report

Current capability gains, jagged frontier examples, SWE-bench and OSWorld context.

METR: Task-completion time horizons

Why task length, messiness, and low-context work change the automation claim.

Stack Overflow: 2025 Developer Survey

Developer trust, frustration with almost-right output, and AI-agent adoption.

Becker, Rush, Barnes, Rein: AI and experienced developer productivity

RCT finding that early-2025 AI tools slowed experienced developers on mature projects.

NIST AI 600-1: Generative AI Profile

Risk categories including confabulation, privacy, human-AI configuration, and information integrity.