DIGITO

AI is leverage. Not judgment.

LLMs are useful enough to change the workflow. They are not reliable enough to remove ownership. The practical line: use them aggressively when verification is cheap, and slow down when the model cannot carry the cost of being wrong.

Operator rule

If checking is cheaper than creating, use AI.

If checking is harder than doing it yourself, you are not saving time. You are moving risk into review.
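The rule can be written down as a rough comparison. This is a minimal sketch, not a measurement tool: the `Task` fields, the example tasks, and every minute estimate are hypothetical, and counting prompting time alongside verification time is an assumption added here.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical time estimates, in minutes, for one piece of work."""
    name: str
    minutes_by_hand: float
    minutes_to_prompt: float
    minutes_to_verify: float


def should_use_ai(task: Task) -> bool:
    """Return True when prompting plus verifying is cheaper than doing it by hand."""
    ai_cost = task.minutes_to_prompt + task.minutes_to_verify
    return ai_cost < task.minutes_by_hand


# Illustrative numbers only; real estimates come from the operator.
boilerplate = Task("CRUD endpoint scaffold", 40, 5, 10)
legacy_fix = Task("patch in a mature billing module", 60, 10, 75)

print(should_use_ai(boilerplate))  # True: 15 minutes against 40
print(should_use_ai(legacy_fix))   # False: 85 minutes against 60
```

The second case is the one the rule warns about: the draft arrives fast, but the review costs more than the work.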

[Chart: use, draft, check, verify, own, decide. Utility rises faster than accountability.]

Coding benchmarks
near 100%

SWE-bench Verified performance moved from roughly 60% to near 100% in one year.

Stanford AI Index 2026

Real computer tasks
~66%

OSWorld task success jumped, but agents still fail about 1 in 3 structured attempts.

Stanford AI Index 2026

Developer trust gap
46%

46% of developers said they distrust the accuracy of AI output, more than the share who said they trust it.

Stack Overflow 2025

Experienced repo work
+19%

In one RCT, experienced open-source developers took about 19% longer to finish tasks on mature repos when they used AI tools.

Becker et al., 2025

The jagged frontier.

The mistake is treating AI capability like one smooth curve. It is not. The same system can look elite in a coding benchmark, miss a simple visual task, produce an excellent first draft, and then invent a citation with total confidence.

That does not make the technology fake. It means the workflow has to be designed around uneven reliability.

Pattern completion: 94%
Bounded coding tasks: 84%
Tool-using agents: 66%
High-context judgment: 38%
Accountable ownership: 0%

Illustrative map, not a benchmark. The point is shape: model strength drops when the task moves from pattern work into messy context and human responsibility.

Where it still breaks.

Not anti-AI. Anti-blind-delegation.

Uncertainty.

The model can sound settled when the world is not. OpenAI's own hallucination note frames the issue as a persistent uncertainty problem, not something solved by a smarter next model alone.

Long horizon.

Agents can be excellent at bounded steps and still weak at work that takes many coordinated moves. METR's time-horizon work is useful because it measures reliability against the length of tasks, not just whether a benchmark answer looks right.
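A back-of-the-envelope way to see why length matters: if each step succeeds independently with the same probability, the chance that the whole chain succeeds shrinks geometrically. This sketch is a simplification, not METR's methodology. The independence assumption, the `chain_success` helper, and the 95% per-step figure are all illustrative.

```python
def chain_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent, equal steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical agent that gets 95% of individual steps right.
for steps in (1, 5, 20, 50):
    print(f"{steps:>2} steps: {chain_success(0.95, steps):.1%} chance of a clean run")
```

Under these assumed numbers, a 20-step task finishes cleanly a bit over a third of the time, even though every single step looks reliable.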

Tacit context.

Most real work is not a clean prompt. It depends on old decisions, company politics, hidden constraints, customer history, and knowing which rule matters this time.

Accountability.

An LLM can draft the memo, test, spreadsheet, or summary. It cannot carry the legal, financial, medical, security, or reputational consequence when the output is wrong.

Taste.

Models average patterns. They can produce polished options, but they do not know what you are willing to stand behind unless a human supplies taste, strategy, and refusal.

Security.

Tool access turns mistakes into actions. The risk is not only bad text; it is a model reading files, changing code, leaking context, following prompt injection, or creating review debt.

A developer who refuses AI is not backwards.

Sometimes the right tool is still grep, docs, tests, and a quiet hour. That is especially true in mature systems where the hard part is not generating code, but respecting the existing architecture, edge cases, security posture, and the human promises buried in old decisions.

A 2025 RCT of experienced open-source developers found that AI tools slowed participants down on mature repos, even though the developers expected the opposite. That result will not apply to every team, model, or workflow. But it validates the instinct: if the model creates more review surface than useful work, opting out is rational.

The mature posture is not all-in or all-out. It is knowing when the output is cheap raw material and when it becomes expensive uncertainty.

Use hard

Boilerplate, first drafts, code search, summaries, transforms, test scaffolds, option generation.

Use with a leash

Financial models, SQL, production code, customer-facing copy, research briefs, legal-adjacent language.

Slow down

High-stakes medical, legal, lending, security, employment, safety, privacy, or anything you cannot personally verify.
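The three tiers can be read as a small decision rule. The sketch below is one possible encoding, not a policy: the `triage` function and its three yes/no inputs are hypothetical, and treating "reaches other people" as the trigger for the middle tier is an interpretation of the list above.

```python
def triage(can_verify: bool, high_stakes: bool, reaches_others: bool) -> str:
    """Map a task onto one of the three tiers.

    can_verify:     the operator can personally check the output
    high_stakes:    medical, legal, lending, security, and similar domains
    reaches_others: the output ships to production, customers, or decision makers
    """
    if high_stakes or not can_verify:
        return "slow down"
    if reaches_others:
        return "use with a leash"
    return "use hard"


print(triage(can_verify=True, high_stakes=False, reaches_others=False))  # use hard
print(triage(can_verify=True, high_stakes=False, reaches_others=True))   # use with a leash
print(triage(can_verify=False, high_stakes=False, reaches_others=False)) # slow down
```

The order of the checks matters: anything high-stakes or unverifiable lands in the slowest tier regardless of the other answers.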

The operating model.

Treat the model like a very fast junior analyst with broad memory, inconsistent judgment, and no skin in the game.

1. Narrow the task

Give it bounded work: summarize this file, compare these options, draft tests for this function, find the mismatch in this table.

2. Force receipts

Ask for sources, line numbers, calculations, assumptions, and what would change the answer.

3. Verify outside the chat

Open the file. Run the code. Click the link. Check the math. If the output cannot survive reality, it is just autocomplete with confidence.

4. Own the decision

The final call belongs to the operator. AI can accelerate the work, but the person ships the consequence.
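Steps 2 and 3 can be partly mechanical. If the model is asked to cite a line number and a quote for each claim, a few lines of code can check those receipts against the real file. This is a minimal sketch: the `Receipt` shape and the `check_receipts` helper are hypothetical, and a matching quote only proves the text exists, not that the model's conclusion follows from it.

```python
from dataclasses import dataclass


@dataclass
class Receipt:
    """A hypothetical citation returned by a model."""
    line_number: int  # 1-indexed line the model claims to quote
    quote: str        # text the model claims appears on that line


def check_receipts(source_text: str, receipts: list[Receipt]) -> list[bool]:
    """Return, for each receipt, whether the quote really appears on the cited line."""
    lines = source_text.splitlines()
    results = []
    for receipt in receipts:
        quote = receipt.quote.strip()
        in_range = 1 <= receipt.line_number <= len(lines)
        # An empty quote or an out-of-range line number never counts as verified.
        found = bool(quote) and in_range and quote in lines[receipt.line_number - 1]
        results.append(found)
    return results


source = "def total(items):\n    return sum(i.price for i in items)\n"
receipts = [
    Receipt(2, "sum(i.price for i in items)"),  # real line, real quote
    Receipt(5, "raise ValueError"),             # cites a line that does not exist
]
print(check_receipts(source, receipts))  # [True, False]
```

A failed receipt is a cheap signal to stop and read the file before trusting anything else in the answer.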

The pro-tech position is not trust. It is instrumentation.

AI belongs in the stack. But the stack needs tests, logs, source checks, version control, privacy boundaries, human review, and a clear rule for when the model is allowed to act. The more powerful LLMs get, the more this matters.
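A "clear rule for when the model is allowed to act" can be as simple as a default-deny gate with a log. The sketch below assumes a made-up set of action names and a hypothetical `request` wrapper. It is not tied to any real agent framework.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_HUMAN = "needs_human"
    DENY = "deny"


# Hypothetical action names; a real stack would define its own.
READ_ONLY = frozenset({"read_file", "search_code", "run_tests"})
NEEDS_APPROVAL = frozenset({"write_file", "open_pull_request"})

audit_log: list[tuple[str, str]] = []


def gate(action: str) -> Decision:
    """Decide whether a model-initiated action may proceed."""
    if action in READ_ONLY:
        return Decision.ALLOW
    if action in NEEDS_APPROVAL:
        return Decision.NEEDS_HUMAN
    return Decision.DENY  # anything unlisted is refused by default


def request(action: str) -> Decision:
    """Gate the action and record the outcome so a human can review it later."""
    decision = gate(action)
    audit_log.append((action, decision.value))
    return decision


print(request("run_tests").value)       # allow
print(request("write_file").value)      # needs_human
print(request("delete_database").value) # deny
```

Default deny is the design choice that matters here: a new capability stays off until someone deliberately adds it to a list.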



Where LLMs Still Break | May 2026