Tasks That Still Need Human Judgment in AI Code Generation

The pitch for AI code generation is that the typing is over. Describe the feature, accept the diff, ship. For a lot of work, that is basically true now: scaffolding, CRUD endpoints, test stubs, glue code, one-off scripts, and the boring 90% of a familiar feature can be generated faster than you can open the right files.

So here is the more interesting question, the one this article is about: when the machine writes most of the code, what is left for the human?

The short answer is judgment and intuition — and it turns out that is most of the actual job. As code generation gets cheap, the scarce skill is not production; it is deciding what to build, sensing when “correct-looking” output is quietly wrong, and owning what happens when it ships. One engineer, Isacco Sordo, called this the judgment bottleneck: the work did not disappear, the bottleneck just moved.

TL;DR

AI is genuinely good at producing code; it is not good at deciding, sensing, and owning. Those stay human.
Two distinct human skills carry the load: judgment (deliberate, defensible decisions about what is right) and intuition (fast, experience-built suspicion that something is off even when it looks fine).
The tasks that remain human cluster into a handful of areas: deciding what to build, turning ambiguous intent and tacit rules into specs, architecture and trade-offs, taste and conceptual integrity, catching confidently-wrong output, security- and safety-critical reasoning, hard debugging and novel problems, large cross-cutting refactors, and accountability.
A useful rule for how much human to keep: scale it to blast radius × ambiguity. Low-risk, well-specified work can run nearly autonomously; high-blast-radius or fuzzy work needs a human on the load-bearing paths.
This is good news for engineers and PMs: the work that is left is the interesting part.

What You Will Learn Here

The difference between judgment and intuition, and why current AI lacks both
A map of where humans now sit in the agentic coding loop
A concrete catalog of the tasks that still require a human, with why AI struggles at each
A decision map for what to delegate to AI versus what to own yourself
A real “confidently wrong” code example and how a human catches it
A keep-a-human-here checklist and a simple PM / engineer split of responsibilities

Judgment vs. Intuition: Two Different Human Superpowers

People use these words loosely, but they do different work, and AI fails at them for different reasons.

Judgment is deliberate. It is the ability to make a defensible decision under uncertainty, weighing trade-offs, consequences, and context that is not written down anywhere. “Should we build this at all?” “Is Postgres or a queue the right backbone here?” “Is this risk acceptable to ship on a Friday afternoon?” Judgment is slow, explicit, and accountable.

Intuition is fast. It is the pattern recognition you build from years of watching systems break: the quiet “this looks right, which is exactly why I do not trust it.” It is taste — the sense that an abstraction is wrong, a name is misleading, or a function is doing too much, before you can fully articulate why.

Why current AI lacks both:

Human skill	What it requires	Why AI struggles
Judgment	Goals, context, consequences, accountability	A model optimizes for plausible text, not for your goals; it has no stake in the outcome
Intuition	Lived experience of failure; taste	A model has patterns, not scars; it states wrong things in the same confident tone as right ones

Ruchit Suthar puts the intuition point sharply: AI “has no taste, only patterns,” and it “states wrong things in the exact confident tone it states right things.” That confident tone is the trap, and a human’s trained suspicion is the thing that catches it.

Where Humans Sit in the Loop Now

The old loop ran through your keyboard. The new loop runs around it — and the human checkpoints move to the ends and the risky middle.

HUMAN   decide what to build; set intent, constraints, acceptance criteria
  |
  v
AI      generate the boring 90%: scaffolding, CRUD, glue code
  |
  v
GATES   lint · types · tests · security scanners · CI            (automated)
  |
  v
AI      second-pass review (a different model catches many real bugs)
  |
  v
HUMAN   review by RISK: taste · intent gaps · security · load-bearing paths
  |
  v
HUMAN   accept & OWN it — you answer for production

Notice what the human no longer does (type the boring code) and what stays stubbornly human: the bookends (intent and ownership) and the load-bearing middle (risk-scaled review). As Addy Osmani notes, you can and should run a second model to review the first — it catches a lot of real bugs — but it still will not tell you whether this was the right change to build.

The Tasks That Still Need a Human

Here is the field guide. For each task: what it is, why AI struggles, and what the human actually contributes.

1. Deciding what to build (and what not to)

This is the highest-leverage human task and the one AI is least equipped for. A model will happily build the wrong thing beautifully. It has no opinion about whether the feature is worth it, whether a simpler version would test the hypothesis, or whether you should build nothing and delete a requirement instead.

Why AI struggles: it optimizes for completing the task you gave it, not for questioning the task. It has no product context, no roadmap, and no skin in the game.
Human contribution: problem framing, prioritization, and the discipline to say “not this.”

2. Turning ambiguous intent and tacit rules into a spec

Most real requirements live in people’s heads, Slack threads, and a decade of “obvious” business rules nobody wrote down. AI is strong when the spec is precise and weak when it has to infer intent.

Why AI struggles: it reviews the code that exists; it rarely flags the behavior nobody specified. Addy Osmani calls this “the awkward one… a human-shaped gap I do not expect to close soon.”
Human contribution: interrogating stakeholders, surfacing tacit rules, and writing the acceptance criteria the model will be measured against.

3. Architecture and high-consequence trade-offs

AI agents cannot see most of what shapes your architecture. They do not know your data volume, your consistency requirements, your team’s operational capacity, or your regulatory constraints, so they produce plausible-sounding but context-free recommendations.

Why AI struggles: architecture is a global optimization across constraints that live outside the repo. Models tend to be locally plausible and globally incoherent.
Human contribution: choosing abstractions and service boundaries, and adjudicating the trade-offs (cost vs. latency vs. simplicity vs. time) that Deloitte’s analysis of the future of engineering names as the core of the modern role.

4. Taste and conceptual integrity

Knowing good from merely functional. Whether a clever one-liner will haunt the next maintainer. Whether the system still hangs together as a coherent whole after the 40th generated pull request.

Why AI struggles: it has patterns, not taste; and across a long, multi-file session it loses track of earlier decisions (context limits and recency bias), so it drifts from its own conventions.
Human contribution: keeping the system conceptually coherent, and applying the human polish AI flattens — good error messages, humane empty states, sensible defaults.

5. Catching “confidently wrong” output

This is intuition’s home turf: AI-generated code that looks correct, passes the happy-path test, and is subtly, expensively wrong. One 2026 analysis of agents working under real engineering constraints found a configuration that hit a 78.6% assertion pass rate but only 8.3% pass@1 — partial correctness that looks impressive and is still too brittle to ship.

Why AI struggles: it states wrong things in the same tone as right ones, and partial correctness is its natural failure mode.
Human contribution: the trained suspicion to distrust clean-looking output and the experience to know where to look. (We will see a concrete example below.)
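The gap between those two numbers is easier to see with a toy calculation. The sketch below assumes that pass@1 is measured as the share of tasks whose first attempt passes every assertion; the type name, function names, and the data are all made up for illustration and are not taken from the cited analysis.

```typescript
// Each inner array holds the pass/fail result of every assertion for one task,
// from a single generation attempt. Hypothetical data, for illustration only.
type TaskResult = boolean[];

// Share of individual assertions that passed, pooled across all tasks.
function assertionPassRate(tasks: TaskResult[]): number {
  let passed = 0;
  let total = 0;
  for (const task of tasks) {
    for (const ok of task) {
      total += 1;
      if (ok) passed += 1;
    }
  }
  return total === 0 ? 0 : passed / total;
}

// Share of tasks where every assertion passed (our assumed reading of pass@1
// when there is exactly one attempt per task).
function taskPassRate(tasks: TaskResult[]): number {
  if (tasks.length === 0) return 0;
  const solved = tasks.filter((t) => t.length > 0 && t.every((ok) => ok));
  return solved.length / tasks.length;
}

const hypotheticalRun: TaskResult[] = [
  [true, true, true, true],  // fully correct
  [true, true, true, false], // one edge case wrong
  [true, true, false, true], // one edge case wrong
  [true, false, true, true], // one edge case wrong
];

console.log(assertionPassRate(hypotheticalRun)); // 0.8125 (13 of 16 assertions)
console.log(taskPassRate(hypotheticalRun));      // 0.25   (1 of 4 tasks)
```

One failing assertion per task is enough to sink the task, so a high assertion rate says little about how many changes are actually safe to merge.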

6. Security- and safety-critical reasoning

Authentication, authorization, cryptography, input handling, and anything where being wrong is a breach. AI-generated code in these areas is more likely to contain subtle vulnerabilities than purely algorithmic code is.

Why AI struggles: security is largely about what should not happen — an absence the model does not see. It reproduces insecure patterns from training data and omits checks nobody asked for explicitly.
Human contribution: threat modeling, deliberate review of the security-sensitive surface, and raising (not lowering) the review bar on generated code.

7. Hard debugging and genuinely novel problems

Debugging a complex distributed system, or solving a problem that is not in the training distribution, still requires reasoning the tools cannot perform reliably.

Why AI struggles: these require holding a large, partly-understood system in mind and reasoning about non-local, emergent behavior.
Human contribution: forming and testing hypotheses across system boundaries, and doing the actual engineering when there is no Stack Overflow answer.

8. Large, cross-cutting refactors

When a change touches many files, modules, or systems, agents start to lose track of which interfaces changed and which callers and invariants broke.

Why AI struggles: context windows and recency bias make earlier decisions “fade” across long sessions, so big refactors accumulate subtle inconsistencies.
Human contribution: sequencing the refactor, holding the invariants, and being the consistent memory the model lacks.

9. Accountability and ownership

Someone has to answer for the code in production, in the incident review, and in the security audit. That cannot be delegated to a model.

Why AI struggles: responsibility does not scale the way prompts do. As InfoWorld put it, a developer can review one AI change, maybe twenty — but a fleet of agents shipping hundreds of pull requests raises the question: who actually understands what is happening?
Human contribution: ownership. The person who merges it owns it — debugging it later, extending it, and explaining it.

A Decision Map: What to Delegate vs. What to Own

You do not need a human on every line. You need the right human attention on the right changes. The most useful single dial, borrowed from Addy Osmani’s framing, is blast radius: what happens when this breaks? Combine it with ambiguity: how clearly is the work specified?

                     LOW AMBIGUITY                  HIGH AMBIGUITY
                     (clear spec + tests)           (fuzzy intent, tacit rules)

HIGH BLAST RADIUS    REVIEW HARD                    HUMAN-LED
(auth, money,        AI drafts, but a human owns    Decide and spec first.
 data, infra)        the load-bearing review.       AI assists; human owns
                     e.g. payment refactor          design, review, and result.
                                                    e.g. a new authorization model

LOW BLAST RADIUS     AUTOPILOT                      CLARIFY FIRST
(internal tools,     Let AI run; skim the diff.     Cheap to get wrong, but pin
 tests, prototypes)  Spend judgment elsewhere.      down intent so you do not
                     e.g. CRUD, scaffolding         build the wrong thing fast.

The rule of thumb: the amount of human you keep should scale with blast radius, not guilt. Solo project with no users? Letting AI review almost everything is a defensible 2026 position. Maintaining something many people depend on? Let the machine handle the first pass and the boring 90%, but keep a real human on anything that can hurt someone.
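The decision map is small enough to write down as a function. This is only a sketch of the four quadrants above: the type names and the `reviewMode` function are hypothetical, and deciding whether a given change counts as "high" on either axis is still the human call the map is about.

```typescript
type Level = "low" | "high";
type ReviewMode = "AUTOPILOT" | "CLARIFY FIRST" | "REVIEW HARD" | "HUMAN-LED";

// Maps the two dials from the decision map to one of its four quadrants.
function reviewMode(blastRadius: Level, ambiguity: Level): ReviewMode {
  if (blastRadius === "high") {
    // Auth, money, data, infra: a human owns at least the load-bearing review.
    return ambiguity === "high" ? "HUMAN-LED" : "REVIEW HARD";
  }
  // Internal tools, tests, prototypes: cheap to get wrong.
  return ambiguity === "high" ? "CLARIFY FIRST" : "AUTOPILOT";
}

console.log(reviewMode("high", "low"));  // REVIEW HARD   (e.g. payment refactor)
console.log(reviewMode("high", "high")); // HUMAN-LED     (e.g. new authorization model)
console.log(reviewMode("low", "low"));   // AUTOPILOT     (e.g. CRUD, scaffolding)
console.log(reviewMode("low", "high"));  // CLARIFY FIRST
```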

A Code Example: The “Confidently Wrong” Diff

Theory is easy; the danger is concrete. Here is a realistic example of generated code that looks right, passes a basic test, and is quietly dangerous.

The prompt: “Add an endpoint to fetch an invoice by id.”

AI output (TypeScript / Express):

// GET /invoices/:id
router.get("/invoices/:id", requireAuth, async (req, res) => {
  const invoice = await db.invoice.findUnique({
    where: { id: req.params.id },
  });

  if (!invoice) {
    return res.status(404).json({ error: "Not found" });
  }

  return res.json(invoice);
});

It is clean. It checks auth. It handles the 404. The happy-path test passes: log in, request your own invoice, get a 200. Ship it?

No. This is a textbook IDOR (insecure direct object reference). requireAuth proves who you are, not what you are allowed to see. Any logged-in user can read any invoice — including other tenants’ — by guessing or enumerating ids. The model produced locally-plausible code and silently skipped the rule nobody wrote down: an invoice belongs to an account, and you can only read your own.

The intuition that fires here is not “I found a bug.” It is “an authenticated endpoint that never references the current user is suspicious.” Then judgment fills the gap with the tacit business rule:

// GET /invoices/:id
router.get("/invoices/:id", requireAuth, async (req, res) => {
  const invoice = await db.invoice.findUnique({
    where: { id: req.params.id },
  });

  // Authn proves who you are; authz proves you may see THIS row.
  if (!invoice || invoice.accountId !== req.user.accountId) {
    return res.status(404).json({ error: "Not found" });
  }

  return res.json(invoice);
});

Two things to notice. First, the fix encodes a business rule the prompt never mentioned — exactly the “behavior nobody specified” gap. Second, returning 404 (not 403) for someone else’s invoice is itself a judgment call: do not confirm the row exists to someone who should not know. A model will not make that call for you, because it is not a code-correctness question. It is a human one.

This is the whole article in one diff: AI generated it, automated tests passed, and the value came from a human’s suspicion plus a human’s knowledge of an unwritten rule.
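One way to keep that unwritten rule from getting lost again is to pin it down as a negative test: request someone else’s invoice and expect a 404. The sketch below is framework-free so that it runs on its own. The in-memory `invoices` array, the `getInvoice` function, and the sample ids are hypothetical stand-ins for the Express handler and database above, not part of the original code.

```typescript
interface Invoice { id: string; accountId: string; totalCents: number; }
interface User { id: string; accountId: string; }

// Hypothetical in-memory stand-in for the database table.
const invoices: Invoice[] = [
  { id: "inv_1", accountId: "acct_a", totalCents: 1200 },
  { id: "inv_2", accountId: "acct_b", totalCents: 9900 },
];

type Result =
  | { status: 200; body: Invoice }
  | { status: 404; body: { error: string } };

// Same rule as the fixed handler: a missing invoice and an invoice owned by
// another account are indistinguishable to the caller.
function getInvoice(user: User, id: string): Result {
  const invoice = invoices.find((i) => i.id === id);
  if (!invoice || invoice.accountId !== user.accountId) {
    return { status: 404, body: { error: "Not found" } };
  }
  return { status: 200, body: invoice };
}

const alice: User = { id: "u_1", accountId: "acct_a" };

console.log(getInvoice(alice, "inv_1").status);   // 200: her own invoice
console.log(getInvoice(alice, "inv_2").status);   // 404: another account's invoice
console.log(getInvoice(alice, "inv_404").status); // 404: does not exist
```

The second check is the one the happy-path test never made. Run against the original handler’s logic, it would return 200.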

A Keep-a-Human-Here Checklist

Use AI aggressively, but force a human checkpoint when any of these is true:

The change touches auth, money, PII, migrations, or shared infrastructure
The “right thing to build” is not obvious or is contested
Correct behavior depends on business rules that are not in the repo
The change is a large cross-cutting refactor (many files / invariants)
Output looks right but is hard to verify (concurrency, money math, edge cases)
A failure would be expensive or irreversible in production
You would not be comfortable explaining this code in an incident review

Everything else — scaffolding, internal tools, well-tested CRUD, prototypes — is a great place to let the machine run and spend your judgment elsewhere.
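Some of the checklist can be enforced mechanically, for example as a pre-merge check that flags changes for mandatory human review. The sketch below covers only the items a script can see (sensitive paths and change size). The path patterns, the 25-file threshold, and the function name are assumptions to adapt to your own repository; items such as contested intent or tacit business rules cannot be detected this way.

```typescript
// Hypothetical path patterns for auth, money, migrations, and shared infra.
// No "g" flag, so RegExp.test() stays stateless between calls.
const SENSITIVE_PATTERNS: RegExp[] = [
  /(^|\/)auth\//,
  /(^|\/)billing\//,
  /(^|\/)migrations\//,
  /(^|\/)infra\//,
];

// Arbitrary cut-off for "large cross-cutting refactor"; tune to taste.
const LARGE_CHANGE_FILE_COUNT = 25;

// Returns true when a change should not merge without a human checkpoint.
function needsHumanCheckpoint(changedFiles: string[]): boolean {
  if (changedFiles.length >= LARGE_CHANGE_FILE_COUNT) return true;
  return changedFiles.some((file) =>
    SENSITIVE_PATTERNS.some((pattern) => pattern.test(file)),
  );
}

console.log(needsHumanCheckpoint(["src/auth/session.ts"]));        // true
console.log(needsHumanCheckpoint(["scripts/seed.ts", "README.md"])); // false
```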

How PMs and Engineers Split the Judgment

Because so much of the remaining work is judgment, it is now a shared sport. A clean split:

Question	Who owns it	Why it stays human
What should we build, and why now?	PM (with eng input)	Product judgment, prioritization, tacit user needs
What does “done and correct” mean?	PM + Eng	Acceptance criteria the AI is measured against
How should it be built?	Eng	Architecture, trade-offs, system fit
Is it safe to ship?	Eng	Security, reliability, blast-radius review
Who answers when it breaks?	Eng (and the team)	Accountability does not scale to a model

The healthy pattern is not “PM prompts the AI and merges the pull request.” It is PMs and engineers spending the time AI freed up on the decisions that AI cannot make.

The Skills to Double Down On

If production is cheap, invest in the scarce things:

Put in the reps that build intuition. Juniors especially still need to trace bugs, read source, and hand-build hard modules, because that is how the “this smells wrong” sense is earned. Skip it and you can produce code faster than you can understand it.
Interrogate, do not just generate. Treat the model as a sparring partner: “Why is this right? What are the failure modes? What would break this?” The best engineers review AI output aggressively, like a teammate who joined last week.
Write the spec and the tests. Human-designed tests constrain the model’s search space; if it cannot make them pass cleanly, that is your signal to step in.
Practice owning the result. The differentiator is no longer who can produce code — it is who can decide, sense, and answer for it.
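As an illustration of “write the spec and the tests,” here is a sketch in which the human writes the acceptance criteria first and two candidate implementations are checked against them. The task (splitting an amount in cents into near-equal shares), the function names, and both candidates are hypothetical, and the checks assume positive integer inputs.

```typescript
type Splitter = (totalCents: number, parts: number) => number[];

// Human-written acceptance criteria, returned as a list of violations.
function violations(split: Splitter, totalCents: number, parts: number): string[] {
  const shares = split(totalCents, parts);
  const found: string[] = [];
  const sum = shares.reduce((a, b) => a + b, 0);
  if (shares.length !== parts) found.push("wrong number of shares");
  if (sum !== totalCents) found.push("shares do not add up to the total");
  if (shares.some((s) => !Number.isInteger(s))) found.push("fractional cents");
  if (Math.max(...shares) - Math.min(...shares) > 1) {
    found.push("shares differ by more than one cent");
  }
  return found;
}

// Plausible-looking candidate: rounds each share independently.
const roundEachShare: Splitter = (totalCents, parts) =>
  Array.from({ length: parts }, () => Math.round(totalCents / parts));

// Candidate that hands the leftover cents to the first few shares.
const distributeRemainder: Splitter = (totalCents, parts) => {
  const base = Math.floor(totalCents / parts);
  const remainder = totalCents % parts;
  return Array.from({ length: parts }, (_, i) => base + (i < remainder ? 1 : 0));
};

console.log(violations(roundEachShare, 100, 3));      // [ 'shares do not add up to the total' ]
console.log(violations(distributeRemainder, 100, 3)); // []
```

The first candidate reads fine in a diff and loses a cent. The criteria catch it because a human decided up front that the total must be preserved.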

Conclusion

AI did not remove the engineering from software. It removed the typing, which used to disguise the fact that judgment and taste were the real work all along.

The tasks that stay human are not the leftovers. They are the most interesting parts of the job: deciding what is worth building, sensing when something is quietly wrong, and standing behind what ships. When code gets cheap, those become more valuable, not less.

The machine can write the code. Deciding whether it is the right code — and owning what happens next — is still, emphatically, human work.

Sources

Agentic Code Review - Addy Osmani (blast radius, the 441.5% jump in median review duration, “the behavior nobody specified”)
The Judgment Bottleneck: Software Engineering in the Age of AI - Isacco Sordo, June 2026 (judgment bottleneck; “oracle instead of collaborator”; Deloitte synthesis)
Does AI Kill Craft? Taste and Judgment in Generated Code - Ruchit Suthar (“no taste, only patterns,” recognizing subtle wrongness, conceptual integrity)
AI coding agents need good software engineers - InfoWorld (responsibility does not scale; bounded tasks vs. architecture and business rules)
Why AI Coding Agents Fail When Software Gets Real - Antoine Buteau (constraints degrade quality; 78.6% assertion vs. 8.3% pass@1; data-layer failures)
AI Coding Agents in 2026: What They’re Actually Good At - Tanuj Garg (architecture context, security subtleties, large-refactor drift, tacit business logic)
AI coding assistants in 2026: where they help, where they hurt - The Curated Weekly (abstractions, debugging distributed systems, stakeholder discovery)
AI Coding Agents in Engineering: Where They Create Real Leverage - Tamer Fahmy (infrastructure, distributed systems, and security architecture stay human-heavy)
Future of Software Engineering - Deloitte (engineers supervise agents, adjudicate trade-offs, and tackle novel problems)

Luis Mori Guerra
