Modern Agent Engineering

Building a Senior-Engineer Agent: Orchestrating Subagents, Loop Engineering, and Auto-PRs

A practical research-backed guide to building a senior-engineer agent that plans work, writes structured subagent prompts, orchestrates execution, and opens pull requests, plus an honest map of the gaps, assumptions, and confidence needed for trustworthy auto-PRs.

TL;DR

  • A senior-engineer agent is not a smarter chatbot. It is an orchestrator: it plans work, writes structured prompts for specialist subagents, runs them in isolated workspaces, verifies the result, and only then opens a pull request.
  • The 2026 framing for this is loop engineering: stop prompting agents turn by turn and instead design the loop that prompts them. Boris Cherny (Claude Code lead) put it bluntly: “My job is to write loops.” Addy Osmani named the practice and gave it an anatomy.
  • The hard part is not generating code. It is generating trust. High-confidence auto-PRs need a maker/checker split, a verifiable exit condition (green CI, passing tests, clean diff), and durable state that survives between runs.
  • Borrow the ponytail instinct: the best senior engineer writes the least code that fully solves the problem. An orchestrator that ships fewer, narrower diffs is easier to review and trust.
  • Start with narrow, binary-pass/fail tasks (dependency bumps, codemods, flaky-test fixes). Earn autonomy incrementally. Do not start by letting an agent design your architecture and merge it at 3 a.m.
  • The biggest gaps today are comprehension debt, reward hacking on the verifier, context/knowledge coverage, and action bias (agents change code even when no change is needed).

What You Will Learn Here

  • What a “senior-engineer agent” actually is, and why orchestration (not raw intelligence) is the design goal.
  • How loop engineering and proactive agent workflows reframe the human role from prompter to loop designer.
  • A concrete reference architecture: planner, prompt-generator, worker subagents, verifier, and PR gate.
  • How to generate well-structured subagent prompts as specs, with a reusable template.
  • A confidence ladder answering the question: how much knowledge do you need before an auto-PR is trustworthy?
  • An honest list of gaps and assumptions, plus where ponytail, Ralph, and multi-agent frameworks fit.

This article is for engineers (and curious PMs) who already use coding agents daily and want to move from “I prompt an agent” to “I run a system that ships reviewable changes.”

The Senior-Engineer Instinct: Ship Less, Be Sure

Before architecture, a mindset. The viral ponytail project encodes the laziest-senior-dev rule into an agent: before writing code, stop at the first rung that holds.

1. Does this need to exist?  -> no: skip it (YAGNI)
2. Stdlib does it?           -> use it
3. Native platform feature?  -> use it
4. Installed dependency?     -> use it
5. One line?                 -> one line
6. Only then: the minimum that works

The point is not golfing tokens. It is that a senior engineer never trades away validation, error handling, security, or accessibility, but also never builds the 120-line cache class the task did not ask for. This matters enormously for automation: a small, necessary diff is reviewable; a sprawling, speculative diff is not. If you want auto-PRs you can trust, the orchestrator must inherit this restraint, not just raw generation power.
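One way to read the ladder is as an ordered list of checks where the first hit wins. The sketch below is illustrative only: the `Rung` type, the task keys (`needed`, `in_stdlib`, and so on), and the action strings are assumptions made for this example, not ponytail's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rung:
    question: str
    action: str
    holds: Callable[[dict], bool]


# Hypothetical task facts; a real agent would have to derive these by inspection.
LADDER = [
    Rung("Does this need to exist?", "skip it (YAGNI)",
         lambda t: not t.get("needed", True)),
    Rung("Stdlib does it?", "use the stdlib",
         lambda t: t.get("in_stdlib", False)),
    Rung("Native platform feature?", "use the platform feature",
         lambda t: t.get("native", False)),
    Rung("Installed dependency?", "use the installed dependency",
         lambda t: t.get("in_deps", False)),
    Rung("One line?", "write one line",
         lambda t: t.get("one_liner", False)),
]


def first_rung_that_holds(task: dict) -> str:
    """Walk the ladder top to bottom and stop at the first rung that holds."""
    for rung in LADDER:
        if rung.holds(task):
            return rung.action
    return "write the minimum that works"
```

The ordering is the whole design: a task that could be solved by an installed dependency never reaches the "write code" branch, even if a one-liner would also work.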

This is also the honest reframe of “senior engineer agent.” Seniority is mostly judgment about what not to do: which work to skip, which risk to escalate, when “no change” is the correct answer.

From Prompting to Loop Engineering

In the first week of June 2026, the AI-coding conversation reorganized around one idea. Peter Steinberger (creator of the OpenClaw agent project, now at OpenAI) posted that the skill is no longer prompting coding agents but designing the loops that prompt them. Boris Cherny, who leads Claude Code at Anthropic, described it on the Acquired podcast as: “I don’t prompt Claude anymore. I have loops running that prompt Claude and figure out what to do. My job is to write loops.” Addy Osmani’s widely shared essay gave the practice a name, loop engineering, and an anatomy.

Osmani’s breakdown maps near-identically onto both the Codex app and Claude Code (the convergence is the interesting part). A loop has six parts:

| Primitive | Job | Why it matters for a senior-engineer agent |
| --- | --- | --- |
| Automations | Scheduled/triggered discovery + triage | Turns “one run” into a heartbeat: find work, surface it to an inbox |
| Worktrees | Isolated git workspaces | Parallel subagents don’t collide on the same files |
| Skills | Reusable SKILL.md knowledge | Stop re-deriving project conventions every cycle |
| Connectors (MCP) | Integrate Jira, Slack, DBs, CI | The difference between “here’s the fix” and “PR opened, ticket updated” |
| Sub-agents | Maker/checker division of labor | The model that wrote the code grades its own homework too kindly |
| Durable state | External file/board of progress | The model forgets between runs; memory must live outside the context |
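To make the Worktrees row concrete, here is a minimal sketch of giving each task its own checkout with `git worktree add`. The `agent/<task-id>` branch naming, the sibling-directory layout, and the `main` base branch are assumptions for illustration; adapt them to your repository.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, task_id: str, base: str = "main") -> list[str]:
    """Build the git call that creates an isolated checkout for one task."""
    branch = f"agent/{task_id}"                       # hypothetical naming scheme
    path = repo.parent / f"{repo.name}-wt-{task_id}"  # sibling dir, outside the repo
    # git worktree add -b <new-branch> <path> <commit-ish>
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]


def create_worktree(repo: Path, task_id: str, base: str = "main") -> Path:
    """Run the command and return the new worktree path (raises if git fails)."""
    cmd = worktree_command(repo, task_id, base)
    subprocess.run(cmd, check=True)
    return Path(cmd[-2])
```

Splitting command construction from execution keeps the part that matters (branch and path naming) testable without touching a real repository.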

There is a useful three-floor mental model here:

  • Harness engineering — the runtime around one agent (tools, memory, permissions).
  • Loop engineering — the harness that runs on a schedule, spawns helpers, and feeds itself from disk.
  • Orchestration — fleets of agents across worktrees, PRs, and CI, with failures routed back to the right session.

A senior-engineer agent lives at the loop/orchestration boundary.

Reference Architecture

Here is a practical shape that several open systems converge on (the maker/checker split mirrors human teams). Think of the senior-engineer agent as a thin orchestrator that mostly delegates.

                        ┌─────────────────────────────┐
   trigger (issue,      │   Senior-Engineer Agent      │
   schedule, label) ──> │       (orchestrator)         │
                        └──────────────┬──────────────┘
                                       │ 1. read goal + durable state
                                       v
                              ┌────────────────┐
                              │    PLANNER      │  decompose into atomic,
                              │  (subagent)     │  dependency-aware tasks
                              └───────┬─────────┘
                                      │ 2. emit task graph
                                      v
                          ┌──────────────────────┐
                          │   PROMPT GENERATOR    │  turn each task into a
                          │   (structured spec)   │  self-contained subagent brief
                          └──────────┬────────────┘
                                     │ 3. dispatch (parallel, isolated worktrees)
              ┌──────────────────────┼──────────────────────┐
              v                      v                      v
        ┌───────────┐         ┌───────────┐          ┌───────────┐
        │ WORKER #1 │         │ WORKER #2 │   ...    │ WORKER #N │  implement
        │ (maker)   │         │ (maker)   │          │ (maker)   │
        └─────┬─────┘         └─────┬─────┘          └─────┬─────┘
              └──────────────────────┼──────────────────────┘
                                     v
                            ┌─────────────────┐
                            │    VERIFIER     │  4. build, tests, lint,
                            │    (checker)    │     security, contract checks
                            └───────┬─────────┘
                            pass ┌──┴──┐ fail
                                 v     v
                          ┌────────┐  └─> route failure + diff back to PLANNER
                          │  PR    │      (bounded retries, then escalate)
                          │ GATE   │  5. open PR, update ticket, write state
                          └────────┘
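Steps 2 and 3 of the diagram (emit a task graph, then dispatch in parallel) can be sketched with the standard library's `graphlib`. This is one possible scheduling approach under the assumption that the planner emits a mapping from each task to the tasks it depends on; the task names are made up for the example.

```python
from graphlib import TopologicalSorter


def dispatch_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; every task in a batch can run in parallel.

    `deps` maps a task id to the ids it depends on. A circular dependency
    makes `prepare()` raise graphlib.CycleError, which is the right moment
    to send the plan back to the planner.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)                 # a real loop would mark these as workers finish
    return batches
```

For a plan such as `{"migrate-schema": set(), "update-api": {"migrate-schema"}, "update-client": {"migrate-schema"}, "e2e-test": {"update-api", "update-client"}}`, the two update tasks land in the same batch and can each get their own worktree.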

Two design rules make or break this:

  1. The maker never approves itself. A separate verifier (different prompt, ideally read-only tools for review) decides pass/fail. Self-grading is one of the most common ways auto-PRs go wrong.
  2. State lives on disk, not in context. Tasks, status, attempts, and decisions go into a file or board (the Ralph pattern treats the filesystem and git as the primary memory). Each iteration can start fresh and still know what’s done.
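As a rough illustration of the second rule, the sketch below keeps task status and attempt history in a single JSON file. The `FileState` class, its method names, and the file layout are assumptions for this example, not the format used by Ralph or any other tool.

```python
import json
from pathlib import Path


class FileState:
    """Task status and attempt history, kept in one JSON file on disk."""

    def __init__(self, path: Path):
        self.path = path

    def _load(self) -> dict:
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def _save(self, data: dict) -> None:
        # Write to a temp file, then rename, so readers should not see a half-written file.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(data, indent=2))
        tmp.replace(self.path)

    def _entry(self, data: dict, task_id: str) -> dict:
        return data.setdefault(task_id, {"status": "todo", "attempts": []})

    def mark(self, task_id: str, status: str) -> None:
        data = self._load()
        self._entry(data, task_id)["status"] = status
        self._save(data)

    def record_attempt(self, task_id: str, attempt: int, passed: bool, summary: str) -> None:
        data = self._load()
        self._entry(data, task_id)["attempts"].append(
            {"attempt": attempt, "passed": passed, "summary": summary}
        )
        self._save(data)

    def status(self, task_id: str) -> str:
        return self._load().get(task_id, {}).get("status", "todo")

    def history(self, task_id: str) -> list:
        return self._load().get(task_id, {}).get("attempts", [])
```

Every method re-reads the file, so a brand-new process (or a fresh agent context) sees exactly what earlier iterations recorded. Committing the file to git alongside the work adds an audit trail for free.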

Generating Well-Structured Subagent Prompts

The orchestrator’s most underrated job is prompt generation. A subagent usually fails not because it is dumb but because it received a vague brief. Treat each subagent prompt as a spec, not a sentence.

A reliable structured brief has six fields:

ROLE        narrow job description (one specialty, not "senior engineer")
CONTEXT     only what this task needs: files, conventions, constraints
GOAL        the single outcome, stated as a verifiable condition
NON-GOALS   what to skip (YAGNI / ponytail restraint)
TOOLS       allowed tools + permission boundaries
DONE-WHEN   the exact pass/fail signal (tests, build, diff shape)

Concretely, the orchestrator might emit this for one task in the graph:

ROLE: Backend test-fix specialist. You only fix the failing test and its cause.

CONTEXT:
- Repo uses Vitest. Run tests with `npm test`.
- Failing test: `src/auth/session.test.ts > refreshes expired token`.
- Convention: no new dependencies without escalation.

GOAL: Make the failing test pass without weakening assertions.

NON-GOALS:
- Do not refactor unrelated files.
- Do not add a caching layer or new abstraction.
- If the test is wrong (not the code), stop and report evidence.

TOOLS: Read, Edit, Run(`npm test`, `npm run lint`). No network. No deps.

DONE-WHEN:
- `npm test` is green AND `npm run lint` passes.
- `git diff` touches only `src/auth/**` and the test file.

Notice the spec encodes restraint (NON-GOALS), a binary exit (DONE-WHEN), and an inaction path (“if the test is wrong, stop and report”). That last line is not optional politeness. A May 2026 paper (Coding Agents Don’t Know When to Act) found agents proposed undesirable changes on many no-change tasks. Action bias is a real failure mode; the brief must make “no change” a legal outcome.
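The diff-shape condition in DONE-WHEN can be checked mechanically, and the same check keeps "no change" legal because an empty diff is trivially in scope. A minimal sketch, assuming the verifier already has the list of changed paths (for example from `git diff --name-only`); the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_paths: list[str], allowed_globs: list[str]) -> list[str]:
    """Return the changed paths that match none of the allowed patterns.

    Note: fnmatch's `*` also matches `/`, so `src/auth/**` covers nested
    directories. An empty list of changes returns [], i.e. "no change" passes.
    """
    return [
        path
        for path in changed_paths
        if not any(fnmatchcase(path, pattern) for pattern in allowed_globs)
    ]
```

A non-empty result is a reason to fail the attempt or escalate, not to silently trim the diff.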

A small generator that turns a task-graph node into that brief looks like this:

// One node of the task graph, mirroring the six fields of the brief.
type Task = {
  role: string;
  goal: string;
  context: string[];
  nonGoals: string[];
  tools: string[];
  doneWhen: string[];
};

// Render a task as the plain-text brief handed to a worker subagent.
function renderBrief(t: Task): string {
  // Format a string array as a "- item" bullet list.
  const list = (xs: string[]) => xs.map((x) => `- ${x}`).join("\n");
  return [
    `ROLE: ${t.role}`,
    ``,
    `CONTEXT:\n${list(t.context)}`,
    ``,
    `GOAL: ${t.goal}`,
    ``,
    `NON-GOALS:\n${list(t.nonGoals)}`,
    ``,
    `TOOLS: ${t.tools.join(", ")}`,
    ``,
    `DONE-WHEN:\n${list(t.doneWhen)}`,
  ].join("\n");
}

The orchestrator’s quality is mostly the quality of these briefs. If you can only improve one thing, improve the prompt generator, not the worker model.
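One cheap way to improve the prompt generator is to lint every brief before dispatch. The sketch below is written in Python to match the loop pseudocode later in the article. It assumes the task is a plain dict with snake_case keys, and it uses a deliberately crude heuristic (the phrase "stop and report") to detect an inaction path. Both are assumptions for illustration.

```python
REQUIRED_FIELDS = ("role", "context", "goal", "non_goals", "tools", "done_when")


def brief_problems(task: dict) -> list[str]:
    """Return human-readable problems; an empty list means the brief may be dispatched."""
    problems = [
        f"missing or empty field: {name}"
        for name in REQUIRED_FIELDS
        if not task.get(name)
    ]
    # Heuristic only: look for an explicit "no change is allowed" instruction.
    non_goals = " ".join(task.get("non_goals") or []).lower()
    if "stop and report" not in non_goals:
        problems.append("no inaction path: add a 'stop and report' non-goal")
    return problems
```

Rejecting a brief here costs nothing; discovering the same gap after a worker has produced a sprawling diff costs a full attempt.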

How Much Knowledge Do You Need for High-Confidence Auto-PRs?

This is the question the whole system rises or falls on. An auto-PR is only as trustworthy as the verifiable signal behind it. Confidence is not a vibe; it is the product of coverage across a few axes.

Think of it as a confidence ladder. Each rung adds knowledge the system must have before it can open a PR you’d merge unread.

        higher autonomy (merge unread)

   L5 │ Org policy + ownership + change-budget guardrails
   L4 │ Maker/checker split with an INDEPENDENT verifier
   L3 │ Strong, fast verifiable signal (tests + CI + types + lint)
   L2 │ Codebase conventions captured (skills / AGENTS.md)
   L1 │ Task is narrow with a binary pass/fail outcome
   L0 │ Goal is unambiguous and reversible

        lower autonomy (human reviews everything)

The practical reading:

  • L0–L1 (start here). Pick tasks that are reversible and have a binary signal: dependency bumps, codemods, lint fixes, flaky-test repairs. Steinberger’s advice is to start exactly here, run the loop in production for a few weeks, then expand.
  • L2. The agent must know your conventions without re-deriving them. This is what AGENTS.md/CLAUDE.md and skills are for. Missing conventions are a leading source of “technically correct, culturally wrong” diffs.
  • L3. Auto-PRs are only as good as the test/CI suite they must satisfy. If your verifier is weak, autonomy is dangerous, not impressive. Coverage of the changed surface is the real currency here.
  • L4. A separate verifier prevents the maker from grading itself. Without it, you get reward hacking: the agent makes the signal green by deleting the test, loosening an assertion, or catching-and-ignoring.
  • L5. For unattended merges, you need ownership rules, blast-radius limits (change budgets, protected paths), and policy. This is the gap between “a loop in my terminal” and “a loop the enterprise runs at 3 a.m.”
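The L5 guardrails (change budgets and protected paths) are simple enough to express as data plus one check. The sketch below is hypothetical: the `ChangeBudget` fields, the default limits, and the protected patterns are placeholders, and `stats` is assumed to map each changed path to its added-plus-deleted line count (for example, derived from `git diff --numstat`).

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ChangeBudget:
    max_files: int = 10                                   # placeholder limit
    max_lines: int = 200                                  # placeholder limit
    protected: tuple[str, ...] = (".github/*", "infra/*") # fnmatch `*` also matches `/`


def budget_violations(stats: dict[str, int], budget: ChangeBudget) -> list[str]:
    """Return every way the diff exceeds its blast-radius limits (empty list = OK)."""
    violations = []
    if len(stats) > budget.max_files:
        violations.append(f"{len(stats)} files changed (budget {budget.max_files})")
    total = sum(stats.values())
    if total > budget.max_lines:
        violations.append(f"{total} lines changed (budget {budget.max_lines})")
    for path in sorted(stats):
        if any(fnmatchcase(path, pattern) for pattern in budget.protected):
            violations.append(f"protected path touched: {path}")
    return violations
```

Returning all violations rather than a single boolean gives the escalation message something concrete to show the human owner.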

So, how much knowledge? Enough that the verifier can prove the goal was met without a human re-deriving it. Concretely, for a feature-level auto-PR you want: the task spec, the relevant conventions, a test that encodes the acceptance criteria, an independent reviewer pass, and a bounded blast radius. Anything less and the PR is a draft for a human, which is still useful but is not “high-confidence auto.”
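Read this way, the ladder is a gate rather than a score. The sketch below encodes one interpretation, stated here as an assumption: rungs only count when every rung beneath them also holds, and anything short of all six yields a draft for human review.

```python
RUNGS = (
    "goal is unambiguous and reversible",                # L0
    "task is narrow with a binary pass/fail outcome",    # L1
    "codebase conventions captured",                     # L2
    "strong, fast verifiable signal",                    # L3
    "maker/checker split with an independent verifier",  # L4
    "org policy, ownership, change-budget guardrails",   # L5
)


def highest_rung(satisfied: set[int]) -> int:
    """Highest N such that L0..LN all hold; -1 if even L0 fails."""
    level = -1
    for n in range(len(RUNGS)):
        if n not in satisfied:
            break
        level = n
    return level


def pr_mode(satisfied: set[int]) -> str:
    """Map ladder coverage to how the resulting PR should be treated."""
    if highest_rung(satisfied) == len(RUNGS) - 1:
        return "high-confidence auto-PR"
    return "draft for human review"
```

The contiguity rule is the point: an independent verifier (L4) on top of a thin test suite (no L3) still only reaches L2, because the verifier has nothing strong to verify against.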

A Minimal Proactive Loop

Proactive agents do not wait for a prompt; they receive a goal, discover work, and surface findings. Here is the smallest honest version of the outer loop, expressed as pseudocode the orchestrator runs per task.

def run_task(task, max_attempts=3):
    worktree = create_worktree(task.id)          # isolation: no file collisions
    state.mark(task, "in_progress")

    for attempt in range(1, max_attempts + 1):
        brief = render_brief(task)               # structured spec, not a sentence
        diff = maker_agent.run(brief, cwd=worktree)

        report = verifier_agent.run(             # independent checker
            diff=diff, suite=["build", "test", "lint", "security"]
        )
        state.record_attempt(task, attempt, report)  # log every attempt, pass or fail

        if report.passed and within_change_budget(diff):
            pr = open_pr(worktree, task, report)  # connector does the side effect
            state.mark(task, "pr_open", pr.url)
            return pr

        task = add_feedback(task, report)        # feed failure back in

    state.mark(task, "escalated")                # make the hand-off visible in state
    escalate_to_human(task, state.history(task)) # bounded retries, then stop

Three properties make this safe rather than reckless:

  • Isolation — each task runs in its own worktree, so parallel makers never corrupt each other.
  • Bounded retries — unbounded “keep trying” wastes money and hides bad assumptions. Cap attempts, then escalate.
  • Durable state — every attempt is recorded outside the model so the loop is resumable and auditable.

You can wrap this in an automation (a schedule or an event like a new agent-ready label) and a triage inbox, where runs that find something surface for review and runs that find nothing archive themselves. That inbox is what turns a one-off script into a loop you actually keep.
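The triage rule is small enough to show in full. In this sketch, `RunResult` and its fields are hypothetical stand-ins for whatever your automation records per run.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    run_id: str
    findings: list[str] = field(default_factory=list)


def triage(results: list[RunResult]) -> tuple[list[RunResult], list[RunResult]]:
    """Split runs into (inbox, archived): only runs with findings need a human."""
    inbox = [r for r in results if r.findings]
    archived = [r for r in results if not r.findings]
    return inbox, archived
```

Archiving empty runs instead of discarding them keeps the audit trail while leaving the inbox short enough that people keep reading it.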

Gaps and Assumptions (Read Before You Automate)

Honesty section. The architecture above is sound, but it rests on assumptions that fail in real repos.

  • Assumption: the verifier is trustworthy. If tests are thin, the green checkmark is theater. Reward hacking (deleting/loosening tests to pass) is a documented failure mode. Mitigation: review the test diff separately, and forbid test deletion without escalation.
  • Assumption: conventions are captured. If your norms live only in senior engineers’ heads, the agent cannot follow them. Mitigation: invest in AGENTS.md/skills before autonomy, not after.
  • Comprehension debt. Even when the code is correct, it may be that nobody on the team understands it. Merged, unread, agent-written changes accumulate a debt that compounds. Mitigation: keep diffs small (the ponytail instinct) and require human review above L4.
  • Action bias. Agents over-act. The “no change is a valid outcome” path must be explicit in every brief.
  • Cost and latency are not free. Loop engineering benchmarks vary by model: a terse reasoning model can spend more tokens thinking than it saves in output, and because the ruleset is re-injected every turn, per-session cost can land either way. Measure your own loop.
  • Parallelism multiplies review load. N parallel subagents produce N diffs. If you can’t review them, you didn’t gain leverage, you moved the bottleneck.
  • Enterprise gap. A loop on your laptop is not a governed runtime. Unattended merges need ownership, audit logs, policy, and blast-radius limits that most personal setups lack.

The summary assumption: you are still the engineer. Loop engineering moves the leverage point up a floor; it does not remove you. Build the loop like someone who intends to stay responsible for what it ships.

Where the Ecosystem Is Today

A quick, source-grounded map so you can place these ideas:

  • Loop engineering (Osmani, Cherny, Steinberger) is the conceptual frame: design the loop, not the turn.
  • Ralph / Ralph TUI (after Geoffrey Huntley’s pattern) implements continuous loops that treat git + filesystem as memory and process a backlog of atomic tasks until done.
  • Multi-agent SWE frameworks (e.g., auto-swe-agent with LangGraph state machines, the ALMAS research vision aligning agents to agile roles) formalize the planner → coder → executor → verifier division of labor.
  • ponytail is the cultural counterweight: the best senior engineer ships the least code that fully solves the problem, which is exactly the property that makes auto-PRs reviewable.

These are not competing religions. A good senior-engineer agent borrows the loop from loop engineering, the maker/checker split from multi-agent frameworks, the filesystem-as-memory discipline from Ralph, and the restraint from ponytail.

Existing Gaps in This Article

In the spirit of the confidence ladder, here is what this article does not yet cover:

  • No head-to-head tool benchmark. Codex app, Claude Code, Cursor, and OpenClaw are evolving weekly; specific numbers would age fast.
  • No production security model. Tool permissions, secret handling, and MCP consent for an unattended orchestrator deserve their own article.
  • No real case study. The architecture is synthesized from primary sources and open projects, not from a single measured production deployment.
  • Light on evaluation. “How do you score the orchestrator itself over time?” (trends in merge rate, revert rate, and review time) is only sketched.
  • Cost modeling is qualitative. Per-task token/latency/$ budgets are mentioned but not modeled.

Recommended follow-up sections for a future version:

  • a concrete AGENTS.md + skill set tuned for an orchestrator
  • a verifier hardening checklist (anti-reward-hacking rules)
  • a change-budget / protected-path policy example
  • a revert-rate and comprehension-debt dashboard for PMs
  • a worked dependency-bump loop, end to end, with real CI

Sources