Most engineers first meet AI agents through prompts, tool calling, and a quick demo UI. That is enough to build something interesting, but not enough to build something you can trust in production. Official guidance from OpenAI, Anthropic, Google Cloud, and Microsoft points in the same direction: production-grade agent engineering is a systems discipline, not a prompt trick. It spans workflow design, context management, tool contracts, memory, approvals, evaluation, tracing, governance, security, reliability, cost, and operations (OpenAI AgentKit, Anthropic context engineering, Google Cloud architecture guidance, Microsoft Agent Framework overview).
TL;DR
- Building agents is broader than prompts plus tool calling. The hard part is designing the runtime, context, tools, guardrails, evals, observability, and operating model around the model.
- The most-missed topics are context engineering, evals, observability, human approvals, safety/security, reliability engineering, and business metrics.
- The best study order is: foundations -> single-agent loops -> context and memory -> tool design -> UI and approvals -> evals -> tracing -> safety and reliability -> deployment and cost -> multi-agent.
- Start with a single agent and a strong eval harness. OpenAI, Google Cloud, and Microsoft all explicitly recommend starting simple and only introducing more orchestration when the task actually demands it (OpenAI practical guide, Google Cloud architecture guidance, Microsoft Agent Framework overview).
What you’ll learn here
- How to distinguish chatbots, tool-using agents, workflow-based systems, and long-running or multi-agent systems
- The full topic map for modern agent engineering
- Which topics your current list already covers well, and which important gaps remain
- A phased learning roadmap and a 14-week study plan
- Three portfolio projects that prove real production skills instead of only demo skills
A simple mental model
Chatbot
-> single-turn or short-memory conversation
Tool-using agent
-> model decides which tool to call next
Workflow-based agent system
-> explicit orchestration + tool calls + approvals + branching
Long-running / multi-agent system
-> state, memory, background work, delegation, tracing, governance
OpenAI frames the jump clearly: workflows follow predefined steps, while agents start with a goal, plan, use tools, adapt, and request clarification when needed (OpenAI business leader guide). Microsoft makes the same distinction and adds a useful practical rule: if you can solve the task with a normal function or workflow, do that first (Microsoft Agent Framework overview).
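Microsoft's rule of thumb is easy to make concrete. The sketch below contrasts a fixed workflow, where the branching lives in ordinary code, with the adaptive loop an agent would need; `classifyTicket` is a hypothetical helper standing in for a single model or API call, not a real library function.

```typescript
// A fixed workflow: the steps and branches are decided in code, not by the
// model, so the control flow is fully visible and unit-testable.
type Ticket = { subject: string; body: string };

function classifyTicket(t: Ticket): "billing" | "technical" | "other" {
  // Stand-in for one LLM classification call.
  return t.subject.toLowerCase().includes("invoice") ? "billing" : "other";
}

function handleTicketWorkflow(t: Ticket): string {
  // Predefined branching: if this is all the task needs, an agent is overkill.
  const category = classifyTicket(t);
  if (category === "billing") return "route:billing-queue";
  if (category === "technical") return "route:tech-queue";
  return "route:human-triage";
}
```

If every ticket fits one of these branches, ship the workflow. An agent earns its place only when the next step genuinely cannot be enumerated in advance.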
A production agent stack
User / Trigger
|
v
UI stream or API request
|
v
Agent runtime loop
|
+--> context loader (state, retrieval, memory)
+--> model reasoning / planning
+--> policy + approval gate
+--> tools / APIs / MCP / connectors
+--> traces, eval hooks, logs, metrics
|
v
Final response or handoff to human
This is why agent engineering is broader than prompt engineering. Anthropic explicitly argues that the center of gravity has shifted from prompt engineering to context engineering, meaning the whole configuration of state, tools, examples, memory, and retrieved information that reaches the model at each step (Anthropic context engineering).
A tiny runtime you should understand before using frameworks
type StepResult = {
  text?: string;
  toolCalls?: Array<{ name: string; args: unknown }>;
  needsApproval?: boolean;
};

async function runAgent(goal: string) {
  const trace: Array<{ step: number; result: StepResult }> = [];
  let state = await loadConversationState(goal);
  // Hard step limit keeps the loop bounded even if the model never converges.
  for (let step = 0; step < 8; step++) {
    const context = await buildContext(state);
    const result: StepResult = await model.planAct(context);
    trace.push({ step, result });
    // Policy gate: hand risky actions to a human before executing them.
    if (result.needsApproval) {
      return requestHumanApproval(trace);
    }
    // No tool calls means the model is done: persist the trace and answer.
    if (!result.toolCalls?.length) {
      await saveTrace(trace);
      return result.text ?? "No answer";
    }
    const toolResults = await runToolsSafely(result.toolCalls);
    state = updateState(state, toolResults);
  }
  // Step budget exhausted: escalate instead of looping forever.
  return escalateToHuman("step_limit_exceeded");
}
You do not need to ship this exact loop by hand forever. But you do need to understand it. OpenAI, Anthropic, Google Cloud, and Microsoft all expose different frameworks and managed runtimes, yet they all keep the same fundamentals: state, tools, memory, approvals, observability, and explicit control over when the loop should stop (OpenAI practical guide, Anthropic writing tools, Google Cloud ADK docs, Microsoft Agent Framework overview).
1. Executive summary
Building agents is broader than prompts and tool calling because the agent is only one piece of the system. Production systems also need state management, memory, retrieval, tool contracts, UI streaming, approvals, evaluation, tracing, safety, reliability, and governance. OpenAI now splits those concerns across Agent Builder, Evals, trace grading, conversation state, background mode, and safety guidance; Google Cloud does the same with architecture components, ADK, sessions, Memory Bank, tracing, logging, monitoring, and access control; Microsoft separates agents, workflows, observability, governance, and AI platform controls; Anthropic explicitly reframes the problem as context engineering, not just better prompts (OpenAI AgentKit, Google Cloud architecture guidance, Microsoft governance for AI agents, Anthropic context engineering).
The most important missing topics are usually:
- Context engineering: what the model sees each turn matters more than endlessly rewriting a system prompt (Anthropic context engineering).
- Evals and trace analysis: without them, teams mistake “worked in a demo” for “works reliably” (OpenAI trace grading, Microsoft Foundry evaluation results).
- Observability: agent systems need logs, traces, and metrics at the tool-call and workflow level, not just API success/failure (Microsoft observability, Google Cloud ADK docs).
- Safety, security, and governance: prompt injection, over-broad permissions, risky MCP integrations, and untracked production agents are operational problems, not optional extras (OpenAI agent safety, OpenAI MCP and connectors, Microsoft governance for AI agents).
- Reliability engineering: retries, timeouts, idempotency, backpressure, cost ceilings, and fallback behavior are where real systems either survive or fail.
The highest-level lesson is simple: learn to build one good single-agent system with strong context, tools, evals, and observability before you chase multi-agent complexity. OpenAI recommends starting with a single agent and evolving only when needed, Google Cloud calls single-agent systems the effective starting point, and Microsoft says to prefer a workflow or even a plain function when that is enough (OpenAI practical guide, Google Cloud architecture guidance, Microsoft Agent Framework overview).
2. Complete topic map
Foundations
- Model capabilities and limits: know what models are good at, where they drift, and what kinds of reasoning, latency, and tool use they support.
- System design basics: queues, retries, idempotency, auth, rate limits, and event-driven design matter as much here as in any backend system.
- Workflow vs agent thinking: understand when a rule-based workflow is enough and when adaptive planning is worth it (OpenAI business leader guide, Microsoft Agent Framework overview).
Single-agent basics
- Agent loop: prompt -> tool decision -> tool result -> next step -> stop condition.
- Tool use fundamentals: function schemas, tool descriptions, result handling, step limits, and failure exits.
- Statefulness basics: learn both stateless requests and session-based or threaded continuations (OpenAI conversation state).
Context engineering
- Prompt structure: clear sections, examples, output formats, and constraints.
- Context budgeting: keep only the highest-signal tokens in play.
- Runtime context loading: decide what to preload and what to fetch just in time.
- Compaction and summarization: maintain coherence over long tasks without dragging full history forever (Anthropic context engineering).
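The compaction bullet above can be sketched in a few lines: keep the most recent turns verbatim and fold everything older into a summary once a token budget is exceeded. `countTokens` and `summarize` are hypothetical stand-ins here; a real system would call a tokenizer and a summarization model.

```typescript
type Turn = { role: "user" | "assistant" | "tool"; content: string };

// Rough heuristic stand-in for a real tokenizer (~4 chars per token).
const countTokens = (text: string): number => Math.ceil(text.length / 4);

function summarize(turns: Turn[]): string {
  // Stand-in for an LLM summarization call.
  return `Summary of ${turns.length} earlier turns.`;
}

function compactHistory(history: Turn[], budget: number, keepRecent = 4): Turn[] {
  const total = history.reduce((n, t) => n + countTokens(t.content), 0);
  if (total <= budget || history.length <= keepRecent) return history;
  // Fold older turns into one summary turn; keep recent turns verbatim.
  const older = history.slice(0, history.length - keepRecent);
  const recent = history.slice(history.length - keepRecent);
  return [{ role: "assistant", content: summarize(older) }, ...recent];
}
```

The design choice to watch is where the cut line sits: summarize too aggressively and the agent loses decisions it already made; too lazily and the budget is wasted on stale turns.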
Memory and retrieval
- Short-term memory: conversation/session state for the current task.
- Long-term memory: cross-session user preferences, work state, and task history.
- Retrieval design: when to preload memories, when to let the model call a retrieval tool, and how to avoid stale or noisy recall.
- Hybrid memory strategies: combine externalized state, retrieval, and note-taking instead of dumping everything into one prompt (Google Cloud Memory Bank, Anthropic context engineering).
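One minimal way to sketch the preload-vs-on-demand tradeoff from the bullets above: pin a small, stable set of memories into every prompt, and expose the larger pool behind a search function the model can call as a retrieval tool. All names here are illustrative, not a real memory API.

```typescript
type Memory = { id: string; text: string; pinned: boolean };

class MemoryStore {
  constructor(private items: Memory[]) {}

  // Small, high-signal set injected into every turn's context.
  preload(): Memory[] {
    return this.items.filter((m) => m.pinned);
  }

  // Larger pool fetched just in time via a retrieval tool call.
  // A real store would use embeddings; substring match keeps the sketch small.
  search(query: string, limit = 3): Memory[] {
    const q = query.toLowerCase();
    return this.items
      .filter((m) => !m.pinned && m.text.toLowerCase().includes(q))
      .slice(0, limit);
  }
}
```

The split keeps the per-turn token cost flat while the memory pool grows, which is the whole point of the hybrid strategy.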
Tooling and integrations
- Tool contract design: make tools self-contained, distinct, ergonomic, and easy for the model to choose correctly.
- Integration patterns: raw function tools, API wrappers, MCP, connectors, agent-as-a-tool, and API gateways.
- Operational integration quality: auth, scopes, rate limits, retries, idempotency, monitoring, and audit logging (Anthropic writing tools, Google Cloud architecture guidance, OpenAI MCP and connectors).
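A tool contract in the sense above can be made explicit as a name, a description written for the model, and input validation that runs before execution. The shape loosely mirrors function-calling schemas, but every identifier here (`ToolDef`, `lookup_order`) is illustrative rather than a real platform API.

```typescript
type ToolDef<Args> = {
  name: string;
  description: string; // written for the model, not for humans
  validate: (raw: unknown) => Args | null;
  run: (args: Args) => Promise<string>;
};

const lookupOrder: ToolDef<{ orderId: string }> = {
  name: "lookup_order",
  description: "Fetch the current status of a single order by its ID.",
  validate: (raw) => {
    const r = raw as { orderId?: unknown } | null;
    return typeof r?.orderId === "string" && r.orderId.length > 0
      ? { orderId: r.orderId }
      : null;
  },
  run: async ({ orderId }) => `status(${orderId}): shipped`, // stand-in for a real API call
};

async function callTool<A>(tool: ToolDef<A>, raw: unknown): Promise<string> {
  // Validate model-produced arguments before touching any real system, and
  // return errors as text the model can recover from instead of crashing.
  const args = tool.validate(raw);
  if (args === null) return `error: invalid arguments for ${tool.name}`;
  return tool.run(args);
}
```

Returning a structured error string instead of throwing is deliberate: it gives the model a chance to self-correct on the next step.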
UI and streaming UX
- Streaming responses: tokens, tool events, partial updates, and step progress.
- Agent-native UI: approvals, inline actions, stateful chat, and structured widgets rather than plain text bubbles.
- Realtime transports: WebSocket and WebRTC matter when latency is part of the product experience (OpenAI AgentKit, OpenAI voice agents, Google Cloud architecture guidance).
Human-in-the-loop workflows
- Approval nodes: when users must confirm reads, writes, purchases, messages, or risky actions.
- Escalation design: define clear handoff points to humans.
- Recovery: allow pause, resume, retry, edit, and override.
- Reviewability: make the agent explain what it wants to do before it does it (OpenAI practical guide, OpenAI agent safety).
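An approval node can start as something this small: classify each proposed tool call as auto-approved or requiring human confirmation before execution. The risky-action list is illustrative; a real policy would also consider scopes, amounts, and targets, not just the tool name.

```typescript
type ProposedAction = { tool: string; args: Record<string, unknown> };

// Illustrative policy: anything with external side effects needs a human.
const REQUIRES_APPROVAL = new Set(["send_email", "issue_refund", "update_ticket"]);

function approvalDecision(action: ProposedAction): "auto" | "needs_human" {
  return REQUIRES_APPROVAL.has(action.tool) ? "needs_human" : "auto";
}
```

Keeping the policy in one pure function makes it trivially testable and gives the reviewability bullet teeth: the agent can show both the proposed action and the decision before anything runs.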
Evaluations and testing
- Deterministic tests: tool wrappers, parsers, policies, serializers, and business logic.
- Scenario tests: multi-step tasks with expected outcomes.
- LLM-as-judge and graders: useful, but should be calibrated and combined with deterministic checks.
- Offline and online evaluation: benchmark before launch, then keep scoring production traces after launch.
- Regression discipline: compare runs statistically, not by “felt better” (OpenAI trace grading, Microsoft Foundry evaluation results, Anthropic writing tools).
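One way to wire the bullets above together: let deterministic validators veto first, and only consult the judge score when the hard checks pass. `judgeScore` here is a stub standing in for an LLM-as-judge call that would be calibrated against human labels.

```typescript
type EvalCase = { input: string; output: string; mustContain: string[] };

// Hard checks: cheap, deterministic, and allowed to veto outright.
function deterministicPass(c: EvalCase): boolean {
  return c.mustContain.every((s) => c.output.includes(s));
}

async function judgeScore(c: EvalCase): Promise<number> {
  // Stand-in for an LLM-as-judge call returning a 0..1 quality score.
  return c.output.length > 0 ? 0.8 : 0;
}

async function gradeCase(c: EvalCase): Promise<{ pass: boolean; score: number }> {
  if (!deterministicPass(c)) return { pass: false, score: 0 }; // veto before judging
  return { pass: true, score: await judgeScore(c) };
}
```

The ordering matters: a judge should never be able to rescue an output that fails a hard requirement, which is the calibration discipline the bullet list calls for.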
Observability and tracing
- Trace every step: model decisions, tool calls, latencies, retries, approvals, failures, and outcomes.
- Centralize logs and metrics: do not bury agent behavior in app logs only.
- Semantic conventions: use OpenTelemetry-style traces when possible.
- Debugging loops: inspect traces to see why the agent chose the wrong tool or went down the wrong branch (Microsoft observability, OpenAI trace grading).
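To make "trace every step" concrete without pulling in a framework, here is a hand-rolled span recorder; a real system would emit OpenTelemetry spans instead, but the shape is the same: every model decision, tool call, and approval becomes a timed span with attributes.

```typescript
type Span = { name: string; attrs: Record<string, unknown>; ms: number };

const spans: Span[] = [];

// Wrap any async step so its name, attributes, and duration are recorded
// even when the wrapped function throws.
async function withSpan<T>(
  name: string,
  attrs: Record<string, unknown>,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    spans.push({ name, attrs, ms: Date.now() - start });
  }
}
```

Usage would look like `withSpan("tool.call", { tool: "search" }, () => runSearch(q))`; wrapping the model call, each tool call, and each approval gate the same way is what turns an opaque agent run into a debuggable trace.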
Safety, security, and governance
- Prompt injection defense: assume external inputs are hostile.
- Least privilege: limit tool scopes, use approvals, validate inputs, and isolate risky actions.
- Secure integrations: prefer trusted MCP servers and review data flows to third parties.
- Organizational governance: inventory agents, centralize logs, tag costs, and define ownership and lifecycle policies (OpenAI agent safety, OpenAI MCP and connectors, Microsoft governance for AI agents).
Reliability engineering
- Failure handling: tool failures, partial failures, retries, dead letters, fallback models, and safe abort paths.
- Operational SLOs: latency, step count, failure rate, escalation rate, and unit economics.
- State correctness: idempotency, deduplication, session consistency, and replay safety.
- Controlled autonomy: constrain what the agent can do when confidence is low or dependencies are flaky.
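The retry/timeout/idempotency bullets can be sketched without any framework. The idempotency key is generated once and reused on every attempt, so a downstream service can deduplicate if a retry fires after a call that actually succeeded; all names and defaults here are illustrative.

```typescript
async function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  try {
    return await Promise.race([p, timeout]);
  } finally {
    clearTimeout(timer); // always clear so the process can exit cleanly
  }
}

async function callWithRetry<T>(
  fn: (idempotencyKey: string) => Promise<T>,
  { attempts = 3, timeoutMs = 5000 } = {},
): Promise<T> {
  // One key for the whole logical request, reused across retries.
  const key = `req-${Date.now()}-${Math.random().toString(36).slice(2)}`;
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await withTimeout(fn(key), timeoutMs);
    } catch (err) {
      lastErr = err; // a real system would also back off and log here
    }
  }
  throw lastErr;
}
```

Note what is missing on purpose: backoff, jitter, and dead-letter handling would all layer onto this same skeleton.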
Multi-agent orchestration
- Real need first: split into multiple agents only when the task genuinely benefits from specialization, isolation, or parallel exploration.
- Delegation patterns: manager-worker, agent-as-tool, orchestrator-router, and Agent2Agent (A2A)-based collaboration.
- Boundary design: each subagent should have a clear role, context budget, toolset, and permission surface.
- Coordination overhead: multi-agent systems add evaluation, security, latency, and cost complexity (Google Cloud architecture guidance, Anthropic subagents).
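The agent-as-tool pattern above can be sketched by giving the orchestrator a specialist behind the same interface as an ordinary tool, so delegation stays explicit and auditable. `researchSpecialist.run` is a stand-in for a full sub-agent loop with its own context budget and permission surface.

```typescript
type SubAgentTool = {
  name: string;
  allowedTools: string[]; // the specialist's own, narrower permission surface
  run: (task: string) => Promise<string>;
};

const researchSpecialist: SubAgentTool = {
  name: "research_specialist",
  allowedTools: ["search_docs", "read_page"],
  run: async (task) => `findings for: ${task}`, // stand-in for the sub-agent loop
};

async function delegate(tool: SubAgentTool, task: string): Promise<string> {
  // The orchestrator only sees the specialist's final report, not its
  // intermediate context, which keeps each agent's context budget isolated.
  return tool.run(task);
}
```

That isolation is the benefit the coordination overhead has to pay for; if the report boundary adds no value, a single agent was the right call.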
Cost and latency optimization
- Prompt caching: cache stable instructions, tools, and long context where your platform supports it.
- Model mix: route cheaper/faster models to simpler work.
- Retrieval discipline: avoid sending large irrelevant context every turn.
- Runtime optimization: use background jobs, streaming, batching, and async tool execution where appropriate (Anthropic prompt caching, OpenAI voice agents).
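Model routing from the list above can start as a single pure function: send short, low-risk requests to a cheaper model and everything else to a stronger one. The model names and the length threshold are placeholders, not real identifiers.

```typescript
type Route = { model: string; reason: string };

function routeRequest(prompt: string, needsTools: boolean): Route {
  // Illustrative policy: cheap model for short, tool-free requests.
  if (!needsTools && prompt.length < 500) {
    return { model: "small-fast-model", reason: "short, no tools" };
  }
  return { model: "large-capable-model", reason: "tools or long context" };
}
```

Keeping routing in one function also makes it evaluable: you can replay a trace set through two routing policies and compare cost and quality before changing production.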
Deployment, lifecycle, and product metrics
- Deployment lifecycle: dev, staging, shadow mode, limited rollout, production, rollback.
- Lifecycle ownership: versioning, approvals, inventory, retirement, and compliance checks.
- Product metrics: task completion, containment, escalation, time to resolution, acceptance rate, CSAT, revenue or cost impact.
- Leadership visibility: agents should earn their place by measurable business impact, not novelty (Microsoft governance for AI agents).
3. What is missing from this topic list?
The starting list is:
- Agents and subagents, stateless and stateful agents
- API wrappers, WebSockets, webhooks
- UI streaming integration, tool-calling UI
- Testing scenarios and LLM-as-judge
- Agent observability, tracing and monitoring
- Agent design optimization
What is already strong
- Agents and subagents, stateless and stateful agents: strong coverage of runtime shape and system topology.
- UI streaming integration and tool-calling UI: strong early signal that you are thinking beyond text generation.
- Testing scenarios and LLM-as-judge: strong start on evaluation.
- Observability, tracing, and monitoring: strong start on production readiness.
What is partially covered
- API wrappers, WebSockets, and webhooks: good on transport mechanics, but not enough on tool contracts, auth scopes, approvals, MCP/connectors, error semantics, or enterprise integration patterns.
- Agent design optimization: good umbrella term, but too broad. Right now it hides several topics that deserve their own lanes: context engineering, cost/latency tuning, tool ergonomics, prompt structure, and reliability improvements.
- Agents and subagents: partially covers orchestration, but it does not automatically cover when not to use multi-agent, which is one of the most important practical judgments (Google Cloud architecture guidance, OpenAI practical guide).
What is missing
- Foundations: model behavior, reasoning limits, workflow-vs-agent selection, and backend systems basics.
- Context engineering: probably the single biggest missing topic.
- Memory and retrieval: short-term state, long-term memory, retrieval design, compaction, note-taking.
- Tooling and integrations as a design discipline: tool ergonomics, naming, overlap reduction, MCP governance, auth, rate limiting, and monitoring.
- Human-in-the-loop flows: approvals, escalation, pause/resume, safe handoff.
- Safety, security, and governance: prompt injection, least privilege, external integrations, agent inventory, cost accountability.
- Reliability engineering: retries, timeouts, fallback behavior, idempotency, failure budgets.
- Cost and latency optimization: caching, retrieval minimization, model routing, async/background work.
- Deployment and lifecycle management: staging, shadow mode, rollout, rollback, versioning, ownership.
- Product and business metrics: whether the agent helps the business, not just whether it produces a plausible answer.
What deserves to be split into its own category
Agent design optimization should be split into:
- context engineering
- tool design
- prompt and output design
- cost and latency optimization
- reliability improvements
Testing scenarios and LLM-as-judge should be split into:
- deterministic testing
- scenario testing
- offline evals
- online evals
- trace grading
UI streaming integration, tool-calling UI should be split into:
- streaming UX
- human approval flows
- realtime transports
- agent-native UI patterns
Agents and subagents should be split into:
- single-agent basics
- long-running agents
- multi-agent orchestration
- inter-agent protocols like Agent2Agent (A2A)
4. Learning roadmap
Phase 0. Foundations
- What to learn:
- the difference between chatbots, workflows, LLM-powered workflow steps, and agents
- core backend ideas: queues, retries, auth, rate limits, idempotency, tracing
- the basic agent loop and stop conditions
- Why it matters:
- if you cannot explain when not to use an agent, you will overbuild
- production agent engineering is applied systems engineering, not a prompt hobby
- Suggested mini-projects:
- build a plain workflow that classifies support tickets
- build a single function-calling assistant and compare it to the workflow
- write a short memo: when is the workflow enough, and when do you need an agent?
Phase 1. Single-agent basics
- What to learn:
- tool schemas
- agent loops
- step limits
- session state
- structured outputs
- Why it matters:
- this is the irreducible core; every framework is abstracting this
- Suggested mini-projects:
- support copilot with 3 tools
- local research agent that can search docs and summarize
- “agent as CLI helper” with one approval step
Phase 2. Context engineering
- What to learn:
- prompt structure
- example selection
- compaction
- just-in-time retrieval
- note-taking and memory patterns
- Why it matters:
- most real agent failures are context failures, not model failures
- Suggested mini-projects:
- compare a naive long prompt with a compacted prompt
- add note-taking memory to a research agent
- build a context budget dashboard for each turn
Phase 3. Memory and retrieval
- What to learn:
- short-term state vs long-term memory
- memory preload vs tool-triggered retrieval
- retrieval latency and stale context tradeoffs
- Why it matters:
- stateful agents only work when memory is precise, cheap, and retrievable
- Suggested mini-projects:
- personal preference assistant with cross-session recall
- issue triage agent that remembers prior decisions
- retrieval benchmark: compare preload-all vs retrieve-on-demand
Phase 4. Tooling and integrations
- What to learn:
- designing LLM-friendly tools
- MCP/connectors
- auth scopes
- error handling
- monitoring and logging for tool calls
- Why it matters:
- bad tools make good models look bad
- Suggested mini-projects:
- wrap a real third-party API as tools
- refactor overlapping tools into clearer, higher-signal tools
- add approval policies for sensitive tool calls
Phase 5. UI and human-in-the-loop
- What to learn:
- streaming
- progress updates
- approval UX
- fallback to humans
- asynchronous/background jobs
- Why it matters:
- agent products succeed or fail at the human interface, not only in backend traces
- Suggested mini-projects:
- chat UI with streaming tool status
- approval flow for email sending or ticket updates
- long-running job that can pause, resume, and notify
Phase 6. Evaluations and testing
- What to learn:
- golden sets
- deterministic validators
- LLM-as-judge
- trace grading
- regression comparisons
- Why it matters:
- without evals, you cannot improve safely
- Suggested mini-projects:
- create a 50-case dataset for your support agent
- grade full traces, not just final answers
- compare two prompt or tool versions statistically
Phase 7. Observability, safety, and reliability
- What to learn:
- OpenTelemetry-style traces
- centralized logs
- token and cost metrics
- prompt injection defenses
- least privilege
- retries, timeouts, idempotency
- Why it matters:
- this is the boundary between “cool demo” and “safe to run for customers”
- Suggested mini-projects:
- instrument your agent with traces and dashboards
- simulate tool outages and verify fallback behavior
- add content filtering, approvals, and scope-limited credentials
Phase 8. Deployment, cost, and lifecycle
- What to learn:
- rollout strategies
- staging and shadow mode
- prompt caching
- cost controls
- versioning
- ownership and retirement
- Why it matters:
- production quality is maintained operationally, not only coded once
- Suggested mini-projects:
- deploy an agent with feature flags and rollback
- add token and latency budgets per request
- create an ops dashboard with business KPIs
Phase 9. Multi-agent and long-running systems
- What to learn:
- manager-worker patterns
- subagents
- A2A
- long-horizon context strategies
- specialized permissions and isolation
- Why it matters:
- this is useful only after you can run a strong single-agent system
- Suggested mini-projects:
- research orchestrator with one planner and two specialists
- background incident analysis agent with resumable state
- multi-agent comparison harness measuring whether specialization actually helps
5. 12–16 week study plan
This is a 14-week plan. It is aggressive, but realistic for a working software engineer who wants to actually build.
Week 1
- Focus areas: workflows vs agents, basic agent loop, system design basics
- Deliverables: one-page architecture notes and a toy single-agent loop
- What to read: OpenAI business leader guide, Microsoft Agent Framework overview
- What to build: one workflow and one agent solving the same task
Week 2
- Focus areas: tool schemas, stop conditions, structured outputs, session state
- Deliverables: a single-agent assistant with 2-3 tools
- What to read: OpenAI practical guide, Anthropic writing tools
- What to build: support copilot with calendar/docs/search tools
Week 3
- Focus areas: context engineering basics, prompt structure, examples, output constraints
- Deliverables: before/after prompt and context experiments
- What to read: Anthropic context engineering
- What to build: prompt lab with a simple scorecard
Week 4
- Focus areas: compaction, context budgets, note-taking
- Deliverables: compaction strategy and long-task notes format
- What to read: Anthropic context engineering, OpenAI conversation state
- What to build: long-running research agent that summarizes itself every N steps
Week 5
- Focus areas: short-term memory, long-term memory, preload vs on-demand retrieval
- Deliverables: memory-enabled agent with cross-session recall
- What to read: Google Cloud Memory Bank, Google Cloud architecture guidance
- What to build: preference-aware assistant with long-term memory
Week 6
- Focus areas: tool ergonomics, auth, rate limits, retries, observability at the integration boundary
- Deliverables: one polished tool package with logs and failure handling
- What to read: Anthropic writing tools, OpenAI MCP and connectors
- What to build: tool wrapper for a real SaaS API with approval and retry logic
Week 7
- Focus areas: streaming UX, progress events, agent-native UI
- Deliverables: streaming interface with tool-progress timeline
- What to read: OpenAI AgentKit, Google Cloud architecture guidance
- What to build: chat UI that shows tool state, not just final text
Week 8
- Focus areas: human approvals, escalation, pause/resume
- Deliverables: approval policy matrix and HITL flow
- What to read: OpenAI practical guide, OpenAI agent safety
- What to build: agent that drafts an email or ticket update but requires human approval to send
Week 9
- Focus areas: deterministic tests, datasets, scenario tests
- Deliverables: 30-50 case eval dataset
- What to read: OpenAI agent evals, Anthropic writing tools
- What to build: eval harness for your existing single-agent project
Week 10
- Focus areas: trace grading, regression analysis, run comparison
- Deliverables: a grading report comparing two prompt/tool versions
- What to read: OpenAI trace grading, Microsoft Foundry evaluation results
- What to build: trace grader that finds where the agent fails, not just whether it fails
Week 11
- Focus areas: tracing, logs, metrics, dashboards
- Deliverables: trace explorer and cost/latency dashboard
- What to read: Microsoft observability, Google Cloud ADK docs
- What to build: OpenTelemetry instrumentation for model, tool, and approval steps
Week 12
- Focus areas: prompt injection, least privilege, secure external integrations, governance
- Deliverables: threat model for your agent
- What to read: OpenAI agent safety, Microsoft governance for AI agents
- What to build: hardened version of your project with scoped credentials and approval gates
Week 13
- Focus areas: cost and latency optimization, caching, async/background work
- Deliverables: cost budget and latency budget per task type
- What to read: Anthropic prompt caching, OpenAI voice agents
- What to build: optimized runtime with cache-aware prompts and background execution for long tasks
Week 14
- Focus areas: multi-agent only where justified, delegation, A2A, specialized permissions
- Deliverables: side-by-side comparison of single-agent vs multi-agent performance
- What to read: Google Cloud architecture guidance, Anthropic subagents, Microsoft Copilot Studio A2A
- What to build: planner + specialist research system, then measure whether it actually beats the single-agent baseline
6. Portfolio projects
Beginner project: Support Copilot
- Description: a customer support assistant that can read policy docs, look up order status, draft replies, and ask for approval before sending.
- Required components:
- single-agent loop
- 3-4 tools
- session state
- streaming UI
- approval step
- basic eval dataset
- What skills it proves:
- single-agent fundamentals
- tool integration
- human-in-the-loop design
- basic evaluation discipline
Intermediate project: Stateful Research Analyst
- Description: a research agent that works across sessions, remembers preferences, loads relevant docs or notes on demand, and produces structured reports.
- Required components:
- short-term and long-term memory
- retrieval and note-taking
- compaction
- trace logging
- offline evals and trace grading
- cost and latency dashboard
- What skills it proves:
- context engineering
- memory and retrieval design
- observability
- evaluation at the workflow level
Advanced project: Production-style Incident or Operations Agent
- Description: an internal agent that investigates incidents or operational anomalies, gathers logs and telemetry, proposes actions, and routes risky actions through approvals. It can run long tasks in the background and delegate focused subtasks to specialists.
- Required components:
- workflow orchestration
- background execution
- specialist subagents
- approvals and escalation
- robust retries/timeouts/idempotency
- security controls and cost tagging
- business KPI dashboard
- What skills it proves:
- production operations thinking
- long-running agent design
- multi-agent tradeoff judgment
- governance, reliability, and lifecycle management
7. Recommended study order
This is the practical order I would recommend:
- Learn when not to use an agent.
- Build one single-agent loop with a few tools.
- Learn context engineering before chasing bigger architectures.
- Add state, memory, and retrieval.
- Improve tool design and integration quality.
- Add streaming UI and human approvals.
- Build evals before expanding scope.
- Add tracing, metrics, and dashboards.
- Harden security, governance, and reliability.
- Optimize cost and latency.
- Only then experiment with long-running and multi-agent systems.
Why this order works:
- It front-loads judgment, not just implementation.
- It teaches the core failure modes early.
- It avoids the most common mistake in modern agent work: jumping to multi-agent before mastering single-agent quality, context, and evals (OpenAI practical guide, Google Cloud architecture guidance).
8. Best resources by topic
Foundations
- OpenAI business leader guide
- Microsoft Agent Framework overview
Single-agent basics
- OpenAI practical guide
- OpenAI conversation state
Context engineering
- Anthropic context engineering
- OpenAI conversation state
Memory and retrieval
- Google Cloud Memory Bank
- Google Cloud architecture guidance
- Anthropic context engineering
Tooling and integrations
- Anthropic writing tools
- OpenAI MCP and connectors
- Google Cloud architecture guidance
UI and streaming UX
- OpenAI AgentKit
- OpenAI voice agents
- Google Cloud architecture guidance
Human-in-the-loop workflows
- OpenAI practical guide
- OpenAI agent safety
Evaluations and testing
- OpenAI agent evals
- OpenAI trace grading
- Microsoft Foundry evaluation results
Observability and tracing
- Microsoft observability
- Google Cloud ADK docs
- OpenAI trace grading
Safety, security, and governance
- OpenAI agent safety
- OpenAI MCP and connectors
- Microsoft governance for AI agents
- Microsoft AI platform governance
Reliability engineering
Multi-agent orchestration
- Google Cloud architecture guidance
- Anthropic subagents
- Microsoft Copilot Studio A2A
Cost and latency optimization
- Anthropic prompt caching
- OpenAI voice agents
Deployment, lifecycle, and product metrics
- Microsoft governance for AI agents
9. Final conclusion
The biggest lessons are not glamorous, but they are what matter:
- Start simple. A strong single-agent system beats a premature multi-agent design most of the time.
- Context engineering matters more than most teams expect. Prompting is only one part of the problem.
- Evals and observability are essential. If you cannot measure behavior and inspect traces, you cannot improve safely.
- Security and reliability are underestimated. Prompt injection, over-broad permissions, missing approvals, retries, and fallback behavior are first-class engineering work.
- Multi-agent is not the first step. It is an advanced optimization for specialization, isolation, or parallel exploration, not a default architecture.
If I were guiding a team from zero, I would insist on this path:
- Build one narrow agent that solves one expensive workflow.
- Give it a few excellent tools, not many mediocre ones.
- Add evals and tracing before adding more autonomy.
- Add memory only when the product truly needs state across turns or sessions.
- Add multi-agent only after you can prove the single-agent baseline is understood, measured, and limited.
That is how you move from “we built a cool demo” to “we know how to ship agents that help the business.”
Sources
OpenAI
- A practical guide to building agents
- A business leader’s guide to working with agents
- Introducing AgentKit
- Agent evals
- Trace grading
- Conversation state
- Safety in building agents
- MCP and connectors
- Voice agents
Anthropic
- Effective context engineering for AI agents
- Writing effective tools for AI agents
- Prompt caching
- Reduce hallucinations
- Create custom subagents
Google Cloud
- Choose your agentic AI architecture components
- Develop an Agent Development Kit agent
- Quickstart with Agent Development Kit Memory Bank