You already have the perfect publish target for this kind of agentic system: an Astro app where a new markdown file in content/articles/ is enough to create a new page. That means the hard part is not rendering. The hard part is building a reliable background pipeline that can research, draft, review, write, commit, and open a pull request without turning your repo into a hallucination machine.
This article shows how I would build that pipeline with Agno for agent/workflow orchestration, AgentOS for runtime and scheduling, and OpenRouter for model access. The architecture matches a real-world split of model responsibilities:
- Draft articles: google/gemini-3-flash-preview
- Code + PR + workflow shaping: z-ai/glm-5
- Final review / critical reasoning: anthropic/claude-4.6-sonnet or anthropic/claude-4.6-opus
As of March 20, 2026, those model identifiers are available in current vendor docs or OpenRouter model pages. The exact slugs may evolve, so treat them as a dated snapshot rather than eternal constants.
TL;DR
- Use Agno Workflow for the main publishing pipeline, not a single free-form agent.
- Run it through AgentOS so it can execute in the background with sessions, APIs, and schedules.
- Use OpenRouter to split model roles cleanly: Gemini 3 Flash for research, GLM-5 for packaging/PR prep, Claude 4.6 Sonnet or Opus for final review.
- Write the final artifact into content/articles/*.md, because Astro already auto-discovers new markdown articles in this repo.
- Add LangWatch for tracing, evaluations, and Scenario-based regression tests so the system is observable and reliable, not just autonomous.
- Keep irreversible actions deterministic: let models propose content, but let Python write files, run git, and open the PR.
The Goal
We want an app that does this:
1. Receive a topic or scheduled content brief
2. Research the web and your internal context
3. Build an Astro-ready markdown article
4. Review it critically before publication
5. Write a new file into content/articles/
6. Create a branch, commit, and open a PR
7. Do all of it in the background, usually within a few minutes
In this repo, the publish directory is content/articles/, not just articles/, so the example below targets the real structure you already have.
Why This Structure Works
This problem is not just “run one agent.” It is a repeatable pipeline with state, checkpoints, file system side effects, and git operations. That makes Agno Workflow a better fit than a single free-form agent.
Agno Workflows are built for defined steps whose output flows to the next step. AgentOS then gives that workflow a production runtime, API surface, sessions, and scheduling. OpenRouter gives us one gateway for multiple model families without rewriting the orchestration code each time.
The split of responsibilities looks like this:
Agno
├─ defines agents, schemas, and workflow steps
├─ enforces predictable sequencing
└─ makes it easy to mix model steps and Python executor steps
AgentOS
├─ serves the workflow as an API
├─ stores run/session history
├─ exposes schedules and run management
└─ makes background automation operational
OpenRouter
├─ routes requests to Gemini / GLM / Claude
├─ normalizes tool-calling and request shape
├─ supports provider controls and fallbacks
└─ keeps model switching mostly configuration-level
Astro
├─ needs only a markdown file in content/articles/
├─ auto-discovers the article
└─ turns git changes into a previewable site via PR
The Reference File We Are Studying
To make this teachable, I am going to center the explanation on one reference file:
article_publisher.py
Its purpose is simple:
Take a content request in,
run a multi-model workflow,
emit one Astro markdown file,
then create a GitHub PR.
Here is the high-level file structure:
from __future__ import annotations
import os
import re
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Literal
from uuid import uuid4
from agno.agent import Agent
from agno.db.sqlite import SqliteDb
from agno.models.openrouter import OpenRouterResponses
from agno.os import AgentOS
from agno.tools.duckduckgo import DuckDuckGoTools
from agno.workflow import Step, Workflow
from agno.workflow.types import StepInput, StepOutput
from pydantic import BaseModel, Field
At a glance, the file has five responsibilities:
- Declare typed schemas for every stage.
- Configure model-specific agents through OpenRouter.
- Define deterministic workflow steps.
- Perform file system and git side effects in Python executor steps.
- Serve and schedule the workflow through AgentOS.
Critical Code Section 1: Typed Contracts Between Steps
The most important architectural choice is not the prompt. It is the schema boundary between steps.
class ResearchPacket(BaseModel):
    topic: str
    angle: str = Field(description="Why this angle is interesting now")
    audience: str
    key_claims: list[str]
    sources: list[str] = Field(description="High quality source URLs")
    outline: list[str]
    risks: list[str] = Field(description="Areas that need verification")

class AstroArticlePackage(BaseModel):
    title: str
    excerpt: str
    tags: list[str]
    slug: str
    seo_title: str
    seo_description: str
    body_markdown: str
    commit_message: str
    pr_title: str
    pr_body: str

class ReviewDecision(BaseModel):
    publish_ready: bool
    blocking_issues: list[str]
    revision_notes: list[str]
    final_title: str | None = None
    final_excerpt: str | None = None
Why this matters:
- It prevents every step from guessing what the previous step meant.
- It makes failures visible earlier.
- It lets you stop before touching git if quality is not high enough.
- It keeps the workflow explainable when you revisit a failed run a day later.
Without schemas, your pipeline becomes “some markdown-ish text went into another prompt and now we hope for the best.”
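Because these contracts are ordinary Pydantic models, the boundary is enforced by validation, not by prompt discipline. A minimal sketch (using a trimmed-down stand-in for the ReviewDecision model above) shows how a malformed payload fails loudly at the boundary instead of flowing downstream:

```python
from pydantic import BaseModel, ValidationError

# Trimmed stand-in for the ReviewDecision contract above.
class ReviewDecision(BaseModel):
    publish_ready: bool
    blocking_issues: list[str]

# A well-formed payload parses cleanly into a typed object.
ok = ReviewDecision.model_validate(
    {"publish_ready": False, "blocking_issues": ["unsupported claim in intro"]}
)
assert not ok.publish_ready

# A malformed payload (wrong type, missing field) raises immediately,
# so the workflow can stop before any git side effects.
try:
    ReviewDecision.model_validate({"publish_ready": "maybe"})
except ValidationError as exc:
    print(f"blocked at the schema boundary: {exc.error_count()} errors")
```

This is the whole value proposition of `output_schema`: a bad model response becomes a typed error at a named step, not a mystery three steps later.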
Critical Code Section 2: Model Specialization Through OpenRouter
This is where the ideal real-world setup becomes concrete.
def openrouter_model(primary: str, fallbacks: list[str] | None = None) -> OpenRouterResponses:
    models = [primary, *(fallbacks or [])]
    return OpenRouterResponses(
        id=primary,
        models=models,
        app_name="astro-content-factory",
    )

research_agent = Agent(
    name="Research Drafter",
    model=openrouter_model(
        "google/gemini-3-flash-preview",
        ["google/gemini-2.5-flash"],
    ),
    tools=[DuckDuckGoTools()],
    output_schema=ResearchPacket,
    instructions="""
    Research the topic with current web evidence.
    Prefer primary sources.
    Return a tight article angle, source list, outline, and explicit uncertainty notes.
    """,
)

packaging_agent = Agent(
    name="Astro Packager",
    model=openrouter_model(
        "z-ai/glm-5",
        ["z-ai/glm-5-turbo"],
    ),
    output_schema=AstroArticlePackage,
    instructions="""
    Convert the research packet into a production-ready Astro markdown article package.
    The output must be repo-ready: title, excerpt, tags, slug, SEO fields, markdown body,
    commit message, PR title, and PR body.
    """,
)

review_agent = Agent(
    name="Critical Reviewer",
    model=openrouter_model(
        "anthropic/claude-4.6-sonnet",
        ["anthropic/claude-4.6-opus"],
    ),
    output_schema=ReviewDecision,
    instructions="""
    Review the article package like a demanding editor.
    Block publication if the claims are weak, unsupported, repetitive, or misleading.
    Approve only if the article is publishable with minimal editorial risk.
    """,
)
This is the exact pattern I like:
- Gemini 3 Flash does the high-throughput research synthesis.
- GLM-5 produces the repo-aware package and PR metadata.
- Claude 4.6 Sonnet or Opus plays the hard-nosed editor.
That is not just model fan fiction. It encodes a practical separation of labor:
- cheap/fast model for exploration
- strong engineering model for structured package generation
- expensive/reliable model for final judgment
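If you want to see what the Agno wrapper is doing for you underneath, the same role split maps onto OpenRouter's OpenAI-compatible request body, where a `models` list provides ordered fallback routing. This is a hedged sketch of request construction only (the helper name is mine, and you would POST the payload to https://openrouter.ai/api/v1/chat/completions with your API key):

```python
def openrouter_payload(primary: str, fallbacks: list[str], prompt: str) -> dict:
    """Build an OpenRouter chat request with fallback routing.

    OpenRouter accepts an OpenAI-style body plus a "models" list: if the
    primary slug errors or is unavailable, the request falls through to
    the next slug in order. Helper name and shape are illustrative.
    """
    return {
        "model": primary,
        "models": [primary, *fallbacks],
        "messages": [{"role": "user", "content": prompt}],
    }

# The reviewer role keeps Opus as the escalation path.
payload = openrouter_payload(
    "anthropic/claude-4.6-sonnet",
    ["anthropic/claude-4.6-opus"],
    "Review this draft like a demanding editor.",
)
assert payload["models"] == ["anthropic/claude-4.6-sonnet", "anthropic/claude-4.6-opus"]
```

The practical payoff: swapping a role's model is a one-line slug change, not an orchestration rewrite.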
Critical Code Section 3: Deterministic Workflow Steps
The core pipeline should be a Workflow, not a chat loop.
def normalize_request(step_input: StepInput) -> StepOutput:
    raw = str(step_input.input or "").strip()
    if not raw:
        return StepOutput(
            step_name="normalize_request",
            content="Missing topic",
            stop=True,
        )
    return StepOutput(
        step_name="normalize_request",
        content={
            "topic": raw,
            "requested_at": datetime.now(timezone.utc).isoformat(),
        },
    )

def quality_gate(step_input: StepInput) -> StepOutput:
    review = step_input.previous_step_content
    if not isinstance(review, ReviewDecision):
        return StepOutput(
            step_name="quality_gate",
            content="Review step did not return ReviewDecision",
            stop=True,
        )
    if not review.publish_ready:
        return StepOutput(
            step_name="quality_gate",
            content={
                "status": "blocked",
                "issues": review.blocking_issues,
                "notes": review.revision_notes,
            },
            stop=True,
        )
    return StepOutput(
        step_name="quality_gate",
        content={"status": "approved"},
    )
workflow = Workflow(
    name="astro_article_publisher",
    db=SqliteDb(db_file="tmp/article_publisher.db"),
    add_workflow_history_to_steps=True,
    num_history_runs=10,
    steps=[
        Step(name="normalize_request", executor=normalize_request),
        Step(name="research", agent=research_agent),
        Step(name="package_for_astro", agent=packaging_agent),
        Step(name="review", agent=review_agent),
        Step(name="quality_gate", executor=quality_gate),
        Step(name="write_article", executor=write_article_file),
        Step(name="open_pull_request", executor=open_pull_request),
    ],
)
This section is the heart of the system.
Why?
- It makes the flow auditable.
- It gives each side effect a named checkpoint.
- It separates “thinking” steps from “do irreversible stuff” steps.
- It lets you stop safely before file writes and PR creation.
Notice the order:
research -> package -> review -> gate -> write -> PR
That is deliberate. The git side effects happen after critical review, not before.
Critical Code Section 4: Writing the Astro Markdown File Safely
This is where the abstract workflow becomes a publishing system.
REPO_ROOT = Path(os.environ["ASTRO_REPO_ROOT"]).resolve()
ARTICLE_DIR = REPO_ROOT / "content" / "articles"

def slugify(value: str) -> str:
    normalized = re.sub(r"[^a-zA-Z0-9]+", "-", value.lower()).strip("-")
    return re.sub(r"-{2,}", "-", normalized)

def write_article_file(step_input: StepInput) -> StepOutput:
    package = step_input.get_step_content("package_for_astro")
    review = step_input.get_step_content("review")
    if not isinstance(package, AstroArticlePackage):
        raise ValueError("package_for_astro must return AstroArticlePackage")
    slug = slugify(package.slug or package.title)
    article_path = ARTICLE_DIR / f"{slug}.md"
    if article_path.exists():
        raise FileExistsError(f"Refusing to overwrite existing article: {article_path}")
    final_title = (
        review.final_title
        if isinstance(review, ReviewDecision) and review.final_title
        else package.title
    )
    final_excerpt = (
        review.final_excerpt
        if isinstance(review, ReviewDecision) and review.final_excerpt
        else package.excerpt
    )
    markdown = f'''---
title: "{final_title}"
excerpt: "{final_excerpt}"
tags: {package.tags!r}
date: "{datetime.now().date().isoformat()}"
seoTitle: "{package.seo_title}"
seoDescription: "{package.seo_description}"
canonical: "https://luismori.dev/article/{slug}"
---
{package.body_markdown}
'''
    article_path.write_text(markdown, encoding="utf-8")
    return StepOutput(
        step_name="write_article",
        content={
            "status": "written",
            "slug": slug,
            "path": str(article_path),
        },
    )
This function is deceptively simple. It carries several important production ideas:
- Idempotency: do not silently overwrite an existing slug.
- Separation of concerns: the model proposes content, Python performs the file write.
- Last-mile normalization: final slug and final title are resolved outside the model.
- Repo-specific correctness: it writes to content/articles/, which is what this Astro app actually loads.
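One sharp edge worth flagging: the f-string frontmatter above breaks silently if a model-proposed title or excerpt contains a double quote. A small hedged hardening sketch (helper names are mine, not part of the reference file) escapes every model-controlled string before it is interpolated into YAML:

```python
def yaml_quote(value: str) -> str:
    # Minimal escaping for a YAML double-quoted scalar:
    # backslashes first, then embedded double quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def render_frontmatter(title: str, excerpt: str, slug: str) -> str:
    # Hypothetical helper: same shape as the f-string frontmatter above,
    # but every model-controlled field goes through yaml_quote first.
    return (
        "---\n"
        f"title: {yaml_quote(title)}\n"
        f"excerpt: {yaml_quote(excerpt)}\n"
        f"canonical: {yaml_quote(f'https://luismori.dev/article/{slug}')}\n"
        "---\n"
    )

fm = render_frontmatter('The "Dead Framework" Myth', "Why X is not dead", "dead-framework-myth")
assert 'title: "The \\"Dead Framework\\" Myth"' in fm
```

A stricter version would parse the rendered frontmatter back with a YAML library as a deterministic validator before the file write, which is exactly the kind of check the review step cannot be trusted to do.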
Critical Code Section 5: Branch, Commit, and PR Automation
This is the point where many demos become fake. They generate markdown, but a human still has to do the git work. If you want true background publishing, the workflow must finish the job.
def run(cmd: list[str], cwd: Path) -> str:
    completed = subprocess.run(
        cmd,
        cwd=cwd,
        check=True,
        text=True,
        capture_output=True,
    )
    return completed.stdout.strip()

def open_pull_request(step_input: StepInput) -> StepOutput:
    package = step_input.get_step_content("package_for_astro")
    write_result = step_input.get_step_content("write_article")
    if not isinstance(package, AstroArticlePackage):
        raise ValueError("package_for_astro must return AstroArticlePackage")
    branch = f"codex/article-{uuid4().hex[:8]}"
    run(["git", "checkout", "-b", branch], cwd=REPO_ROOT)
    run(["git", "add", str(write_result["path"])], cwd=REPO_ROOT)
    run(["git", "commit", "-m", package.commit_message], cwd=REPO_ROOT)
    run(["git", "push", "-u", "origin", branch], cwd=REPO_ROOT)
    pr_url = run(
        [
            "gh",
            "pr",
            "create",
            "--base",
            "main",
            "--title",
            package.pr_title,
            "--body",
            package.pr_body,
        ],
        cwd=REPO_ROOT,
    )
    return StepOutput(
        step_name="open_pull_request",
        content={"branch": branch, "pr_url": pr_url},
    )
This is where GLM-5 earns its keep. The model is not executing git. Python is. But GLM-5 is supplying the structured metadata that makes the git layer clean and deterministic: commit message, PR title, PR body, and repo-shaped markdown.
Critical Code Section 6: Serving the Workflow with AgentOS
Now we make it operational.
agent_os = AgentOS(
    id="astro-content-factory",
    workflows=[workflow],
    scheduler=True,
)
app = agent_os.get_app()

if __name__ == "__main__":
    agent_os.serve(app=app, port=7777, reload=True)
This small section changes the whole system:
- your workflow becomes an API
- sessions and run history can be persisted
- schedules can trigger it
- other services can call it remotely
That is the difference between “a Python script that writes markdown” and “an agent app.”
The Control Flow Diagram
┌──────────────────────┐
│ Topic / Content Brief│
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ normalize_request │
│ validate + timestamp │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ research │
│ Gemini 3 Flash │
│ + web tools │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ package_for_astro │
│ GLM-5 │
│ markdown + PR data │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ review │
│ Claude 4.6 │
│ approve / block │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ quality_gate │
│ stop if weak │
└───────┬────────┬─────┘
│ │
blocked approved
│ │
▼ ▼
run ends write_article
│
▼
open_pull_request
│
▼
branch + PR URL
The Data Flow Diagram
Raw topic
│
▼
{ topic, requested_at }
│
▼
ResearchPacket
├─ angle
├─ sources
├─ key_claims
├─ outline
└─ risks
│
▼
AstroArticlePackage
├─ title
├─ excerpt
├─ tags
├─ slug
├─ seo_title
├─ seo_description
├─ body_markdown
├─ commit_message
├─ pr_title
└─ pr_body
│
▼
ReviewDecision
├─ publish_ready
├─ blocking_issues
├─ revision_notes
├─ final_title?
└─ final_excerpt?
│
▼
File write result
├─ slug
├─ path
└─ status
│
▼
PR result
├─ branch
└─ pr_url
The Component Relationship Diagram
┌─────────────────────────────┐
│ AgentOS │
│ API + sessions + schedules │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Agno Workflow │
│ astro_article_publisher │
└──────┬───────────┬──────────┘
│ │
│ ▼
│ Python Executors
│ ├─ normalize_request
│ ├─ quality_gate
│ ├─ write_article_file
│ └─ open_pull_request
│
▼
Agno Agents
├─ research_agent
├─ packaging_agent
└─ review_agent
│
▼
OpenRouter
├─ Gemini 3 Flash Preview
├─ GLM-5
└─ Claude 4.6 Sonnet / Opus
│
▼
External Systems
├─ Web search
├─ Astro repo filesystem
└─ GitHub CLI / PR API
How to Run It in the Background
This is the most important real-world distinction in the whole article.
If your goal is “all in background and in minutes automatically,” do not make the core publish pipeline a background hook.
Background hooks in AgentOS are great for:
- analytics
- notifications
- evaluations
- webhooks
They are not the right primitive for the main publication path because the docs explicitly note that background hooks run after the response is sent and cannot modify the request or response. They are for non-critical side work, not for the main artifact-creation path.
For the main job, use one of these patterns:
- AgentOS schedule
- An API-triggered workflow run
- An external queue or cron that calls the workflow endpoint
The cleanest version for content automation is:
Scheduler/cron
-> POST AgentOS workflow endpoint
-> workflow runs
-> article file is written
-> PR is opened
-> run history is stored
Example: schedule the workflow
AgentOS exposes schedule management APIs, so a separate bootstrap script can create a recurring job:
import httpx

client = httpx.Client(base_url="http://localhost:7777", timeout=30)
client.post(
    "/schedules",
    json={
        "name": "daily-article-run",
        "cron_expr": "0 9 * * 1-5",
        "endpoint": "/workflows/astro_article_publisher/runs",
        "method": "POST",
        "payload": {"message": "Research and draft one article about practical AI engineering"},
        "timezone": "America/Lima",
        "max_retries": 2,
        "retry_delay_seconds": 60,
    },
)
That gives you recurring runs without turning the publish logic into an awkward side effect of some unrelated request.
Add LangWatch for Monitoring, Observability, and Reliability
Autonomy without observability is just faster failure.
If this pipeline is going to run in the background and open PRs by itself, you need a reliability layer that answers:
- Which step failed?
- Which model produced the bad output?
- Which source list was used?
- Did quality regress after a prompt or model change?
- Are latency, token cost, and pass rate getting worse over time?
This is where LangWatch fits. LangWatch’s current docs position it as an observability and tracing platform built on OpenTelemetry, with SDKs for Python, TypeScript, and Go, plus agent simulations through Scenario.
The simplest mental model is:
AgentOS workflow runs the pipeline
│
▼
LangWatch trace wraps the full run
│
├─ span: normalize_request
├─ span: research
├─ span: package_for_astro
├─ span: review
├─ span: write_article
└─ span: open_pull_request
│
▼
Evaluations + dashboards + alerts + scenario history
Critical Code Section 7: Instrument the Workflow with LangWatch
The current LangWatch Python guide recommends calling langwatch.setup() early, then using traces and spans to capture end-to-end operations and nested steps.
import langwatch

langwatch.setup()

@langwatch.trace(name="astro_article_publish_run")
def run_publish_pipeline(topic: str, session_id: str, user_id: str = "content-bot"):
    current_trace = langwatch.get_current_trace()
    current_trace.update(
        metadata={
            "thread_id": session_id,
            "user_id": user_id,
            "workflow_name": "astro_article_publisher",
            "topic": topic,
        }
    )
    return workflow.run(input=topic, session_id=session_id)
This gives you a top-level trace for the whole publication attempt. Under that trace, each important function can be decorated as a span:
@langwatch.span(name="write_article_file")
def write_article_file(step_input: StepInput) -> StepOutput:
    package = step_input.get_step_content("package_for_astro")
    article_path = ARTICLE_DIR / f"{slugify(package.slug or package.title)}.md"
    markdown = render_markdown(package)
    article_path.write_text(markdown, encoding="utf-8")
    langwatch.get_current_span().update(
        input={"slug": package.slug, "title": package.title},
        output={"path": str(article_path), "status": "written"},
    )
    return StepOutput(
        step_name="write_article",
        content={"status": "written", "path": str(article_path)},
    )
That span-level instrumentation matters because when a run goes wrong, you usually do not care that “the workflow failed.” You care that:
- the research step used weak sources
- the review step blocked the article
- the git step failed after branch creation
LangWatch traces make those failures inspectable rather than mysterious.
Critical Code Section 8: Attach Evaluations to the Quality Gate
LangWatch’s current evaluation docs describe three useful mechanisms:
- client-side custom evaluations via add_evaluation()
- server-side managed evaluations
- guardrails that can influence application flow
For this publishing pipeline, the quality_gate step is the natural place to log these.
@langwatch.span(name="quality_gate")
def quality_gate(step_input: StepInput) -> StepOutput:
    review = step_input.previous_step_content
    if not isinstance(review, ReviewDecision):
        langwatch.get_current_span().add_evaluation(
            name="review-contract-valid",
            passed=False,
            details="Review step did not return ReviewDecision",
            is_guardrail=True,
        )
        return StepOutput(step_name="quality_gate", content="Invalid review payload", stop=True)
    passed = review.publish_ready and not review.blocking_issues
    langwatch.get_current_span().add_evaluation(
        name="publish-ready",
        passed=passed,
        details={
            "blocking_issues": review.blocking_issues,
            "revision_notes": review.revision_notes,
        },
        is_guardrail=True,
    )
    if not passed:
        return StepOutput(
            step_name="quality_gate",
            content={"status": "blocked", "issues": review.blocking_issues},
            stop=True,
        )
    return StepOutput(step_name="quality_gate", content={"status": "approved"})
This is a high-leverage pattern:
- the workflow still decides whether to continue
- LangWatch records the decision as structured evaluation data
- you can track pass/fail rate over time across runs, prompts, or model swaps
That transforms “Claude said this article was weak” into an observable metric.
LangWatch Reliability Architecture
Scheduler / API Trigger
│
▼
AgentOS workflow run
│
▼
LangWatch trace
├─ metadata: topic, session_id, user_id, workflow_name
├─ span: research
│ └─ evals: source_quality, freshness_check
├─ span: package_for_astro
│ └─ evals: schema_valid, frontmatter_valid
├─ span: review
│ └─ evals: publish_ready, citation_risk
├─ span: write_article
│ └─ evals: file_written
└─ span: open_pull_request
└─ evals: pr_opened
│
▼
Dashboard views
├─ pass rate by topic type
├─ failure rate by step
├─ latency by model
├─ token cost by run
└─ regression history by prompt version
Test-Based Scenarios with LangWatch Scenario
Observability tells you what happened in production. Scenario tests help you prevent regressions before production.
LangWatch’s Scenario framework is simulation-based testing for agents. Instead of writing fragile input-output assertions, you describe a realistic situation, let a simulated user interact with your agent, and judge the behavior at checkpoints during the conversation.
That is a much better fit for this content pipeline because reliability is not just about string equality. It is about whether the system:
- asks for clarification when the topic is too vague
- cites current, credible sources
- avoids duplicate articles
- blocks publication when the review is negative
- never opens a PR for a rejected draft
Critical Code Section 9: Wrap the Publishing App as a Scenario Adapter
Scenario works through an AgentAdapter interface with a call() method. For this workflow, the adapter can invoke the AgentOS workflow endpoint or call the local workflow.run() directly in tests.
import scenario

class ArticlePublisherAdapter(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput):
        user_message = input.messages[-1]["content"]
        result = workflow.run(input=user_message, session_id="scenario-test")
        if result.content and isinstance(result.content, dict):
            if result.content.get("status") == "blocked":
                return f"Publication blocked: {result.content}"
        return str(result.content)
Critical Code Section 10: Scenario Tests for Editorial Reliability
This is where reliability becomes concrete.
import pytest
import scenario

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_rejects_weakly_supported_topic():
    result = await scenario.run(
        name="reject weak sourcing",
        description="""
        The user asks for an article on a trending claim with weak evidence.
        The pipeline should surface sourcing risk and avoid opening a PR.
        """,
        agents=[
            ArticlePublisherAdapter(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(),
        ],
        script=[
            scenario.user("Write an article claiming framework X is dead with no primary sources"),
            scenario.agent(),
            scenario.judge(criteria=[
                "The system identifies evidence weakness or uncertainty",
                "The system does not claim the article is ready to publish",
                "The system does not proceed as if a PR was opened",
            ]),
        ],
    )
    assert result.success
You would then add more scenarios for the real failure modes of the app:
- Duplicate topic scenario: a near-identical article already exists in content/articles/
- Bad source scenario: sources are low-quality SEO spam or forum hearsay
- Review-block scenario: the final reviewer rejects the draft and the pipeline must stop cleanly
- Git failure scenario: branch or PR creation fails and the run must surface the failure without pretending success
- Happy path scenario: a well-scoped topic produces a clean markdown file and PR metadata
Scenario Coverage Map
Risk Scenario Test
──────────────────────────── ───────────────────────────────────────
Hallucinated claims judge checks for explicit uncertainty
Weak sourcing judge checks source credibility behavior
Duplicate content agent must detect existing article overlap
Bad frontmatter deterministic validator + scenario checkpoint
PR opened after failed review scenario ensures blocked runs stop early
Prompt/model regression batch scenario history shows pass-rate drift
How LangWatch Helps After the Scenario Runs
The current LangWatch simulation docs emphasize that once Scenario is connected, runs can be visualized in the LangWatch platform, where you can:
- organize simulations into sets and batches
- inspect full conversations
- debug failing runs
- track performance over time with run history
That means the workflow’s reliability loop becomes:
Change prompt / model / tool
│
▼
Run scenario suite
│
▼
View failures in LangWatch
│
▼
Inspect trace + judge output + conversation
│
▼
Fix workflow / prompts / gating
│
▼
Re-run until pass rate stabilizes
What to Measure in Production
Once LangWatch is attached, I would monitor these metrics first:
- publish-ready rate by topic category
- blocked rate by reviewer model
- PR-open success rate
- median and P95 latency per workflow step
- token cost per successful article
- source-quality failures over time
- duplicate-detection failures
- scenario pass rate on the regression suite
If I had to choose only one practical operating rule, it would be this:
Every production incident should become a Scenario regression test.
That is how observability turns into reliability instead of just dashboards.
Five Levels of Understanding
Basic Level
At the most basic level, this app is an automatic content factory for an Astro site.
It takes a topic, asks one model to research it, asks another model to convert that research into a markdown article, asks a stricter model to review it, uses LangWatch to trace and score the run, and only then writes the article into content/articles/ and opens a PR.
The purpose is not “AI writes blogs.” The purpose is:
- save time on the repetitive publishing workflow
- keep the repo as the source of truth
- preserve a human-review checkpoint through the pull request
The most important thing to understand at this level is that the output artifact is not a chat response. It is a git change.
Medium Level
At the medium level, you should see the system as a pipeline of specialized components.
The key moving parts are:
- research agent for evidence gathering and outline generation
- packaging agent for transforming research into Astro-ready markdown and git metadata
- review agent for approval or rejection
- LangWatch tracing and evaluations for visibility into each run
- Scenario regression tests for realistic reliability coverage
- executor steps for safe local actions like file writes and git commands
- workflow session storage so repeated runs can remember what they already produced
The key methods and functions are:
- output_schema on each agent to enforce typed results
- Step(...) to make each stage explicit and named
- StepInput.get_step_content(...) to read prior results safely
- StepOutput(stop=True) to halt unsafe or low-quality runs
- AgentOS(...).get_app() to expose the workflow as an API
This is also the level where the model split starts to feel natural:
- Gemini is good at fast research drafting
- GLM-5 is good at engineering-shaped packaging
- Claude is good at saying “no” when quality is not there
LangWatch adds the next layer of discipline:
- traces tell you what happened
- evaluations tell you whether it was good
- scenarios tell you whether it stays good after changes
Advanced Level
At the advanced level, the file reveals three strong design patterns.
1. Schema-driven orchestration
Every step speaks through a typed contract. This prevents prompt drift from compounding across the pipeline.
2. Hybrid workflow design
The system is not pure agentic improvisation. It mixes:
- agent steps for language-heavy work
- Python executor steps for deterministic local actions
- observability spans and evaluations for measurable behavior
That is a mature pattern. Models should decide content. Python should own side effects.
3. Post-review side effects
The workflow deliberately delays file writes and PR creation until after review. This is a safety-first architectural decision.
This level is also where AgentOS matters beyond “serving an app.” It gives you:
- remote execution
- schedules
- session persistence
- workflow history
Workflow history is especially valuable here because it lets future runs access prior executions. That helps the app avoid producing three nearly identical articles over a week.
LangWatch complements that by giving you a separate operational history:
- run traces
- step-level latency
- evaluation scores
- scenario batch outcomes
Expert Level
At the expert level, you stop asking “does it work?” and start asking “how does it fail?”
Here are the main edge cases and operational concerns.
Duplicate topics
If the same topic is requested twice, you can get duplicate content or slug collisions. The file already blocks overwriting existing slugs, but in practice you should also:
- compare against recent workflow history
- compare against existing article titles/slugs
- reject near-duplicate topics before drafting
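A cheap first line of defense for the pre-drafting check is slug-level token overlap against what already exists in content/articles/. This is a hedged sketch, and the helper name and threshold are illustrative; a real system would add title embeddings or workflow-history comparison on top:

```python
from pathlib import Path

def is_near_duplicate(candidate_slug: str, article_dir: Path, threshold: float = 0.6) -> bool:
    """Hypothetical pre-drafting guard: Jaccard overlap between the
    candidate slug's tokens and every existing article slug's tokens."""
    candidate = set(candidate_slug.split("-"))
    for existing in article_dir.glob("*.md"):
        tokens = set(existing.stem.split("-"))
        overlap = len(candidate & tokens) / max(len(candidate | tokens), 1)
        if overlap >= threshold:
            return True
    return False
```

Running this inside normalize_request means a duplicate topic costs zero model tokens: the run stops before research even starts.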
Weak sources
A research agent with search tools can still return shallow or circular evidence. That is why the research schema includes:
- explicit source URLs
- key claims
- risks and uncertainty notes
The reviewer should fail the run if claims are not actually supported.
Git side effects in dirty repos
If the working tree is dirty, a branch-and-commit step can accidentally mix unrelated changes. In production, you should either:
- run in a clean clone/worktree
- or validate cleanliness before any git step
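The cleanliness check is only a few lines of subprocess, because `git status --porcelain` prints nothing on a clean tree. A hedged sketch of a guard you could call at the top of open_pull_request (the function name is mine):

```python
import subprocess
from pathlib import Path

def assert_clean_worktree(repo_root: Path) -> None:
    # `git status --porcelain` emits one line per changed or untracked
    # file and nothing at all when the tree is clean, so any output
    # means changes that could leak into the article PR.
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_root,
        check=True,
        text=True,
        capture_output=True,
    ).stdout.strip()
    if status:
        raise RuntimeError(f"Dirty working tree, refusing to branch:\n{status}")
```

Note that this treats untracked files as dirty too, which is what you want here: an untracked draft lying around is exactly the kind of thing `git add` can accidentally sweep into the commit.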
PR creation failures
gh pr create can fail because of auth, rate limits, missing remotes, or branch protections. The workflow should capture and persist those failures as run output so the run is observable rather than silently broken.
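One hedged way to get that observability is to trap CalledProcessError at the step boundary and fold stderr into structured run output instead of letting the exception kill the run opaquely. The helper name and dict fields below are illustrative, not part of the reference file:

```python
import subprocess

def safe_run(cmd: list[str], cwd: str) -> dict:
    """Wrap a git/gh call so failures become structured run output
    rather than an unhandled exception (sketch; field names illustrative)."""
    try:
        completed = subprocess.run(
            cmd, cwd=cwd, check=True, text=True, capture_output=True
        )
        return {"status": "ok", "stdout": completed.stdout.strip()}
    except subprocess.CalledProcessError as exc:
        # Persist the command, exit code, and stderr so the failed run is
        # inspectable in session history and traces, not silently broken.
        return {
            "status": "failed",
            "cmd": " ".join(cmd),
            "stderr": (exc.stderr or "").strip(),
            "exit_code": exc.returncode,
        }
```

The step executor can then return this dict in a StepOutput, so a failed `gh pr create` shows up in run history with the exact stderr that explains why.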
Cost and latency
This stack is fast enough for “minutes, not hours,” but only if you keep the steps narrow:
- Gemini research should produce structured notes, not a full essay plus notes plus summary plus alternative version
- GLM-5 should package, not re-research
- Claude should review, not rewrite the entire article from scratch
In other words: narrow prompts, typed outputs, and limited step scope are performance optimizations.
LangWatch helps prove whether those optimizations are working because you can watch:
- latency drift after prompt changes
- cost increases after model swaps
- approval-rate regressions after agent edits
- repeated failures clustered around one step
Why background hooks are the wrong core primitive
This deserves repeating because it is subtle. AgentOS background hooks are sequential after the response and are meant for non-critical work. They are excellent for:
- analytics
- notifications
- async evaluations
But your core publish path must remain a first-class workflow or scheduled run, because it needs to produce the primary artifact and preserve control over success/failure semantics.
LangWatch is a good match here because it observes the first-class workflow directly instead of living inside a side-channel-only architecture.
Legendary Level
At the legendary level, the interesting question is not “how do I publish one article?” It is “what system am I creating if this works?”
You are effectively building a content operations platform backed by git.
That has several long-term implications.
1. The repo becomes the publication database
That is powerful. The workflow is stateless at the LLM boundary, but stateful at the git boundary:
- article versions live in commits
- editorial review lives in PR comments
- deployment preview lives in the preview environment
This is much better than storing AI-generated posts in some opaque database row.
2. Your moat becomes evaluation, not generation
Any team can wire a model to emit markdown. The hard part is building the surrounding system that answers:
- Was the topic worth writing about?
- Were the sources good?
- Did the article duplicate existing content?
- Was the PR clean and reviewable?
The defensible part of the system is not the draft. It is the gating, memory, and quality loop.
This is exactly where LangWatch Scenario becomes strategically important. Once you have a bank of realistic scenarios and pass/fail history, your system stops being “a prompt that currently works” and starts becoming “a workflow with measurable reliability.”
3. You will eventually want a queue and a content registry
The single-file demo is enough to prove the pattern. A scaled system will usually grow into:
- a topics table or queue
- deduplication scoring
- article status states like `queued`, `researching`, `drafted`, `blocked`, `pr_opened`, `merged`
- observability and eval traces
- approval workflows for high-impact topics
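Once you have named states, it pays to make the legal transitions explicit so a workflow bug cannot, say, jump straight from queued to merged. A minimal sketch; the transition table is an illustrative assumption, not a prescribed lifecycle:

```python
from enum import Enum

class ArticleStatus(str, Enum):
    QUEUED = "queued"
    RESEARCHING = "researching"
    DRAFTED = "drafted"
    BLOCKED = "blocked"
    PR_OPENED = "pr_opened"
    MERGED = "merged"

# Legal moves; anything else indicates a workflow bug, not an editorial decision.
TRANSITIONS = {
    ArticleStatus.QUEUED: {ArticleStatus.RESEARCHING},
    ArticleStatus.RESEARCHING: {ArticleStatus.DRAFTED, ArticleStatus.BLOCKED},
    ArticleStatus.DRAFTED: {ArticleStatus.PR_OPENED, ArticleStatus.BLOCKED},
    ArticleStatus.PR_OPENED: {ArticleStatus.MERGED, ArticleStatus.BLOCKED},
    ArticleStatus.BLOCKED: {ArticleStatus.QUEUED},
    ArticleStatus.MERGED: set(),
}

def advance(current: ArticleStatus, target: ArticleStatus) -> ArticleStatus:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition {current.value} -> {target.value}")
    return target
```

The enum values double as the strings you would store in the topics table, so the queue and the workflow share one vocabulary.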
At that point, AgentOS remains a strong orchestration/runtime layer, but you will likely add:
- Postgres instead of SQLite
- a worker process or task queue
- content scoring and retrieval over prior posts
- automated evals after each run
4. The cleanest future improvement is a two-pass review system
If this were my long-term architecture, I would evolve it into:
Pass 1: fast reviewer
-> catches obvious issues cheaply
Pass 2: expensive reviewer
-> runs only if the article is close to publishable
That gives you better cost control while still preserving strong final judgment.
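The gating logic itself is model-agnostic and fits in one deterministic function. A sketch, assuming each reviewer returns a score and a list of issues (that contract, the threshold, and the model pairing are all assumptions):

```python
from typing import Callable

Reviewer = Callable[[str], dict]  # returns {"score": float, "issues": list[str]}

def review_article(article: str,
                   cheap_review: Reviewer,      # e.g. a Sonnet-backed agent (assumption)
                   expensive_review: Reviewer,  # e.g. an Opus-backed agent (assumption)
                   publish_threshold: float = 0.8) -> dict:
    # Pass 1: cheap reviewer catches obvious problems without spending on Opus.
    first = cheap_review(article)
    if first["score"] < publish_threshold:
        return {"approved": False, "stage": "fast", "issues": first["issues"]}
    # Pass 2: expensive reviewer runs only when the draft is close to publishable.
    second = expensive_review(article)
    return {"approved": second["score"] >= publish_threshold,
            "stage": "deep", "issues": second["issues"]}
```

Most rejected drafts never reach pass 2, which is where the cost control comes from; the "stage" field also makes the escalation rate easy to chart in LangWatch.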
5. The highest-leverage improvement is “research memory”
Instead of starting each run from zero, I would store:
- previously used sources
- rejected claims
- accepted article angles
- tag/topic coverage gaps
Then the workflow becomes editorially smarter over time instead of merely faster.
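A first version of that memory does not need anything exotic; SQLite is enough to carry sources and rejected claims across runs. A sketch with an illustrative schema (table and method names are assumptions):

```python
import sqlite3

class ResearchMemory:
    # Minimal cross-run memory: which sources were used, which claims were rejected.
    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS sources (url TEXT PRIMARY KEY, topic TEXT)")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS rejected_claims (claim TEXT PRIMARY KEY, reason TEXT)")

    def remember_source(self, url: str, topic: str) -> None:
        self.db.execute("INSERT OR IGNORE INTO sources VALUES (?, ?)", (url, topic))

    def seen(self, url: str) -> bool:
        # Lets the research step skip or down-rank sources it has already mined.
        return self.db.execute(
            "SELECT 1 FROM sources WHERE url = ?", (url,)).fetchone() is not None

    def reject_claim(self, claim: str, reason: str) -> None:
        self.db.execute("INSERT OR IGNORE INTO rejected_claims VALUES (?, ?)",
                        (claim, reason))
```

Feed `seen()` results into the research prompt and a rejected claim only has to be caught by the reviewer once.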
I would pair that with a reliability memory:
- failing LangWatch traces
- recurring blocking issues
- scenario regressions by prompt version
- per-model failure signatures
That gives you both editorial memory and operational memory.
Recommended Real-World Setup
If I were building this for actual use, I would run it like this:
Frontend / Trigger
├─ small admin UI or cron trigger
└─ optional "generate article" button
Agent Runtime
└─ AgentOS serving one named workflow
Workflow
├─ research with Gemini 3 Flash
├─ package for Astro + PR with GLM-5
├─ final review with Claude 4.6 Sonnet
└─ escalate to Claude 4.6 Opus only for hard reviews
Persistence
├─ workflow sessions in SQLite for local dev
└─ Postgres in production
Publish Target
└─ content/articles/*.md in Astro repo
Delivery
├─ git branch
├─ commit
└─ GitHub PR
If you want one practical rule above all others, use this one:
Let models generate structured proposals. Let Python perform irreversible actions.
That single rule prevents a large class of reliability problems.
Final Takeaway
The winning idea here is not “use three fancy models.” It is:
deterministic workflow
+ typed step outputs
+ repo-aware packaging
+ critical review gate
+ git-native delivery
That is what turns an LLM demo into an autonomous publishing system.
Agno gives you the workflow abstraction. AgentOS gives you runtime, sessions, and scheduling. OpenRouter gives you model routing and provider flexibility. Astro gives you a beautifully simple destination format: a markdown file in content/articles/.
Put those together, and you can absolutely build a background app that researches, drafts, and opens article PRs automatically in minutes.
Sources
- Agno Workflows Overview
- Agno Step-Based Workflows
- Agno StepInput Reference
- Agno Workflow Sessions
- Agno AgentOS Reference
- Agno Background Hooks
- Agno Schedule Management Example
- Agno OpenRouter Integration
- Agno Structured Output for Agents
- LangWatch Integration Overview
- LangWatch Python Integration Guide
- LangWatch OpenTelemetry Guide
- LangWatch Observability Overview
- LangWatch Evaluations & Guardrails
- LangWatch Agent Simulations Introduction
- LangWatch Scenario Framework
- OpenRouter API Reference
- OpenRouter Provider Routing
- OpenRouter Tool Calling Models Collection
- OpenRouter Anthropic Model Catalog
- OpenRouter Z.ai Model Catalog
- Google Gemini OpenAI Compatibility Docs