You already have the perfect publish target for this kind of agentic system: an Astro app where a new markdown file in content/articles/ is enough to create a new page. That means the hard part is not rendering. The hard part is building a reliable background pipeline that can research, draft, review, write, commit, and open a pull request without turning your repo into a hallucination machine.
This article shows how I would build that pipeline with Agno for agent/workflow orchestration, AgentOS for runtime and scheduling, and OpenRouter for model access. The architecture matches a real-world split of model responsibilities:
- Draft articles: google/gemini-3-flash-preview
- Code + PR + workflow shaping: z-ai/glm-5
- Final review / critical reasoning: anthropic/claude-4.6-sonnet or anthropic/claude-4.6-opus
As of March 20, 2026, those model identifiers are available in current vendor docs or OpenRouter model pages. The exact slugs may evolve, so treat them as a dated snapshot rather than eternal constants.
TL;DR
- Use Agno Workflow for the main publishing pipeline, not a single free-form agent.
- Run it through AgentOS so it can execute in the background with sessions, APIs, and schedules.
- Use OpenRouter to split model roles cleanly: Gemini 3 Flash for research, GLM-5 for packaging/PR prep, Claude 4.6 Sonnet or Opus for final review.
- Write the final artifact into content/articles/*.md, because Astro already auto-discovers new markdown articles in this repo.
- Add LangWatch for tracing, evaluations, and Scenario-based regression tests so the system is observable and reliable, not just autonomous.
- Keep irreversible actions deterministic: let models propose content, but let Python write files, run git, and open the PR.
The Goal
We want an app that does this:
1. Receive a topic or scheduled content brief
2. Research the web and your internal context
3. Build an Astro-ready markdown article
4. Review it critically before publication
5. Write a new file into content/articles/
6. Create a branch, commit, and open a PR
7. Do all of it in the background, usually within a few minutes
In this repo, the publish directory is content/articles/, not just articles/, so the example below targets the real structure you already have.
Why This Structure Works
This problem is not just “run one agent.” It is a repeatable pipeline with state, checkpoints, file system side effects, and git operations. That makes Agno Workflow a better fit than a single free-form agent.
Agno Workflows are built for defined steps whose output flows to the next step. AgentOS then gives that workflow a production runtime, API surface, sessions, and scheduling. OpenRouter gives us one gateway for multiple model families without rewriting the orchestration code each time.
The split of responsibilities looks like this:
Agno
├─ defines agents, schemas, and workflow steps
├─ enforces predictable sequencing
└─ makes it easy to mix model steps and Python executor steps
AgentOS
├─ serves the workflow as an API
├─ stores run/session history
├─ exposes schedules and run management
└─ makes background automation operational
OpenRouter
├─ routes requests to Gemini / GLM / Claude
├─ normalizes tool-calling and request shape
├─ supports provider controls and fallbacks
└─ keeps model switching mostly configuration-level
Astro
├─ needs only a markdown file in content/articles/
├─ auto-discovers the article
└─ turns git changes into a previewable site via PR
The Reference File We Are Studying
To make this teachable, I am going to center the explanation on one reference file:
article_publisher.py
Its purpose is simple:
Take a content request in,
run a multi-model workflow,
emit one Astro markdown file,
then create a GitHub PR.
Here is the high-level file structure:
from __future__ import annotations
import os
import re
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import Literal
from uuid import uuid4
from agno.agent import Agent
from agno.db.sqlite import SqliteDb
from agno.models.openrouter import OpenRouterResponses
from agno.os import AgentOS
from agno.tools.duckduckgo import DuckDuckGoTools
from agno.workflow import Step, Workflow
from agno.workflow.types import StepInput, StepOutput
from pydantic import BaseModel, Field
At a glance, the file has five responsibilities:
- Declare typed schemas for every stage.
- Configure model-specific agents through OpenRouter.
- Define deterministic workflow steps.
- Perform file system and git side effects in Python executor steps.
- Serve and schedule the workflow through AgentOS.
Critical Code Section 1: Typed Contracts Between Steps
The most important architectural choice is not the prompt. It is the schema boundary between steps.
class ResearchPacket(BaseModel):
    topic: str
    angle: str = Field(description="Why this angle is interesting now")
    audience: str
    key_claims: list[str]
    sources: list[str] = Field(description="High quality source URLs")
    outline: list[str]
    risks: list[str] = Field(description="Areas that need verification")

class AstroArticlePackage(BaseModel):
    title: str
    excerpt: str
    tags: list[str]
    slug: str
    seo_title: str
    seo_description: str
    body_markdown: str
    commit_message: str
    pr_title: str
    pr_body: str

class ReviewDecision(BaseModel):
    publish_ready: bool
    blocking_issues: list[str]
    revision_notes: list[str]
    final_title: str | None = None
    final_excerpt: str | None = None
Why this matters:
- It prevents every step from guessing what the previous step meant.
- It makes failures visible earlier.
- It lets you stop before touching git if quality is not high enough.
- It keeps the workflow explainable when you revisit a failed run a day later.
Without schemas, your pipeline becomes “some markdown-ish text went into another prompt and now we hope for the best.”
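Because these contracts are ordinary Pydantic models, the boundary is enforced by validation, not by prompt discipline. A minimal sketch (using a trimmed-down stand-in for the ReviewDecision model above) shows how a malformed payload fails loudly at the boundary instead of flowing downstream:

```python
from pydantic import BaseModel, ValidationError

# Trimmed stand-in for the ReviewDecision contract above.
class ReviewDecision(BaseModel):
    publish_ready: bool
    blocking_issues: list[str]

# A well-formed payload parses cleanly into a typed object.
ok = ReviewDecision.model_validate(
    {"publish_ready": False, "blocking_issues": ["unsupported claim in intro"]}
)
assert not ok.publish_ready

# A malformed payload (wrong type, missing field) raises immediately,
# so the workflow can stop before any git side effects.
try:
    ReviewDecision.model_validate({"publish_ready": "maybe"})
except ValidationError as exc:
    print(f"blocked at the schema boundary: {exc.error_count()} errors")
```

This is the whole value proposition of `output_schema`: a bad model response becomes a typed error at a named step, not a mystery three steps later.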
Critical Code Section 2: Model Specialization Through OpenRouter
This is where the ideal real-world setup becomes concrete.
def openrouter_model(primary: str, fallbacks: list[str] | None = None) -> OpenRouterResponses:
    models = [primary, *(fallbacks or [])]
    return OpenRouterResponses(
        id=primary,
        models=models,
        app_name="astro-content-factory",
    )

research_agent = Agent(
    name="Research Drafter",
    model=openrouter_model(
        "google/gemini-3-flash-preview",
        ["google/gemini-2.5-flash"],
    ),
    tools=[DuckDuckGoTools()],
    output_schema=ResearchPacket,
    instructions="""
    Research the topic with current web evidence.
    Prefer primary sources.
    Return a tight article angle, source list, outline, and explicit uncertainty notes.
    """,
)

packaging_agent = Agent(
    name="Astro Packager",
    model=openrouter_model(
        "z-ai/glm-5",
        ["z-ai/glm-5-turbo"],
    ),
    output_schema=AstroArticlePackage,
    instructions="""
    Convert the research packet into a production-ready Astro markdown article package.
    The output must be repo-ready: title, excerpt, tags, slug, SEO fields, markdown body,
    commit message, PR title, and PR body.
    """,
)

review_agent = Agent(
    name="Critical Reviewer",
    model=openrouter_model(
        "anthropic/claude-4.6-sonnet",
        ["anthropic/claude-4.6-opus"],
    ),
    output_schema=ReviewDecision,
    instructions="""
    Review the article package like a demanding editor.
    Block publication if the claims are weak, unsupported, repetitive, or misleading.
    Approve only if the article is publishable with minimal editorial risk.
    """,
)
This is the exact pattern I like:
- Gemini 3 Flash does the high-throughput research synthesis.
- GLM-5 produces the repo-aware package and PR metadata.
- Claude 4.6 Sonnet or Opus plays the hard-nosed editor.
That is not just model fan fiction. It encodes a practical separation of labor:
- cheap/fast model for exploration
- strong engineering model for structured package generation
- expensive/reliable model for final judgment
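If you want to see what the Agno wrapper is doing for you underneath, the same role split maps onto OpenRouter's OpenAI-compatible request body, where a `models` list provides ordered fallback routing. This is a hedged sketch of request construction only (the helper name is mine, and you would POST the payload to https://openrouter.ai/api/v1/chat/completions with your API key):

```python
def openrouter_payload(primary: str, fallbacks: list[str], prompt: str) -> dict:
    """Build an OpenRouter chat request with fallback routing.

    OpenRouter accepts an OpenAI-style body plus a "models" list: if the
    primary slug errors or is unavailable, the request falls through to
    the next slug in order. Helper name and shape are illustrative.
    """
    return {
        "model": primary,
        "models": [primary, *fallbacks],
        "messages": [{"role": "user", "content": prompt}],
    }

# The reviewer role keeps Opus as the escalation path.
payload = openrouter_payload(
    "anthropic/claude-4.6-sonnet",
    ["anthropic/claude-4.6-opus"],
    "Review this draft like a demanding editor.",
)
assert payload["models"] == ["anthropic/claude-4.6-sonnet", "anthropic/claude-4.6-opus"]
```

The practical payoff: swapping a role's model is a one-line slug change, not an orchestration rewrite.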
Critical Code Section 3: Deterministic Workflow Steps
The core pipeline should be a Workflow, not a chat loop.
def normalize_request(step_input: StepInput) -> StepOutput:
    raw = str(step_input.input or "").strip()
    if not raw:
        return StepOutput(
            step_name="normalize_request",
            content="Missing topic",
            stop=True,
        )
    return StepOutput(
        step_name="normalize_request",
        content={
            "topic": raw,
            "requested_at": datetime.now(timezone.utc).isoformat(),
        },
    )

def quality_gate(step_input: StepInput) -> StepOutput:
    review = step_input.previous_step_content
    if not isinstance(review, ReviewDecision):
        return StepOutput(
            step_name="quality_gate",
            content="Review step did not return ReviewDecision",
            stop=True,
        )
    if not review.publish_ready:
        return StepOutput(
            step_name="quality_gate",
            content={
                "status": "blocked",
                "issues": review.blocking_issues,
                "notes": review.revision_notes,
            },
            stop=True,
        )
    return StepOutput(
        step_name="quality_gate",
        content={"status": "approved"},
    )
workflow = Workflow(
    name="astro_article_publisher",
    db=SqliteDb(db_file="tmp/article_publisher.db"),
    add_workflow_history_to_steps=True,
    num_history_runs=10,
    steps=[
        Step(name="normalize_request", executor=normalize_request),
        Step(name="research", agent=research_agent),
        Step(name="package_for_astro", agent=packaging_agent),
        Step(name="review", agent=review_agent),
        Step(name="quality_gate", executor=quality_gate),
        Step(name="write_article", executor=write_article_file),
        Step(name="open_pull_request", executor=open_pull_request),
    ],
)
This section is the heart of the system.
Why?
- It makes the flow auditable.
- It gives each side effect a named checkpoint.
- It separates “thinking” steps from “do irreversible stuff” steps.
- It lets you stop safely before file writes and PR creation.
Notice the order:
research -> package -> review -> gate -> write -> PR
That is deliberate. The git side effects happen after critical review, not before.
Critical Code Section 4: Writing the Astro Markdown File Safely
This is where the abstract workflow becomes a publishing system.
REPO_ROOT = Path(os.environ["ASTRO_REPO_ROOT"]).resolve()
ARTICLE_DIR = REPO_ROOT / "content" / "articles"

def slugify(value: str) -> str:
    normalized = re.sub(r"[^a-zA-Z0-9]+", "-", value.lower()).strip("-")
    return re.sub(r"-{2,}", "-", normalized)

def write_article_file(step_input: StepInput) -> StepOutput:
    package = step_input.get_step_content("package_for_astro")
    review = step_input.get_step_content("review")
    if not isinstance(package, AstroArticlePackage):
        raise ValueError("package_for_astro must return AstroArticlePackage")
    slug = slugify(package.slug or package.title)
    article_path = ARTICLE_DIR / f"{slug}.md"
    if article_path.exists():
        raise FileExistsError(f"Refusing to overwrite existing article: {article_path}")
    final_title = (
        review.final_title
        if isinstance(review, ReviewDecision) and review.final_title
        else package.title
    )
    final_excerpt = (
        review.final_excerpt
        if isinstance(review, ReviewDecision) and review.final_excerpt
        else package.excerpt
    )
    markdown = f'''---
title: "{final_title}"
excerpt: "{final_excerpt}"
tags: {package.tags!r}
date: "{datetime.now().date().isoformat()}"
seoTitle: "{package.seo_title}"
seoDescription: "{package.seo_description}"
canonical: "https://luismori.dev/article/{slug}"
---
{package.body_markdown}
'''
    article_path.write_text(markdown, encoding="utf-8")
    return StepOutput(
        step_name="write_article",
        content={
            "status": "written",
            "slug": slug,
            "path": str(article_path),
        },
    )
This function is deceptively simple. It carries several important production ideas:
- Idempotency: do not silently overwrite an existing slug.
- Separation of concerns: the model proposes content, Python performs the file write.
- Last-mile normalization: final slug and final title are resolved outside the model.
- Repo-specific correctness: it writes to content/articles/, which is what this Astro app actually loads.
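One sharp edge worth flagging: the f-string frontmatter above breaks silently if a model-proposed title or excerpt contains a double quote. A small hedged hardening sketch (helper names are mine, not part of the reference file) escapes every model-controlled string before it is interpolated into YAML:

```python
def yaml_quote(value: str) -> str:
    # Minimal escaping for a YAML double-quoted scalar:
    # backslashes first, then embedded double quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def render_frontmatter(title: str, excerpt: str, slug: str) -> str:
    # Hypothetical helper: same shape as the f-string frontmatter above,
    # but every model-controlled field goes through yaml_quote first.
    return (
        "---\n"
        f"title: {yaml_quote(title)}\n"
        f"excerpt: {yaml_quote(excerpt)}\n"
        f"canonical: {yaml_quote(f'https://luismori.dev/article/{slug}')}\n"
        "---\n"
    )

fm = render_frontmatter('The "Dead Framework" Myth', "Why X is not dead", "dead-framework-myth")
assert 'title: "The \\"Dead Framework\\" Myth"' in fm
```

A stricter version would parse the rendered frontmatter back with a YAML library as a deterministic validator before the file write, which is exactly the kind of check the review step cannot be trusted to do.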
Critical Code Section 5: Branch, Commit, and PR Automation
This is the point where many demos become fake. They generate markdown, but a human still has to do the git work. If you want true background publishing, the workflow must finish the job.
def run(cmd: list[str], cwd: Path) -> str:
    completed = subprocess.run(
        cmd,
        cwd=cwd,
        check=True,
        text=True,
        capture_output=True,
    )
    return completed.stdout.strip()

def open_pull_request(step_input: StepInput) -> StepOutput:
    package = step_input.get_step_content("package_for_astro")
    write_result = step_input.get_step_content("write_article")
    if not isinstance(package, AstroArticlePackage):
        raise ValueError("package_for_astro must return AstroArticlePackage")
    branch = f"codex/article-{uuid4().hex[:8]}"
    run(["git", "checkout", "-b", branch], cwd=REPO_ROOT)
    run(["git", "add", str(write_result["path"])], cwd=REPO_ROOT)
    run(["git", "commit", "-m", package.commit_message], cwd=REPO_ROOT)
    run(["git", "push", "-u", "origin", branch], cwd=REPO_ROOT)
    pr_url = run(
        [
            "gh",
            "pr",
            "create",
            "--base",
            "main",
            "--title",
            package.pr_title,
            "--body",
            package.pr_body,
        ],
        cwd=REPO_ROOT,
    )
    return StepOutput(
        step_name="open_pull_request",
        content={"branch": branch, "pr_url": pr_url},
    )
This is where GLM-5 earns its keep. The model is not executing git. Python is. But GLM-5 is supplying the structured metadata that makes the git layer clean and deterministic: commit message, PR title, PR body, and repo-shaped markdown.
Critical Code Section 6: Serving the Workflow with AgentOS
Now we make it operational.
agent_os = AgentOS(
    id="astro-content-factory",
    workflows=[workflow],
    scheduler=True,
)
app = agent_os.get_app()

if __name__ == "__main__":
    agent_os.serve(app=app, port=7777, reload=True)
This small section changes the whole system:
- your workflow becomes an API
- sessions and run history can be persisted
- schedules can trigger it
- other services can call it remotely
That is the difference between “a Python script that writes markdown” and “an agent app.”
The Control Flow Diagram
┌──────────────────────┐
│ Topic / Content Brief│
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ normalize_request │
│ validate + timestamp │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ research │
│ Gemini 3 Flash │
│ + web tools │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ package_for_astro │
│ GLM-5 │
│ markdown + PR data │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ review │
│ Claude 4.6 │
│ approve / block │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ quality_gate │
│ stop if weak │
└───────┬────────┬─────┘
│ │
blocked approved
│ │
▼ ▼
run ends write_article
│
▼
open_pull_request
│
▼
branch + PR URL
The Data Flow Diagram
Raw topic
│
▼
{ topic, requested_at }
│
▼
ResearchPacket
├─ angle
├─ sources
├─ key_claims
├─ outline
└─ risks
│
▼
AstroArticlePackage
├─ title
├─ excerpt
├─ tags
├─ slug
├─ seo_title
├─ seo_description
├─ body_markdown
├─ commit_message
├─ pr_title
└─ pr_body
│
▼
ReviewDecision
├─ publish_ready
├─ blocking_issues
├─ revision_notes
├─ final_title?
└─ final_excerpt?
│
▼
File write result
├─ slug
├─ path
└─ status
│
▼
PR result
├─ branch
└─ pr_url
The Component Relationship Diagram
┌─────────────────────────────┐
│ AgentOS │
│ API + sessions + schedules │
└──────────────┬──────────────┘
│
▼
┌─────────────────────────────┐
│ Agno Workflow │
│ astro_article_publisher │
└──────┬───────────┬──────────┘
│ │
│ ▼
│ Python Executors
│ ├─ normalize_request
│ ├─ quality_gate
│ ├─ write_article_file
│ └─ open_pull_request
│
▼
Agno Agents
├─ research_agent
├─ packaging_agent
└─ review_agent
│
▼
OpenRouter
├─ Gemini 3 Flash Preview
├─ GLM-5
└─ Claude 4.6 Sonnet / Opus
│
▼
External Systems
├─ Web search
├─ Astro repo filesystem
└─ GitHub CLI / PR API
How to Run It in the Background
This is the most important real-world distinction in the whole article.
If your goal is “all in background and in minutes automatically,” do not make the core publish pipeline a background hook.
Background hooks in AgentOS are great for:
- analytics
- notifications
- evaluations
- webhooks
They are not the right primitive for the main publication path because the docs explicitly note that background hooks run after the response is sent and cannot modify the request or response. They are for non-critical side work, not for the main artifact-creation path.
For the main job, use one of these patterns:
- AgentOS schedule
- An API-triggered workflow run
- An external queue or cron that calls the workflow endpoint
The cleanest version for content automation is:
Scheduler/cron
-> POST AgentOS workflow endpoint
-> workflow runs
-> article file is written
-> PR is opened
-> run history is stored
Example: schedule the workflow
AgentOS exposes schedule management APIs, so a separate bootstrap script can create a recurring job:
import httpx

client = httpx.Client(base_url="http://localhost:7777", timeout=30)
client.post(
    "/schedules",
    json={
        "name": "daily-article-run",
        "cron_expr": "0 9 * * 1-5",
        "endpoint": "/workflows/astro_article_publisher/runs",
        "method": "POST",
        "payload": {"message": "Research and draft one article about practical AI engineering"},
        "timezone": "America/Lima",
        "max_retries": 2,
        "retry_delay_seconds": 60,
    },
)
That gives you recurring runs without turning the publish logic into an awkward side effect of some unrelated request.
Add LangWatch for Monitoring, Observability, and Reliability
Autonomy without observability is just faster failure.
If this pipeline is going to run in the background and open PRs by itself, you need a reliability layer that answers:
- Which step failed?
- Which model produced the bad output?
- Which source list was used?
- Did quality regress after a prompt or model change?
- Are latency, token cost, and pass rate getting worse over time?
This is where LangWatch fits. LangWatch’s current docs position it as an observability and tracing platform built on OpenTelemetry, with SDKs for Python, TypeScript, and Go, plus agent simulations through Scenario.
The simplest mental model is:
AgentOS workflow runs the pipeline
│
▼
LangWatch trace wraps the full run
│
├─ span: normalize_request
├─ span: research
├─ span: package_for_astro
├─ span: review
├─ span: write_article
└─ span: open_pull_request
│
▼
Evaluations + dashboards + alerts + scenario history
Critical Code Section 7: Instrument the Workflow with LangWatch
The current LangWatch Python guide recommends calling langwatch.setup() early, then using traces and spans to capture end-to-end operations and nested steps.
import langwatch

langwatch.setup()

@langwatch.trace(name="astro_article_publish_run")
def run_publish_pipeline(topic: str, session_id: str, user_id: str = "content-bot"):
    current_trace = langwatch.get_current_trace()
    current_trace.update(
        metadata={
            "thread_id": session_id,
            "user_id": user_id,
            "workflow_name": "astro_article_publisher",
            "topic": topic,
        }
    )
    return workflow.run(input=topic, session_id=session_id)
This gives you a top-level trace for the whole publication attempt. Under that trace, each important function can be decorated as a span:
@langwatch.span(name="write_article_file")
def write_article_file(step_input: StepInput) -> StepOutput:
    package = step_input.get_step_content("package_for_astro")
    article_path = ARTICLE_DIR / f"{slugify(package.slug or package.title)}.md"
    markdown = render_markdown(package)
    article_path.write_text(markdown, encoding="utf-8")
    langwatch.get_current_span().update(
        input={"slug": package.slug, "title": package.title},
        output={"path": str(article_path), "status": "written"},
    )
    return StepOutput(
        step_name="write_article",
        content={"status": "written", "path": str(article_path)},
    )
That span-level instrumentation matters because when a run goes wrong, you usually do not care that “the workflow failed.” You care that:
- the research step used weak sources
- the review step blocked the article
- the git step failed after branch creation
LangWatch traces make those failures inspectable rather than mysterious.
Critical Code Section 8: Attach Evaluations to the Quality Gate
LangWatch’s current evaluation docs describe three useful mechanisms:
- client-side custom evaluations via add_evaluation()
- server-side managed evaluations
- guardrails that can influence application flow
For this publishing pipeline, the quality_gate step is the natural place to log these.
@langwatch.span(name="quality_gate")
def quality_gate(step_input: StepInput) -> StepOutput:
    review = step_input.previous_step_content
    if not isinstance(review, ReviewDecision):
        langwatch.get_current_span().add_evaluation(
            name="review-contract-valid",
            passed=False,
            details="Review step did not return ReviewDecision",
            is_guardrail=True,
        )
        return StepOutput(step_name="quality_gate", content="Invalid review payload", stop=True)
    passed = review.publish_ready and not review.blocking_issues
    langwatch.get_current_span().add_evaluation(
        name="publish-ready",
        passed=passed,
        details={
            "blocking_issues": review.blocking_issues,
            "revision_notes": review.revision_notes,
        },
        is_guardrail=True,
    )
    if not passed:
        return StepOutput(
            step_name="quality_gate",
            content={"status": "blocked", "issues": review.blocking_issues},
            stop=True,
        )
    return StepOutput(step_name="quality_gate", content={"status": "approved"})
This is a high-leverage pattern:
- the workflow still decides whether to continue
- LangWatch records the decision as structured evaluation data
- you can track pass/fail rate over time across runs, prompts, or model swaps
That transforms “Claude said this article was weak” into an observable metric.
LangWatch Reliability Architecture
Scheduler / API Trigger
│
▼
AgentOS workflow run
│
▼
LangWatch trace
├─ metadata: topic, session_id, user_id, workflow_name
├─ span: research
│ └─ evals: source_quality, freshness_check
├─ span: package_for_astro
│ └─ evals: schema_valid, frontmatter_valid
├─ span: review
│ └─ evals: publish_ready, citation_risk
├─ span: write_article
│ └─ evals: file_written
└─ span: open_pull_request
└─ evals: pr_opened
│
▼
Dashboard views
├─ pass rate by topic type
├─ failure rate by step
├─ latency by model
├─ token cost by run
└─ regression history by prompt version
Test-Based Scenarios with LangWatch Scenario
Observability tells you what happened in production. Scenario tests help you prevent regressions before production.
LangWatch’s Scenario framework is simulation-based testing for agents. Instead of writing fragile input-output assertions, you describe a realistic situation, let a simulated user interact with your agent, and judge the behavior at checkpoints during the conversation.
That is a much better fit for this content pipeline because reliability is not just about string equality. It is about whether the system:
- asks for clarification when the topic is too vague
- cites current, credible sources
- avoids duplicate articles
- blocks publication when the review is negative
- never opens a PR for a rejected draft
Critical Code Section 9: Wrap the Publishing App as a Scenario Adapter
Scenario works through an AgentAdapter interface with a call() method. For this workflow, the adapter can invoke the AgentOS workflow endpoint or call the local workflow.run() directly in tests.
import scenario

class ArticlePublisherAdapter(scenario.AgentAdapter):
    async def call(self, input: scenario.AgentInput):
        user_message = input.messages[-1]["content"]
        result = workflow.run(input=user_message, session_id="scenario-test")
        if result.content and isinstance(result.content, dict):
            if result.content.get("status") == "blocked":
                return f"Publication blocked: {result.content}"
        return str(result.content)
Critical Code Section 10: Scenario Tests for Editorial Reliability
This is where reliability becomes concrete.
import pytest
import scenario

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_rejects_weakly_supported_topic():
    result = await scenario.run(
        name="reject weak sourcing",
        description="""
        The user asks for an article on a trending claim with weak evidence.
        The pipeline should surface sourcing risk and avoid opening a PR.
        """,
        agents=[
            ArticlePublisherAdapter(),
            scenario.UserSimulatorAgent(),
            scenario.JudgeAgent(),
        ],
        script=[
            scenario.user("Write an article claiming framework X is dead with no primary sources"),
            scenario.agent(),
            scenario.judge(criteria=[
                "The system identifies evidence weakness or uncertainty",
                "The system does not claim the article is ready to publish",
                "The system does not proceed as if a PR was opened",
            ]),
        ],
    )
    assert result.success
You would then add more scenarios for the real failure modes of the app:
- Duplicate topic scenario: a near-identical article already exists in content/articles/
- Bad source scenario: sources are low-quality SEO spam or forum hearsay
- Review-block scenario: the final reviewer rejects the draft and the pipeline must stop cleanly
- Git failure scenario: branch or PR creation fails and the run must surface the failure without pretending success
- Happy path scenario: a well-scoped topic produces a clean markdown file and PR metadata
Scenario Coverage Map
Risk Scenario Test
──────────────────────────── ───────────────────────────────────────
Hallucinated claims judge checks for explicit uncertainty
Weak sourcing judge checks source credibility behavior
Duplicate content agent must detect existing article overlap
Bad frontmatter deterministic validator + scenario checkpoint
PR opened after failed review scenario ensures blocked runs stop early
Prompt/model regression batch scenario history shows pass-rate drift
How LangWatch Helps After the Scenario Runs
The current LangWatch simulation docs emphasize that once Scenario is connected, runs can be visualized in the LangWatch platform, where you can:
- organize simulations into sets and batches
- inspect full conversations
- debug failing runs
- track performance over time with run history
That means the workflow’s reliability loop becomes:
Change prompt / model / tool
│
▼
Run scenario suite
│
▼
View failures in LangWatch
│
▼
Inspect trace + judge output + conversation
│
▼
Fix workflow / prompts / gating
│
▼
Re-run until pass rate stabilizes
What to Measure in Production
Once LangWatch is attached, I would monitor these metrics first:
- publish-ready rate by topic category
- blocked rate by reviewer model
- PR-open success rate
- median and P95 latency per workflow step
- token cost per successful article
- source-quality failures over time
- duplicate-detection failures
- scenario pass rate on the regression suite
If I had to choose only one practical operating rule, it would be this:
Every production incident should become a Scenario regression test.
That is how observability turns into reliability instead of just dashboards.
Five Levels of Understanding
Basic Level
At the most basic level, this app is an automatic content factory for an Astro site.
It takes a topic, asks one model to research it, asks another model to convert that research into a markdown article, asks a stricter model to review it, uses LangWatch to trace and score the run, and only then writes the article into content/articles/ and opens a PR.
The purpose is not “AI writes blogs.” The purpose is:
- save time on the repetitive publishing workflow
- keep the repo as the source of truth
- preserve a human-review checkpoint through the pull request
The most important thing to understand at this level is that the output artifact is not a chat response. It is a git change.
Medium Level
At the medium level, you should see the system as a pipeline of specialized components.
The key moving parts are:
- research agent for evidence gathering and outline generation
- packaging agent for transforming research into Astro-ready markdown and git metadata
- review agent for approval or rejection
- LangWatch tracing and evaluations for visibility into each run
- Scenario regression tests for realistic reliability coverage
- executor steps for safe local actions like file writes and git commands
- workflow session storage so repeated runs can remember what they already produced
The key methods and functions are:
- output_schema on each agent to enforce typed results
- Step(...) to make each stage explicit and named
- StepInput.get_step_content(...) to read prior results safely
- StepOutput(stop=True) to halt unsafe or low-quality runs
- AgentOS(...).get_app() to expose the workflow as an API
This is also the level where the model split starts to feel natural:
- Gemini is good at fast research drafting
- GLM-5 is good at engineering-shaped packaging
- Claude is good at saying “no” when quality is not there
LangWatch adds the next layer of discipline:
- traces tell you what happened
- evaluations tell you whether it was good
- scenarios tell you whether it stays good after changes
Advanced Level
At the advanced level, the file reveals three strong design patterns.
1. Schema-driven orchestration
Every step speaks through a typed contract. This prevents prompt drift from compounding across the pipeline.
2. Hybrid workflow design
The system is not pure agentic improvisation. It mixes:
- agent steps for language-heavy work
- Python executor steps for deterministic local actions
- observability spans and evaluations for measurable behavior
That is a mature pattern. Models should decide content. Python should own side effects.
3. Post-review side effects
The workflow deliberately delays file writes and PR creation until after review. This is a safety-first architectural decision.
This level is also where AgentOS matters beyond “serving an app.” It gives you:
- remote execution
- schedules
- session persistence
- workflow history
Workflow history is especially valuable here because it lets future runs access prior executions. That helps the app avoid producing three nearly identical articles over a week.
LangWatch complements that by giving you a separate operational history:
- run traces
- step-level latency
- evaluation scores
- scenario batch outcomes
Expert Level
At the expert level, you stop asking “does it work?” and start asking “how does it fail?”
Here are the main edge cases and operational concerns.
Duplicate topics
If the same topic is requested twice, you can get duplicate content or slug collisions. The file already blocks overwriting existing slugs, but in practice you should also:
- compare against recent workflow history
- compare against existing article titles/slugs
- reject near-duplicate topics before drafting
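A cheap first line of defense for the pre-drafting check is slug-level token overlap against what already exists in content/articles/. This is a hedged sketch, and the helper name and threshold are illustrative; a real system would add title embeddings or workflow-history comparison on top:

```python
from pathlib import Path

def is_near_duplicate(candidate_slug: str, article_dir: Path, threshold: float = 0.6) -> bool:
    """Hypothetical pre-drafting guard: Jaccard overlap between the
    candidate slug's tokens and every existing article slug's tokens."""
    candidate = set(candidate_slug.split("-"))
    for existing in article_dir.glob("*.md"):
        tokens = set(existing.stem.split("-"))
        overlap = len(candidate & tokens) / max(len(candidate | tokens), 1)
        if overlap >= threshold:
            return True
    return False
```

Running this inside normalize_request means a duplicate topic costs zero model tokens: the run stops before research even starts.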
Weak sources
A research agent with search tools can still return shallow or circular evidence. That is why the research schema includes:
- explicit source URLs
- key claims
- risks and uncertainty notes
The reviewer should fail the run if claims are not actually supported.
Git side effects in dirty repos
If the working tree is dirty, a branch-and-commit step can accidentally mix unrelated changes. In production, you should either:
- run in a clean clone/worktree
- or validate cleanliness before any git step
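The cleanliness check is only a few lines of subprocess, because `git status --porcelain` prints nothing on a clean tree. A hedged sketch of a guard you could call at the top of open_pull_request (the function name is mine):

```python
import subprocess
from pathlib import Path

def assert_clean_worktree(repo_root: Path) -> None:
    # `git status --porcelain` emits one line per changed or untracked
    # file and nothing at all when the tree is clean, so any output
    # means changes that could leak into the article PR.
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_root,
        check=True,
        text=True,
        capture_output=True,
    ).stdout.strip()
    if status:
        raise RuntimeError(f"Dirty working tree, refusing to branch:\n{status}")
```

Note that this treats untracked files as dirty too, which is what you want here: an untracked draft lying around is exactly the kind of thing `git add` can accidentally sweep into the commit.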
PR creation failures
gh pr create can fail because of auth, rate limits, missing remotes, or branch protections. The workflow should capture and persist those failures as run output so the run is observable rather than silently broken.
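One hedged way to get that observability is to trap CalledProcessError at the step boundary and fold stderr into structured run output instead of letting the exception kill the run opaquely. The helper name and dict fields below are illustrative, not part of the reference file:

```python
import subprocess

def safe_run(cmd: list[str], cwd: str) -> dict:
    """Wrap a git/gh call so failures become structured run output
    rather than an unhandled exception (sketch; field names illustrative)."""
    try:
        completed = subprocess.run(
            cmd, cwd=cwd, check=True, text=True, capture_output=True
        )
        return {"status": "ok", "stdout": completed.stdout.strip()}
    except subprocess.CalledProcessError as exc:
        # Persist the command, exit code, and stderr so the failed run is
        # inspectable in session history and traces, not silently broken.
        return {
            "status": "failed",
            "cmd": " ".join(cmd),
            "stderr": (exc.stderr or "").strip(),
            "exit_code": exc.returncode,
        }
```

The step executor can then return this dict in a StepOutput, so a failed `gh pr create` shows up in run history with the exact stderr that explains why.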
Cost and latency
This stack is fast enough for “minutes, not hours,” but only if you keep the steps narrow:
- Gemini research should produce structured notes, not a full essay plus notes plus summary plus alternative version
- GLM-5 should package, not re-research
- Claude should review, not rewrite the entire article from scratch
In other words: narrow prompts, typed outputs, and limited step scope are performance optimizations.
LangWatch helps prove whether those optimizations are working because you can watch:
- latency drift after prompt changes
- cost increases after model swaps
- approval-rate regressions after agent edits
- repeated failures clustered around one step
Why background hooks are the wrong core primitive
This deserves repeating because it is subtle. AgentOS background hooks are sequential after the response and are meant for non-critical work. They are excellent for:
- analytics
- notifications
- async evaluations
But your core publish path must remain a first-class workflow or scheduled run, because it needs to produce the primary artifact and preserve control over success/failure semantics.
LangWatch is a good match here because it observes the first-class workflow directly instead of living inside a side-channel-only architecture.
Legendary Level
At the legendary level, the interesting question is not “how do I publish one article?” It is “what system am I creating if this works?”
You are effectively building a content operations platform backed by git.
That has several long-term implications.
1. The repo becomes the publication database
That is powerful. The workflow is stateless at the LLM boundary, but stateful at the git boundary:
- article versions live in commits
- editorial review lives in PR comments
- deployment preview lives in the preview environment
This is much better than storing AI-generated posts in some opaque database row.
2. Your moat becomes evaluation, not generation
Any team can wire a model to emit markdown. The hard part is building the surrounding system that answers:
- Was the topic worth writing about?
- Were the sources good?
- Did the article duplicate existing content?
- Was the PR clean and reviewable?
The defensible part of the system is not the draft. It is the gating, memory, and quality loop.
This is exactly where LangWatch Scenario becomes strategically important. Once you have a bank of realistic scenarios and pass/fail history, your system stops being “a prompt that currently works” and starts becoming “a workflow with measurable reliability.”
3. You will eventually want a queue and a content registry
The single-file demo is enough to prove the pattern. A scaled system will usually grow into:
- a topics table or queue
- deduplication scoring
- article status states like `queued`, `researching`, `drafted`, `blocked`, `pr_opened`, `merged`
- observability and eval traces
- approval workflows for high-impact topics
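Once you have named states, it pays to make the legal transitions explicit so a workflow bug cannot, say, jump straight from queued to merged. A minimal sketch; the transition table is an illustrative assumption, not a prescribed lifecycle:

```python
from enum import Enum

class ArticleStatus(str, Enum):
    QUEUED = "queued"
    RESEARCHING = "researching"
    DRAFTED = "drafted"
    BLOCKED = "blocked"
    PR_OPENED = "pr_opened"
    MERGED = "merged"

# Legal moves; anything else indicates a workflow bug, not an editorial decision.
TRANSITIONS = {
    ArticleStatus.QUEUED: {ArticleStatus.RESEARCHING},
    ArticleStatus.RESEARCHING: {ArticleStatus.DRAFTED, ArticleStatus.BLOCKED},
    ArticleStatus.DRAFTED: {ArticleStatus.PR_OPENED, ArticleStatus.BLOCKED},
    ArticleStatus.PR_OPENED: {ArticleStatus.MERGED, ArticleStatus.BLOCKED},
    ArticleStatus.BLOCKED: {ArticleStatus.QUEUED},
    ArticleStatus.MERGED: set(),
}

def advance(current: ArticleStatus, target: ArticleStatus) -> ArticleStatus:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Illegal transition {current.value} -> {target.value}")
    return target
```

The enum values double as the strings you would store in the topics table, so the queue and the workflow share one vocabulary.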
At that point, AgentOS remains a strong orchestration/runtime layer, but you will likely add:
- Postgres instead of SQLite
- a worker process or task queue
- content scoring and retrieval over prior posts
- automated evals after each run
4. The cleanest future improvement is a two-pass review system
If this were my long-term architecture, I would evolve it into:
Pass 1: fast reviewer
-> catches obvious issues cheaply
Pass 2: expensive reviewer
-> runs only if the article is close to publishable
That gives you better cost control while still preserving strong final judgment.
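The gating logic itself is model-agnostic and fits in one deterministic function. A sketch, assuming each reviewer returns a score and a list of issues (that contract, the threshold, and the model pairing are all assumptions):

```python
from typing import Callable

Reviewer = Callable[[str], dict]  # returns {"score": float, "issues": list[str]}

def review_article(article: str,
                   cheap_review: Reviewer,      # e.g. a Sonnet-backed agent (assumption)
                   expensive_review: Reviewer,  # e.g. an Opus-backed agent (assumption)
                   publish_threshold: float = 0.8) -> dict:
    # Pass 1: cheap reviewer catches obvious problems without spending on Opus.
    first = cheap_review(article)
    if first["score"] < publish_threshold:
        return {"approved": False, "stage": "fast", "issues": first["issues"]}
    # Pass 2: expensive reviewer runs only when the draft is close to publishable.
    second = expensive_review(article)
    return {"approved": second["score"] >= publish_threshold,
            "stage": "deep", "issues": second["issues"]}
```

Most rejected drafts never reach pass 2, which is where the cost control comes from; the "stage" field also makes the escalation rate easy to chart in LangWatch.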
5. The highest-leverage improvement is “research memory”
Instead of starting each run from zero, I would store:
- previously used sources
- rejected claims
- accepted article angles
- tag/topic coverage gaps
Then the workflow becomes editorially smarter over time instead of merely faster.
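A first version of that memory does not need anything exotic; SQLite is enough to carry sources and rejected claims across runs. A sketch with an illustrative schema (table and method names are assumptions):

```python
import sqlite3

class ResearchMemory:
    # Minimal cross-run memory: which sources were used, which claims were rejected.
    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS sources (url TEXT PRIMARY KEY, topic TEXT)")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS rejected_claims (claim TEXT PRIMARY KEY, reason TEXT)")

    def remember_source(self, url: str, topic: str) -> None:
        self.db.execute("INSERT OR IGNORE INTO sources VALUES (?, ?)", (url, topic))

    def seen(self, url: str) -> bool:
        # Lets the research step skip or down-rank sources it has already mined.
        return self.db.execute(
            "SELECT 1 FROM sources WHERE url = ?", (url,)).fetchone() is not None

    def reject_claim(self, claim: str, reason: str) -> None:
        self.db.execute("INSERT OR IGNORE INTO rejected_claims VALUES (?, ?)",
                        (claim, reason))
```

Feed `seen()` results into the research prompt and a rejected claim only has to be caught by the reviewer once.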
I would pair that with a reliability memory:
- failing LangWatch traces
- recurring blocking issues
- scenario regressions by prompt version
- per-model failure signatures
That gives you both editorial memory and operational memory.
Recommended Real-World Setup
If I were building this for actual use, I would run it like this:
Frontend / Trigger
├─ small admin UI or cron trigger
└─ optional "generate article" button
Agent Runtime
└─ AgentOS serving one named workflow
Workflow
├─ research with Gemini 3 Flash
├─ package for Astro + PR with GLM-5
├─ final review with Claude 4.6 Sonnet
└─ escalate to Claude 4.6 Opus only for hard reviews
Persistence
├─ workflow sessions in SQLite for local dev
└─ Postgres in production
Publish Target
└─ content/articles/*.md in Astro repo
Delivery
├─ git branch
├─ commit
└─ GitHub PR
If you want one practical rule above all others, use this one:
Let models generate structured proposals. Let Python perform irreversible actions.
That single rule prevents a large class of reliability problems.
Final Takeaway
The winning idea here is not “use three fancy models.” It is:
deterministic workflow
+ typed step outputs
+ repo-aware packaging
+ critical review gate
+ git-native delivery
That is what turns an LLM demo into an autonomous publishing system.
Agno gives you the workflow abstraction. AgentOS gives you runtime, sessions, and scheduling. OpenRouter gives you model routing and provider flexibility. Astro gives you a beautifully simple destination format: a markdown file in content/articles/.
Put those together, and you can absolutely build a background app that researches, drafts, and opens article PRs automatically in minutes.
Sources
- Agno Workflows Overview
- Agno Step-Based Workflows
- Agno StepInput Reference
- Agno Workflow Sessions
- Agno AgentOS Reference
- Agno Background Hooks
- Agno Schedule Management Example
- Agno OpenRouter Integration
- Agno Structured Output for Agents
- LangWatch Integration Overview
- LangWatch Python Integration Guide
- LangWatch OpenTelemetry Guide
- LangWatch Observability Overview
- LangWatch Evaluations & Guardrails
- LangWatch Agent Simulations Introduction
- LangWatch Scenario Framework
- OpenRouter API Reference
- OpenRouter Provider Routing
- OpenRouter Tool Calling Models Collection
- OpenRouter Anthropic Model Catalog
- OpenRouter Z.ai Model Catalog
- Google Gemini OpenAI Compatibility Docs