Modern production deployments involve three distinct layers: infrastructure provisioning, application packaging, and continuous delivery. This article ties all three together using Terraform, Helm, and ArgoCD — from first principles to architectural-level insights — showing how they compose into a coherent GitOps pipeline for both a Node.js REST API and a Python agent built with Agno (a FastAPI-based AI agent framework).
Each section builds across five expertise levels so you can read at your own depth.
The GitOps Model: Why Three Tools?
Before diving in, understand what problem each tool solves:
| Tool | Problem it solves | Where it runs |
|---|---|---|
| Terraform | "What infrastructure exists?" | CI runner / developer machine |
| Helm | "How is this app configured for each environment?" | Git repository |
| ArgoCD | "Does the cluster match what Git says?" | Inside the cluster |
GitOps is the model that ties them together. Git is the single source of truth. Every change — a new image tag, a scaling adjustment, a config tweak — enters the system as a commit. An automated reconciler continuously compares what Git says should exist against what actually runs in the cluster, and corrects any drift.
The two deployment models in practice:
- Push-based: the CI pipeline runs `kubectl apply` after merge. CI needs cluster credentials.
- Pull-based: an in-cluster agent polls Git and applies changes. No external system needs cluster access.
ArgoCD uses the pull model. The agent inside the cluster does the work — which means your CI pipeline never needs a kubeconfig.
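The pull loop is conceptually tiny: compute a diff between what Git declares and what the cluster runs, then act on it. A minimal Python sketch of that reconciliation step (the `reconcile` function and the sample dicts are illustrative, not ArgoCD's actual data model):

```python
def reconcile(desired: dict, live: dict) -> dict:
    """Compute the actions needed to make `live` match `desired`."""
    actions = {}
    for name, spec in desired.items():
        if name not in live:
            actions[name] = "create"
        elif live[name] != spec:
            actions[name] = "update"
    for name in live:
        if name not in desired:
            actions[name] = "prune"  # only applied when pruning is enabled
    return actions

# Git says two Deployments should exist; the cluster has drifted.
desired = {"node-api": {"image": "acme/node-api:abc1234", "replicas": 3},
           "agno-agent": {"image": "acme/agno-agent:1.2.0", "replicas": 2}}
live = {"node-api": {"image": "acme/node-api:2.1.0", "replicas": 3},
        "old-worker": {"image": "acme/worker:0.9", "replicas": 1}}

print(reconcile(desired, live))
# {'node-api': 'update', 'agno-agent': 'create', 'old-worker': 'prune'}
```

Run on every poll, this loop converges the cluster toward Git no matter how the drift happened — a failed deploy, a manual edit, or a deleted resource.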
Data Flow: The Full Picture
┌─────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE LAYER (Terraform) │
│ │
│ Developer ──git push──► CI Runner │
│ │ │
│ terraform plan │
│ terraform apply │
│ │ │
│ ┌────────────────┼───────────────┐ │
│ ▼ ▼ ▼ │
│ VPC + EKS RDS (PG) IAM/IRSA │
│ │ │
│ outputs ─────────┼──────────────► │
│ (cluster_endpoint, role_arns, rds_endpoint) │
└───────────────────────────────┼─────────────────────────── ┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER (Helm) │
│ │
│ k8s-manifests repo (Git) │
│ ├── charts/ │
│ │ ├── node-api/ ← Helm chart │
│ │ │ ├── Chart.yaml │
│ │ │ ├── values.yaml ← defaults │
│ │ │ └── templates/ ← K8s manifest templates │
│ │ └── agno-agent/ ← Helm chart │
│ ├── environments/ │
│ │ ├── staging/values.yaml ← env overrides │
│ │ └── production/values.yaml │
│ └── apps/ │
│ ├── root.yaml ← ArgoCD bootstrap │
│ ├── node-api.yaml ← ArgoCD Application CRD │
│ └── agno-agent.yaml ← ArgoCD Application CRD │
└───────────────────────────────┬─────────────────────────── ┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ DELIVERY LAYER (ArgoCD, inside EKS) │
│ │
│ ArgoCD polls Git every 3m (or via webhook) │
│ │ │
│ ├─ helm template → renders manifests │
│ ├─ diff rendered vs live cluster state │
│ │ │
│ ├─ OutOfSync detected │
│ │ │ │
│ │ ├── PreSync hooks (db-migrate Job) │
│ │ ├── Apply Deployment (rolling update) │
│ │ └── PostSync hooks (smoke test) │
│ │ │
│ └─ Status: Synced / Healthy │
└─────────────────────────────────────────────────────────────┘
CI Pipeline Flow: Developer Push to Running Pod
Developer
│
│ git push origin main (Node.js API source repo)
▼
GitHub Actions
├─ 1. Run tests (npm test)
├─ 2. Authenticate to AWS via OIDC (no long-lived secrets)
├─ 3. docker build → push to ECR
│ tag: acme/node-api:abc1234 (git SHA)
└─ 4. Checkout k8s-manifests repo
yq -i ".image.tag = \"abc1234\"" environments/production/values.yaml
git commit + git push
k8s-manifests repo
│ (now has new image tag in production values)
▼
ArgoCD controller (in-cluster, polling every 3 min)
│ Detects: image.tag 2.1.0 → abc1234 → OutOfSync
│
├─ Renders Helm chart with new values
├─ Runs PreSync hook: db-migrate Job (waits for completion)
├─ kubectl apply Deployment (RollingUpdate: maxUnavailable=0)
│ K8s pulls acme/node-api:abc1234 from ECR
│ New pods → readinessProbe /health → 200 OK → Ready
│ Old pods → Terminating
└─ PostSync hooks: smoke test, Slack notification
Cluster
└─ 3 running pods of node-api:abc1234
behind Service → Ingress → ALB → Internet
1. Terraform: Infrastructure as Code
Basic — What is it?
Terraform lets you describe your cloud infrastructure using code instead of clicking through UIs. You write files that say “I want a Kubernetes cluster with 3 worker nodes and a PostgreSQL database” and Terraform makes it happen. If you run it again, it only changes what’s different — it’s idempotent.
Think of it as a blueprint: the blueprint doesn’t change every time you look at it, but you can version it, review it, and roll it back.
Three things Terraform manages:
- Providers: plugins that talk to cloud APIs (AWS, GCP, Azure)
- Resources: individual cloud objects (VPC, EC2, RDS, S3 bucket)
- State: a JSON file recording what Terraform created last time
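State is worth seeing concretely. A sketch that parses a trimmed-down `terraform.tfstate` (the real version-4 schema has many more fields) and lists resource addresses — the same addresses `terraform state list` prints:

```python
import json

# A trimmed-down terraform.tfstate (illustrative shape, not the full schema).
state_json = """
{
  "version": 4,
  "terraform_version": "1.7.0",
  "resources": [
    {"type": "aws_db_instance", "name": "main",
     "instances": [{"attributes": {"id": "acme-prod"}}]},
    {"type": "aws_s3_bucket", "name": "logs",
     "instances": [{"attributes": {"id": "acme-logs"}}]}
  ]
}
"""
state = json.loads(state_json)
addresses = [f'{r["type"]}.{r["name"]}' for r in state["resources"]]
print(addresses)  # ['aws_db_instance.main', 'aws_s3_bucket.logs']
```

This file is why `apply` is idempotent: Terraform diffs your configuration against these recorded attributes and only acts on the difference.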
Medium — Core Structure and Commands
infra/
├── main.tf # Provider config and resource declarations
├── variables.tf # Input variable declarations with types and validation
├── outputs.tf # Values that other configs can consume
├── terraform.tfvars # Actual variable values (gitignore secrets!)
└── modules/
├── vpc/ # Reusable VPC component
├── eks/ # Reusable EKS component
└── rds/ # Reusable RDS component
Essential commands:
terraform init # Download providers and modules
terraform validate # Check HCL syntax (no API calls)
terraform plan # Preview changes — always review before applying
terraform apply # Execute the plan
terraform output # Print output values
terraform workspace new staging # Create named state slice
terraform workspace select prod # Switch environment context
terraform destroy # Tear down (never run in prod without deliberation)
Advanced — Remote State, Modules, and IRSA
Never use local state in production. Store state in S3 with DynamoDB locking so multiple engineers work safely:
# main.tf
terraform {
required_version = ">= 1.7"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.40"
}
}
backend "s3" {
bucket = "acme-terraform-state"
key = "production/eks/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-lock"
}
}
provider "aws" {
region = var.region
default_tags {
tags = {
ManagedBy = "terraform"
Environment = var.environment
Project = "acme-api"
}
}
}
Variables with validation prevent typos from reaching terraform apply:
# variables.tf
variable "environment" {
description = "Deployment environment"
type = string
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be dev, staging, or prod."
}
}
variable "db_password" {
description = "Master password for RDS"
type = string
sensitive = true # never printed in plan output
}
VPC + EKS using community modules (handles 200+ lines of boilerplate):
locals {
cluster_name = "acme-${var.environment}"
}
data "aws_availability_zones" "available" { state = "available" }
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 5.5"
name = "${local.cluster_name}-vpc"
cidr = "10.0.0.0/16"
azs = slice(data.aws_availability_zones.available.names, 0, 3)
private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
enable_nat_gateway = true
single_nat_gateway = false # one NAT per AZ for HA
enable_dns_hostnames = true
# Required tags for EKS subnet auto-discovery
private_subnet_tags = {
"kubernetes.io/cluster/${local.cluster_name}" = "shared"
"kubernetes.io/role/internal-elb" = "1"
}
public_subnet_tags = {
"kubernetes.io/cluster/${local.cluster_name}" = "shared"
"kubernetes.io/role/elb" = "1"
}
}
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.8"
cluster_name = local.cluster_name
cluster_version = "1.30"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
enable_irsa = true # pods assume IAM roles without node-level credentials
eks_managed_node_groups = {
general = {
instance_types = var.node_instance_types
min_size = local.node_min
max_size = local.node_max
desired_size = 3
disk_size = 50
}
}
}
IRSA (IAM Roles for Service Accounts): pods assume IAM roles directly — no credentials in env vars or Secrets:
module "node_api_irsa" {
source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
version = "~> 5.39"
role_name = "${local.cluster_name}-node-api"
oidc_providers = {
main = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["apps:node-api"] # namespace:serviceaccount
}
}
role_policy_arns = {
s3 = aws_iam_policy.node_api_s3.arn
}
}
Expert — Workspace Strategy, Drift Detection, and State Operations
Workspaces create separate state slices within one configuration:
locals {
is_prod = terraform.workspace == "prod"
node_min = local.is_prod ? 3 : 1
node_max = local.is_prod ? 20 : 5
db_class = local.is_prod ? "db.r6g.large" : "db.t3.micro"
}
terraform workspace new staging
terraform workspace select staging
terraform apply -var-file=environments/staging.tfvars
Expert pitfall: workspaces share the same provider credentials. If production needs a separate AWS account (recommended for blast-radius isolation), use separate backend configs with separate state files, not just workspaces.
Importing existing resources into state (when you have manually created infra):
# Import an existing RDS instance into state without recreating it
terraform import aws_db_instance.main acme-prod
State surgery when things go wrong:
terraform state list # list all managed resources
terraform state show aws_db_instance.main # inspect a single resource
terraform state mv aws_s3_bucket.old aws_s3_bucket.new # rename without destroy
terraform state rm aws_db_instance.main # stop tracking without deleting
Legendary — The Philosophical Design Decision
Terraform’s biggest architectural bet is convergent reconciliation over imperative scripting. Rather than writing shell scripts that say “create X, then create Y, then configure Z,” you declare the target state and let the tool compute the shortest path to it.
This creates an immutable audit trail: terraform plan is a diff between desired and actual state. Every terraform apply is a committed state transition, not a side-effecting procedure. Combine this with remote state and you have a distributed coordination system — multiple engineers can propose plans simultaneously; only one can apply at a time (DynamoDB lock).
The implication for scaling teams: as your infrastructure grows, the right investment is not more HCL but better module composition. Each module should be independently testable (using terratest), versioned separately, and published to a private registry. The root configuration then composes modules the same way application code composes libraries — with version pinning, not copy-paste.
The known scaling ceiling: Terraform’s refresh phase reads every managed resource from cloud APIs on every plan. At 1,000+ resources in a single state file, this becomes a bottleneck (3–5 minute plans). The solution is state file partitioning by domain (networking, compute, databases, IAM) with cross-state data sources.
2. Helm: Packaging the Application
Basic — What is it?
Helm is a package manager for Kubernetes — like npm for Node.js or pip for Python. Instead of maintaining raw YAML files for every environment, you write templates with variables. Helm renders the templates with the right values for each environment and applies everything to the cluster.
A Helm chart is a folder with:
- `Chart.yaml` — metadata (name, version)
- `values.yaml` — default configuration
- `templates/` — Kubernetes manifests with `{{ .Values.something }}` placeholders
A Helm release is one deployed instance of a chart. You can have node-api-staging and node-api-prod from the same chart with different values.
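As a rough mental model, rendering is a values merge plus substitution. A toy Python analogy (Helm really uses Go `text/template` with far richer semantics; `merge` here is shallow for brevity):

```python
def merge(defaults: dict, overrides: dict) -> dict:
    """Environment values override chart defaults (shallow merge for brevity)."""
    return {**defaults, **overrides}

def render(template: str, values: dict) -> str:
    """Substitute {{ .Values.key }} placeholders with merged values."""
    out = template
    for key, val in values.items():
        out = out.replace("{{ .Values.%s }}" % key, str(val))
    return out

defaults = {"replicas": 3, "tag": "2.1.0"}   # chart's values.yaml
staging  = {"replicas": 1}                    # environments/staging/values.yaml

manifest = "replicas: {{ .Values.replicas }}\nimage: node-api:{{ .Values.tag }}"
print(render(manifest, merge(defaults, staging)))
# replicas: 1
# image: node-api:2.1.0
```

Same template, different values, different release — that is the entire mechanism behind `node-api-staging` and `node-api-prod` coexisting.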
Medium — Chart Structure and Core Templates
node-api/
├── Chart.yaml
├── values.yaml
├── .helmignore
└── templates/
├── _helpers.tpl # Named templates (reusable partials)
├── deployment.yaml
├── service.yaml
├── ingress.yaml
├── hpa.yaml # Horizontal Pod Autoscaler
├── serviceaccount.yaml
├── configmap.yaml
└── hooks/
└── db-migrate.yaml # Pre-upgrade migration Job
# Chart.yaml
apiVersion: v2
name: node-api
description: ACME Node.js REST API
type: application
version: 0.5.0 # chart version — bump when chart structure changes
appVersion: "2.1.0" # application version — CI sets this from image tag
# values.yaml — defaults for all environments
replicaCount: 3
image:
repository: 123456789.dkr.ecr.us-east-1.amazonaws.com/acme/node-api
tag: "2.1.0"
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 80
targetPort: 3000
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "1000m"
memory: "512Mi"
autoscaling:
enabled: true
minReplicas: 3
maxReplicas: 20
targetCPUUtilizationPercentage: 70
serviceAccount:
create: true
annotations:
eks.amazonaws.com/role-arn: "" # set per environment via IRSA
env:
NODE_ENV: production
PORT: "3000"
LOG_LEVEL: info
envSecret:
DB_URL: "" # injected from a Kubernetes Secret
probes:
readiness:
path: /health
initialDelaySeconds: 10
periodSeconds: 5
liveness:
path: /health
initialDelaySeconds: 30
periodSeconds: 10
Key install/upgrade commands:
# Install (first time)
helm install node-api ./charts/node-api \
--namespace apps \
--create-namespace \
--values environments/production/values.yaml \
--set image.tag=2.1.0
# Upgrade (idempotent in CI — use --install flag)
helm upgrade --install node-api ./charts/node-api \
--namespace apps \
--values environments/production/values.yaml \
--set image.tag=2.2.0 \
--wait \
--timeout 5m
# Rollback to the previous revision
helm rollback node-api 0 # 0 = previous revision
# View full release history
helm history node-api -n apps
Advanced — Templates, Helpers, and Hooks
_helpers.tpl defines named templates that are reused across all manifests:
{{/* Chart name — honors nameOverride; referenced by the label helpers below */}}
{{- define "node-api.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Fully qualified name (Release.Name-Chart.Name, max 63 chars).
*/}}
{{- define "node-api.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{/* Common labels for all resources */}}
{{- define "node-api.labels" -}}
helm.sh/chart: {{ printf "%s-%s" .Chart.Name .Chart.Version }}
app.kubernetes.io/name: {{ include "node-api.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/version: {{ .Values.image.tag | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
{{/* Selector labels — used by Service and Deployment */}}
{{- define "node-api.selectorLabels" -}}
app.kubernetes.io/name: {{ include "node-api.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
The Deployment template uses include + nindent for proper indentation, and the checksum/config annotation triggers a rollout whenever a ConfigMap changes:
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ include "node-api.fullname" . }}
labels:
{{- include "node-api.labels" . | nindent 4 }}
annotations:
# Force a rollout when ConfigMap content changes
checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
spec:
{{- if not .Values.autoscaling.enabled }}
replicas: {{ .Values.replicaCount }}
{{- end }}
selector:
matchLabels:
{{- include "node-api.selectorLabels" . | nindent 6 }}
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # zero-downtime: full capacity before removing old pods
template:
metadata:
labels:
{{- include "node-api.selectorLabels" . | nindent 8 }}
spec:
serviceAccountName: {{ include "node-api.fullname" . }}
containers:
- name: {{ .Chart.Name }}
image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}
ports:
- name: http
containerPort: {{ .Values.service.targetPort }}
protocol: TCP
env:
{{- range $key, $val := .Values.env }}
- name: {{ $key }}
value: {{ $val | quote }}
{{- end }}
{{- range $key, $val := .Values.envSecret }}
- name: {{ $key }}
valueFrom:
secretKeyRef:
name: {{ include "node-api.fullname" $ }}-secret
key: {{ $key }}
{{- end }}
readinessProbe:
httpGet:
path: {{ .Values.probes.readiness.path }}
port: http
initialDelaySeconds: {{ .Values.probes.readiness.initialDelaySeconds }}
periodSeconds: {{ .Values.probes.readiness.periodSeconds }}
failureThreshold: 3
livenessProbe:
httpGet:
path: {{ .Values.probes.liveness.path }}
port: http
initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }}
periodSeconds: {{ .Values.probes.liveness.periodSeconds }}
failureThreshold: 3
resources:
{{- toYaml .Values.resources | nindent 12 }}
Pre-upgrade database migration hook: runs a Job before new pods start, blocks the rollout until migrations succeed:
# templates/hooks/db-migrate.yaml
{{- if .Values.migration.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
name: {{ include "node-api.fullname" . }}-migrate-{{ .Release.Revision }}
annotations:
"helm.sh/hook": pre-upgrade,pre-install
"helm.sh/hook-weight": "-5"
"helm.sh/hook-delete-policy": hook-succeeded
spec:
backoffLimit: 3
template:
spec:
restartPolicy: Never
containers:
- name: migrate
image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
command: ["node", "dist/db/migrate.js"]
env:
- name: DB_URL
valueFrom:
secretKeyRef:
name: {{ include "node-api.fullname" . }}-secret
key: DB_URL
{{- end }}
Expert — Helm Template Rendering Flow
helm upgrade node-api ./charts/node-api --values prod.yaml
│
▼
Load Chart.yaml metadata
│
▼
Merge values (precedence: --set > --values files > values.yaml)
│
▼
Render each template in templates/ via Go text/template
├─ Evaluate {{ define }} blocks in _helpers.tpl
├─ Resolve {{ include "..." . }} calls
├─ Apply | pipeline functions (nindent, quote, sha256sum, toYaml)
└─ Conditional blocks: {{- if .Values.autoscaling.enabled }}
│
▼
Validate rendered YAML against Kubernetes OpenAPI schema
│
▼
Execute hooks (pre-upgrade Jobs) — wait for completion
│
▼
kubectl apply (server-side apply) the rendered manifests
│
▼
Wait for Deployment rollout if --wait flag set
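The values-merge step deep-merges nested maps, with later sources winning key by key. A sketch of that precedence (`--set` over `--values` files over chart defaults), simplified — Helm's real merge also handles null-deletion:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; override wins on conflicts."""
    out = dict(base)
    for k, v in override.items():
        if isinstance(v, dict) and isinstance(out.get(k), dict):
            out[k] = deep_merge(out[k], v)   # descend into nested maps
        else:
            out[k] = v                        # scalar or type change: replace
    return out

chart_defaults = {"image": {"repository": "acme/node-api", "tag": "2.1.0"},
                  "replicaCount": 3}
values_file    = {"replicaCount": 5}                # --values prod.yaml
set_flags      = {"image": {"tag": "abc1234"}}      # --set image.tag=abc1234

final = deep_merge(deep_merge(chart_defaults, values_file), set_flags)
print(final)
# {'image': {'repository': 'acme/node-api', 'tag': 'abc1234'}, 'replicaCount': 5}
```

Note that `image.repository` survives the `--set image.tag=...` override — nested keys merge rather than replace, which is exactly why CI can bump a single tag without restating the whole image block.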
Expert pitfalls:
- Never call `randAlphaNum` or `randBytes` in templates — they re-evaluate on every upgrade, causing spurious Secret updates on every deploy.
- Never use `latest` as an image tag — it breaks reproducibility and prevents rollback.
- Run `helm lint` + `helm template` in CI to catch rendering errors before they reach the cluster.
Legendary — Helm as a Configuration Interface Contract
The deepest insight about Helm is that values.yaml is not configuration — it is a public API. Every key in values.yaml is a contract between the chart author and the chart consumer. Breaking changes (removing a key, changing its type) require a semver major bump of Chart.version.
This becomes critical at scale: if your platform team maintains a chart used by 20 application teams, each team’s CI pipeline sets --set image.tag=... and --values team-overrides.yaml. Any breaking change to the chart’s values schema must be versioned, communicated, and migrated — exactly like breaking changes in an npm package.
The advanced pattern is library charts: a type: library chart that contains only _helpers.tpl definitions, published separately to a Helm registry. Application charts depend on it (dependencies in Chart.yaml), enabling shared template logic across all your services. This is the Helm equivalent of a design system — one source of truth for labels, probes, resource defaults, and security contexts.
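Helm can enforce part of this contract mechanically: a `values.schema.json` beside `values.yaml` is validated on install, upgrade, and lint. The idea, hand-rolled in Python to stay dependency-free (the `validate` helper is illustrative, not Helm's implementation):

```python
def validate(values: dict, schema: dict) -> list[str]:
    """Return a list of contract violations (empty list = valid)."""
    errors = []
    for key, expected in schema.items():
        if key not in values:
            errors.append(f"missing required key: {key}")
        elif not isinstance(values[key], expected):
            errors.append(f"{key}: expected {expected.__name__}, "
                          f"got {type(values[key]).__name__}")
    return errors

# The chart's public API: required keys and their types.
schema = {"replicaCount": int, "image": dict}

print(validate({"replicaCount": 3, "image": {"tag": "2.1.0"}}, schema))  # []
print(validate({"replicaCount": "three"}, schema))
# ['replicaCount: expected int, got str', 'missing required key: image']
```

Tightening this schema is itself a breaking change — which is the point: the contract becomes visible, versionable, and enforceable in every consumer's CI.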
3. ArgoCD: Continuous GitOps Delivery
Basic — What is it?
ArgoCD watches a Git repository and keeps your Kubernetes cluster synchronized with it. You tell ArgoCD “this folder in this Git repo is the desired state of namespace X in cluster Y.” ArgoCD checks every few minutes. If anything drifts — someone manually edits a Deployment, a new config is pushed — ArgoCD corrects it.
The key concept: you never kubectl apply in production. Git is the only way to change production. ArgoCD is the enforcer.
Medium — The Application CRD
The core ArgoCD primitive is the Application custom resource. It’s the link between Git and the cluster:
# apps/node-api.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: node-api
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io # cascade-delete on removal
spec:
project: acme
source:
repoURL: https://github.com/acme-org/k8s-manifests.git
targetRevision: main
path: charts/node-api
helm:
releaseName: node-api
valueFiles:
- values.yaml
- environments/production/values.yaml
parameters:
- name: image.tag
value: "2.1.0" # CI updates this on every build
destination:
server: https://kubernetes.default.svc
namespace: apps
syncPolicy:
automated:
prune: true # delete resources removed from Git
selfHeal: true # revert manual cluster edits
allowEmpty: false # never sync to empty state
syncOptions:
- CreateNamespace=true
- ServerSideApply=true
retry:
limit: 5
backoff:
duration: 5s
factor: 2
maxDuration: 3m
revisionHistoryLimit: 10
Installing ArgoCD:
kubectl create namespace argocd
kubectl apply -n argocd --server-side \
-f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# Get initial admin password
argocd admin initial-password -n argocd
# Access the UI locally
kubectl port-forward svc/argocd-server -n argocd 8080:443
argocd login localhost:8080
Advanced — App of Apps, Projects, and Sync Waves
AppProject scopes what repos, clusters, and namespaces a team’s apps can touch:
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
name: acme
namespace: argocd
spec:
description: ACME production applications
sourceRepos:
- https://github.com/acme-org/k8s-manifests.git
destinations:
- server: https://kubernetes.default.svc
namespace: apps
- server: https://kubernetes.default.svc
namespace: agents
clusterResourceWhitelist:
- group: ""
kind: Namespace
roles:
- name: developer
policies:
- p, proj:acme:developer, applications, get, acme/*, allow
- p, proj:acme:developer, applications, sync, acme/*, allow
groups:
- acme-developers # SSO group mapping
App of Apps bootstraps an entire environment from Git. One root Application manages all other Application manifests:
k8s-manifests/
└── apps/
├── root.yaml ← applied once manually to bootstrap
├── node-api.yaml
├── agno-agent.yaml
├── monitoring.yaml
└── ingress-nginx.yaml
# apps/root.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: root-apps
namespace: argocd
spec:
project: acme
source:
repoURL: https://github.com/acme-org/k8s-manifests.git
targetRevision: main
path: apps # directory of Application manifests
destination:
server: https://kubernetes.default.svc
namespace: argocd
syncPolicy:
automated:
prune: true
selfHeal: true
After kubectl apply -f apps/root.yaml once, ArgoCD manages itself and every child Application from Git. Adding a new service = adding one YAML file and committing.
Sync Waves enforce ordering within a single sync:
# Wave -1: deploy databases first
metadata:
annotations:
argocd.argoproj.io/sync-wave: "-1"
# Wave 0 + PreSync hook: run migrations after DB is ready
metadata:
annotations:
argocd.argoproj.io/hook: PreSync
argocd.argoproj.io/sync-wave: "0"
argocd.argoproj.io/hook-delete-policy: HookSucceeded
# Wave 1: deploy the API after migrations complete
metadata:
annotations:
argocd.argoproj.io/sync-wave: "1"
Custom health check in Lua (for CRDs without built-in health):
# argocd-cm ConfigMap
data:
resource.customizations.health.batch_CronJob: |
hs = {}
if obj.status ~= nil then
if obj.status.lastScheduleTime ~= nil then
hs.status = "Healthy"
hs.message = "Last scheduled: " .. obj.status.lastScheduleTime
return hs
end
end
hs.status = "Progressing"
hs.message = "Waiting for first schedule"
return hs
Expert — Reconciliation Loop Internals
ArgoCD Application Controller (reconcile loop, every 3 minutes or webhook-triggered)
│
├─ 1. DESIRED STATE: git clone / sparse-checkout the target path
│ run: helm template <releaseName> <path> --values ...
│ output: list of Kubernetes manifests (JSON)
│
├─ 2. LIVE STATE: kubectl get all resources in destination namespace
│ output: current cluster state (JSON)
│
├─ 3. DIFF: three-way merge
│ base: last-applied annotation on live objects
│ desired: rendered manifests from step 1
│ live: current cluster state from step 2
│ result: list of changes (add / modify / delete)
│
├─ 4. if diff is empty → Application is Synced, skip
│
└─ 5. if diff exists → Application is OutOfSync
├─ if syncPolicy.automated → trigger sync
└─ if manual → wait for user action
│
▼
Execute sync:
├─ PreSync hooks (Jobs) → wait for completion
├─ Apply wave -1 resources → wait for health
├─ Apply wave 0 resources → wait for health
├─ Apply wave 1 resources → wait for health
└─ PostSync hooks
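The three-way part matters because other controllers (an HPA, an admission webhook) legitimately mutate live objects; comparing only desired vs live would flag those fields as drift forever. The last-applied base tells the differ which fields this tool owns. A simplified sketch of that idea (flat dicts standing in for full manifests):

```python
def three_way_diff(base: dict, desired: dict, live: dict) -> dict:
    """Diff only fields previously applied (base) or newly desired;
    fields owned by other controllers are ignored."""
    changes = {}
    for field in set(base) | set(desired):
        if desired.get(field) != live.get(field):
            changes[field] = (live.get(field), desired.get(field))
    return changes

base    = {"image": "node-api:2.1.0"}                  # last-applied annotation
desired = {"image": "node-api:abc1234"}                # rendered from Git
live    = {"image": "node-api:2.1.0", "replicas": 7}   # HPA scaled to 7

print(three_way_diff(base, desired, live))
# {'image': ('node-api:2.1.0', 'node-api:abc1234')}
```

The HPA-owned `replicas` field never appears in the diff, so autoscaling and GitOps coexist without fighting each other.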
Expert pitfall: prune: true will delete Kubernetes resources that are no longer in Git — including ConfigMaps you added manually, PVCs, Secrets managed outside ArgoCD. Test pruning behavior in staging with --dry-run before enabling in production.
Legendary — The Philosophical Shift: Operational Events as Commits
ArgoCD’s deepest implication is organizational, not technical. When every production change is a Git commit, on-call incidents change shape. “What changed at 2am?” is a git log command, not an Ops investigation. “Roll back the broken deploy” is git revert + wait 3 minutes, not a Helm rollback command with stale state.
The scaling challenge: GitOps creates pressure to separate the application repo from the manifests repo. If you store Helm values in the same repo as Node.js source code, every git push triggers both a code build and an ArgoCD sync evaluation. At scale you want: application repo for source changes (triggers CI, produces images), manifests repo for deployment changes (ArgoCD watches this). CI’s only job is to update the image tag in the manifests repo. This architecture is called push-to-deploy via image updater and can be partially automated with the Argo CD Image Updater project.
4. Deploying an Agno Agent
Basic — What is Agno?
Agno is a Python framework for building AI agents. You define an agent (what model it uses, what tools it has, how it stores memory), and Agno wraps it in a FastAPI application — a full HTTP API server with streaming, session management, and built-in /health endpoints.
From a deployment perspective, an Agno agent is just a FastAPI app. The same Helm + ArgoCD workflow applies. The differences are: it needs an LLM API key as a Secret, it uses more memory (LLM inference buffers), and it needs a persistent database for session state — not a local SQLite file.
Medium — Agno App and Dockerfile
# agent.py
import os

from agno.agent import Agent
from agno.models.anthropic import Claude
from agno.storage.postgres import PostgresStorage
from agno.run.fastapi import AgentOS
from fastapi import FastAPI
agent = Agent(
model=Claude(id="claude-sonnet-4-5"),
description="ACME support agent",
instructions=["Always be concise", "Use bullet points"],
storage=PostgresStorage(
table_name="agent_sessions",
db_url=os.environ["DATABASE_URL"],
),
add_history_to_messages=True,
num_history_responses=5,
)
app: FastAPI = AgentOS(agents=[agent]).get_app()
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
# System deps for PostgreSQL driver
RUN apt-get update && apt-get install -y \
libpq-dev gcc \
&& rm -rf /var/lib/apt/lists/*
# Layer caching: install deps before copying source
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY agent.py .
# Non-root user for security
RUN useradd --create-home appuser
USER appuser
EXPOSE 8000
CMD ["fastapi", "run", "agent.py", "--host", "0.0.0.0", "--port", "8000"]
# requirements.txt
agno[os]>=1.0.0
anthropic>=0.30.0
fastapi>=0.111.0
uvicorn[standard]>=0.29.0
psycopg2-binary>=2.9.9
Advanced — Helm Values and ArgoCD Application for Agno
The Agno chart is structurally identical to the Node.js chart. Key differences in values.yaml:
# charts/agno-agent/values.yaml
replicaCount: 2
image:
repository: 123456789.dkr.ecr.us-east-1.amazonaws.com/acme/agno-agent
tag: "1.0.0"
pullPolicy: IfNotPresent
service:
type: ClusterIP
port: 80
targetPort: 8000
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "2000m"
memory: "2Gi" # LLM inference buffers are large
envSecrets:
ANTHROPIC_API_KEY: "" # never in Git — injected from K8s Secret
DATABASE_URL: "" # postgres://user:pass@rds-endpoint:5432/acme
probes:
readiness:
path: /health
initialDelaySeconds: 15 # model clients need time to initialize
periodSeconds: 10
liveness:
path: /health
initialDelaySeconds: 60 # longer delay — LLM client startup is slow
periodSeconds: 30
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 10
targetCPUUtilizationPercentage: 60 # scale earlier; LLM calls are bursty
persistence:
enabled: false # use PostgreSQL, not local SQLite
ArgoCD Application (keep AI workloads in a dedicated namespace):
# apps/agno-agent.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: agno-agent
namespace: argocd
finalizers:
- resources-finalizer.argocd.argoproj.io
spec:
project: acme
source:
repoURL: https://github.com/acme-org/k8s-manifests.git
targetRevision: main
path: charts/agno-agent
helm:
releaseName: agno-agent
valueFiles:
- values.yaml
- environments/production/agno-values.yaml
destination:
server: https://kubernetes.default.svc
namespace: agents # dedicated namespace for AI workloads
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
- ServerSideApply=true
retry:
limit: 3
backoff:
duration: 10s
factor: 2
maxDuration: 2m
Production value overrides:
# environments/production/agno-values.yaml
replicaCount: 3
image:
tag: "1.2.0"
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/acme-prod-agno-agent"
resources:
requests:
cpu: "1000m"
memory: "1Gi"
limits:
cpu: "4000m"
memory: "4Gi"
autoscaling:
maxReplicas: 20
env:
DATABASE_POOL_SIZE: "10"
Expert — Secret Management for LLM Keys
The ANTHROPIC_API_KEY must never appear in Git. The production pattern is External Secrets Operator (ESO) pulling from AWS Secrets Manager:
# ExternalSecret — creates a K8s Secret from AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: agno-agent-secrets
namespace: agents
spec:
refreshInterval: 1h
secretStoreRef:
name: aws-secrets-manager
kind: ClusterSecretStore
target:
name: agno-agent-secret # creates this K8s Secret
creationPolicy: Owner
data:
- secretKey: ANTHROPIC_API_KEY
remoteRef:
key: acme/prod/agno-agent
property: anthropic_api_key
- secretKey: DATABASE_URL
remoteRef:
key: acme/prod/agno-agent
property: database_url
This ExternalSecret manifest lives in the manifests repo and is applied by ArgoCD. The actual secret values live only in AWS Secrets Manager — never in Git.
Legendary — Agno Architecture Implications at Scale
Agno’s stateless FastAPI design (agent logic + HTTP layer, state in PostgreSQL) maps cleanly to Kubernetes horizontal scaling. But LLM-serving workloads break HPA assumptions: CPU utilization during an LLM call is low (the work happens in the API provider), but latency is high and concurrency is limited by rate limits, not compute.
This means CPU-based HPA will under-scale. The advanced pattern is custom metrics HPA using in-flight request count or queue depth:
# HPA based on in-flight requests (via KEDA or custom metrics adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
metrics:
- type: External
external:
metric:
name: agno_agent_in_flight_requests
selector:
matchLabels:
app: agno-agent
target:
type: AverageValue
averageValue: "5" # scale when each pod handles more than 5 concurrent requests
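For AverageValue metrics, the HPA arithmetic reduces to ceil(total metric / per-pod target), clamped to the replica bounds. A sketch of that math for the in-flight-requests example (`desired_replicas` is illustrative — the real controller also applies stabilization windows and tolerance):

```python
from math import ceil

def desired_replicas(total_in_flight: int, target_avg: int,
                     min_replicas: int, max_replicas: int) -> int:
    """HPA AverageValue scaling: ceil(total metric / per-pod target), clamped."""
    want = ceil(total_in_flight / target_avg)
    return max(min_replicas, min(max_replicas, want))

# 2 pods each holding 9 concurrent LLM calls, target averageValue = 5:
print(desired_replicas(total_in_flight=18, target_avg=5,
                       min_replicas=2, max_replicas=20))  # 4
```

Because the metric is concurrency rather than CPU, the deployment scales with provider latency and rate limits — the actual bottlenecks for LLM-backed workloads.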
For multi-agent architectures (Agno supports orchestrating multiple specialized agents), each agent type should have its own Deployment with independent HPA targets tuned to its workload pattern. A research-agent that makes many sequential LLM calls has different scaling behavior than a summary-agent that makes one call per request.
5. CI Pipeline: Tying It All Together
```yaml
# .github/workflows/deploy.yml
name: Build and Deploy

on:
  push:
    branches: [main]
    paths: ["src/**", "package*.json", "Dockerfile"]

env:
  AWS_REGION: us-east-1
  ECR_REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com
  IMAGE_NAME: acme/node-api

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for OIDC auth to AWS (no long-lived keys)
      contents: read
    outputs:
      image_tag: ${{ steps.meta.outputs.version }}
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-ecr-push
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Docker metadata (generates reproducible image tags)
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.ECR_REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: type=sha,prefix=,format=short

      - name: Build and push (with layer caching)
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  update-manifests:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Checkout k8s-manifests repo
        uses: actions/checkout@v4
        with:
          repository: acme-org/k8s-manifests
          token: ${{ secrets.MANIFESTS_REPO_TOKEN }}

      - name: Update image tag with yq
        run: |
          yq -i ".image.tag = \"${{ needs.build-and-push.outputs.image_tag }}\"" \
            environments/production/values.yaml

      - name: Commit and push
        run: |
          git config user.name "GitHub Actions"
          git config user.email "actions@github.com"
          git add environments/production/values.yaml
          git commit -m "chore(node-api): bump image to ${{ needs.build-and-push.outputs.image_tag }}"
          git push
```
ArgoCD detects the manifest commit within its default three-minute polling interval and initiates a sync — no kubectl, no direct cluster access from CI.
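For context, the ArgoCD Application watching the path the CI job commits to might look like the following sketch (repo URL, project, and namespace names are illustrative):

```yaml
# Illustrative Application pointing at the path update-manifests commits to
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: node-api-production
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/acme-org/k8s-manifests
    path: environments/production
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: node-api
  syncPolicy:
    automated:
      selfHeal: true   # revert manual cluster edits
      prune: true      # delete resources removed from Git
```

To avoid waiting on the poll interval entirely, a Git webhook pointed at the ArgoCD API server triggers an immediate refresh on push.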
6. Production Checklist
Terraform
- Use remote state with DynamoDB locking — never local state in production
- Mark secrets `sensitive = true` — keeps them out of plan logs
- Pin providers with `~>` (minor-compatible), not unpinned
- Run `terraform plan` in CI on every PR; apply only on merge to main
- Tag every resource for cost allocation and access control
- Use separate AWS accounts per environment, not just workspaces, for blast-radius isolation
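The first checklist item translates to a backend block roughly like this (bucket and table names are examples):

```hcl
# Remote state with locking — a minimal sketch; names are illustrative
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "prod/eks/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"   # lock table prevents concurrent applies
    encrypt        = true
  }
}
```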
Helm
- Never store secrets in `values.yaml` committed to Git — use External Secrets Operator
- Never use `latest` as image tag — it breaks reproducibility and rollback
- Use `--wait` on `helm upgrade` so CI fails if pods do not become ready
- Bump `Chart.version` when chart structure changes; `appVersion` when app changes
- Use `helm lint` + `helm template` in CI to catch rendering errors before cluster apply
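The last two Helm items can be wired into CI as a single validation step, sketched here as a GitHub Actions fragment (chart path and values file are assumptions):

```yaml
# Illustrative CI step — lint and render the chart before anything reaches the cluster
- name: Validate chart
  run: |
    helm lint charts/node-api
    helm template node-api charts/node-api \
      -f environments/production/values.yaml > /dev/null   # fails on rendering errors
```

This catches template errors in seconds, on the PR, instead of at sync time in the cluster.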
ArgoCD
- Enable `selfHeal: true` to prevent drift from manual edits
- Enable `prune: true` carefully — test in staging first; it deletes resources removed from Git
- Use App of Apps from day one — retrofitting it later is painful
- Keep ArgoCD Application manifests in a separate `k8s-manifests` repo
- Use sync waves for ordering (CRDs before controllers, databases before APIs, migrations before new app versions)
- Use `AppProject` to scope team access — never run production in the `default` project
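Sync waves are plain annotations: ArgoCD applies lower waves first and waits for them to be healthy before moving on. A sketch of the migrations-before-app ordering (resource names are illustrative):

```yaml
# Wave -1 syncs and must be healthy before wave 0 starts
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # run migrations first
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: node-api
  annotations:
    argocd.argoproj.io/sync-wave: "0"    # then roll the app
```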
Agno Agents
- Use PostgreSQL storage in production, not SQLite — SQLite on a PVC does not survive pod rescheduling reliably
- Inject `ANTHROPIC_API_KEY` via External Secrets Operator from AWS Secrets Manager — never in Git
- Set HPA targets at 60% CPU (lower than typical) — LLM workloads are bursty
- Give liveness probes a 60s `initialDelaySeconds` — model clients take time to initialize connections
- Use a dedicated `agents` namespace — keeps LLM workloads isolated from your API workloads for resource management and RBAC
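The probe item, sketched as a container spec excerpt; the `/health` path and port 8000 are assumptions about how the FastAPI app is exposed:

```yaml
# Probe sketch for the agent container — path and port are assumptions
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60   # allow model clients to finish initializing
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10   # gate traffic sooner than the liveness check kills
```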
Follow-Up Questions to Go Deeper
On Terraform:
1. How do you test Terraform modules in isolation using `terratest` before merging?
2. When should you break a monolithic state file into multiple smaller state files? What does the cross-state data source pattern look like?
3. How do you manage secrets (RDS passwords, API keys) in Terraform without storing them in `.tfvars` files? Compare Vault, AWS Secrets Manager, and SOPS.
4. What is the "drift detection" problem at scale, and how do tools like `driftctl` or Terraform's built-in `refresh` compare?
5. How do you implement progressive infrastructure rollouts (blue/green at the infrastructure layer, not the application layer)?
On Helm:
6. How do you build a shared library chart that provides common templates (`_helpers.tpl`) across 20 different application charts?
7. What is the Helm chart repository pattern, and how do you publish charts to a private OCI registry (ECR) versus a traditional `index.yaml` repo?
8. How does `helm diff` (the plugin) enable safer upgrades by showing the exact manifest diff before applying?
9. When does it make sense to switch from Helm to Kustomize, or to use both together (`helm template | kustomize build`)?
10. How do you manage Helm-installed secrets safely using `helm-secrets` with SOPS and age encryption?
On ArgoCD:
11. How does Argo Rollouts extend ArgoCD to support canary deployments and blue/green strategies at the Deployment level?
12. What is the ApplicationSet CRD, and how does it let you generate Application manifests for hundreds of microservices from a single template?
13. How do you configure ArgoCD to use SSO (Okta, GitHub) and map SSO groups to AppProject roles?
14. How does the Argo CD Image Updater work, and when should you use it instead of a CI-driven manifest commit?
15. What are the tradeoffs between running ArgoCD in the same cluster it manages versus a dedicated management cluster?
On Agno and AI Agents:
16. How do you implement multi-agent orchestration in Agno where a coordinator agent delegates to specialized sub-agents across different Kubernetes Deployments?
17. What observability stack (Prometheus, Grafana, OpenTelemetry) do you wire to Agno's FastAPI layer for tracking LLM latency, token counts, and session counts?
18. How do you handle LLM API rate limits (Anthropic, OpenAI) in a horizontally scaled Agno deployment with multiple pods hitting the same API?
19. What is the pattern for A/B testing two LLM models (Claude Sonnet vs Haiku) behind the same Agno endpoint using Kubernetes traffic splitting?
20. How do you implement a cost circuit breaker that pauses the Agno agent HPA when daily LLM spend exceeds a threshold?
Sources
- HashiCorp. Terraform Documentation — Backend Configuration. https://developer.hashicorp.com/terraform/language/backend
- HashiCorp. Terraform Registry — AWS VPC Module. https://registry.terraform.io/modules/terraform-aws-modules/vpc/aws/latest
- HashiCorp. Terraform Registry — AWS EKS Module. https://registry.terraform.io/modules/terraform-aws-modules/eks/aws/latest
- HashiCorp. Terraform Registry — AWS IAM Module (IRSA). https://registry.terraform.io/modules/terraform-aws-modules/iam/aws/latest
- AWS Documentation. IAM Roles for Service Accounts (IRSA). https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html
- Helm. The Chart Template Developer’s Guide. https://helm.sh/docs/chart_template_guide/
- Helm. Chart Hooks. https://helm.sh/docs/topics/charts_hooks/
- Helm. Helm Best Practices. https://helm.sh/docs/chart_best_practices/
- Argo Project. ArgoCD — Getting Started. https://argo-cd.readthedocs.io/en/stable/getting_started/
- Argo Project. ArgoCD — Application CRD Reference. https://argo-cd.readthedocs.io/en/stable/operator-manual/application.yaml
- Argo Project. ArgoCD — App of Apps Pattern. https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/
- Argo Project. ArgoCD — Sync Waves and Hooks. https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/
- Argo Project. Argo CD Image Updater. https://argocd-image-updater.readthedocs.io/en/stable/
- Argo Project. Argo Rollouts — Progressive Delivery. https://argoproj.github.io/argo-rollouts/
- Agno. Agno Documentation — AgentOS. https://docs.agno.com/agentos/introduction
- Agno. Agno Documentation — Storage Backends. https://docs.agno.com/agents/storage
- Weaveworks. Guide to GitOps. https://www.weave.works/technologies/gitops/
- External Secrets Operator. Introduction and AWS Secrets Manager Integration. https://external-secrets.io/latest/provider/aws-secrets-manager/
- CNCF. GitOps Principles v1.0. https://opengitops.dev/
- GitHub Actions. OIDC Authentication to AWS. https://docs.github.com/en/actions/security-for-github-actions/security-hardening-your-deployments/configuring-openid-connect-in-amazon-web-services