Production AI Infrastructure

Production GitOps: Terraform, Helm, and ArgoCD for Node.js APIs and Agno Agents

A deep-dive into provisioning infrastructure with Terraform, packaging apps with Helm, and continuously deploying via ArgoCD — with real config for a Node.js API and an Agno Python agent. Structured across five expertise levels from Basic to Legendary.


Modern production deployments involve three distinct layers: infrastructure provisioning, application packaging, and continuous delivery. This article ties all three together using Terraform, Helm, and ArgoCD — from first principles to architectural-level insights — showing how they compose into a coherent GitOps pipeline for both a Node.js REST API and an Agno Python agent (a FastAPI-based AI agent framework).

Each section builds across five expertise levels so you can read at your own depth.


The GitOps Model: Why Three Tools?

Before diving in, understand what problem each tool solves:

  Tool        Problem it solves                                     Where it runs
  Terraform   "What infrastructure exists?"                         CI runner / developer machine
  Helm        "How is this app configured for each environment?"    Git repository
  ArgoCD      "Does the cluster match what Git says?"               Inside the cluster

GitOps is the model that ties them together. Git is the single source of truth. Every change — a new image tag, a scaling adjustment, a config tweak — enters the system as a commit. An automated reconciler continuously compares what Git says should exist against what actually runs in the cluster, and corrects any drift.

The two deployment models in practice:

  • Push-based: CI pipeline calls kubectl apply after merge. CI needs cluster credentials.
  • Pull-based: An in-cluster agent polls Git and applies changes. No external system needs cluster access.

ArgoCD uses the pull model. The agent inside the cluster does the work — which means your CI pipeline never needs a kubeconfig.
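The pull model fits in a few lines of code. This is a toy sketch of the reconcile loop, not ArgoCD's actual implementation — fetch_desired and fetch_live are hypothetical stand-ins for a Git checkout and a cluster query:

```python
def fetch_desired():
    # Stand-in for: git checkout of the manifests repo + helm template
    return {"node-api": {"image": "acme/node-api:abc1234", "replicas": 3}}

def fetch_live():
    # Stand-in for: querying the Kubernetes API in the destination namespace
    return {"node-api": {"image": "acme/node-api:2.1.0", "replicas": 3}}

def diff(desired, live):
    """Apps whose live state no longer matches what Git declares."""
    return {name: spec for name, spec in desired.items() if live.get(name) != spec}

def reconcile_once(apply):
    """One pass of the pull-based loop: Git always wins."""
    drift = diff(fetch_desired(), fetch_live())
    for name, spec in drift.items():
        apply(name, spec)   # correct the drift in-cluster
    return drift

# node-api is OutOfSync: the live image tag differs from Git
changed = reconcile_once(lambda name, spec: None)
```

The crucial property is that the loop only ever reads Git and writes to the cluster — nothing outside the cluster ever needs write credentials.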


Data Flow: The Full Picture

┌─────────────────────────────────────────────────────────────┐
│  INFRASTRUCTURE LAYER (Terraform)                           │
│                                                             │
│  Developer  ──git push──►  CI Runner                       │
│                               │                            │
│                          terraform plan                     │
│                          terraform apply                    │
│                               │                            │
│              ┌────────────────┼───────────────┐            │
│              ▼                ▼               ▼            │
│           VPC + EKS        RDS (PG)        IAM/IRSA        │
│                               │                            │
│              outputs ─────────┼──────────────►             │
│              (cluster_endpoint, role_arns, rds_endpoint)   │
└───────────────────────────────┼─────────────────────────── ┘


┌─────────────────────────────────────────────────────────────┐
│  APPLICATION LAYER (Helm)                                   │
│                                                             │
│  k8s-manifests repo (Git)                                  │
│  ├── charts/                                               │
│  │   ├── node-api/         ← Helm chart                    │
│  │   │   ├── Chart.yaml                                    │
│  │   │   ├── values.yaml   ← defaults                      │
│  │   │   └── templates/    ← K8s manifest templates        │
│  │   └── agno-agent/       ← Helm chart                    │
│  ├── environments/                                         │
│  │   ├── staging/values.yaml  ← env overrides              │
│  │   └── production/values.yaml                            │
│  └── apps/                                                 │
│      ├── root.yaml         ← ArgoCD bootstrap              │
│      ├── node-api.yaml     ← ArgoCD Application CRD        │
│      └── agno-agent.yaml   ← ArgoCD Application CRD        │
└───────────────────────────────┬─────────────────────────── ┘


┌─────────────────────────────────────────────────────────────┐
│  DELIVERY LAYER (ArgoCD, inside EKS)                        │
│                                                             │
│  ArgoCD polls Git every 3m (or via webhook)                │
│       │                                                    │
│       ├─ helm template → renders manifests                 │
│       ├─ diff rendered vs live cluster state               │
│       │                                                    │
│       ├─ OutOfSync detected                                │
│       │   │                                               │
│       │   ├── PreSync hooks (db-migrate Job)              │
│       │   ├── Apply Deployment (rolling update)           │
│       │   └── PostSync hooks (smoke test)                 │
│       │                                                    │
│       └─ Status: Synced / Healthy                          │
└─────────────────────────────────────────────────────────────┘

CI Pipeline Flow: Developer Push to Running Pod

Developer

  │  git push origin main  (Node.js API source repo)

GitHub Actions
  ├─ 1. Run tests (npm test)
  ├─ 2. Authenticate to AWS via OIDC (no long-lived secrets)
  ├─ 3. docker build → push to ECR
  │       tag: acme/node-api:abc1234  (git SHA)
  └─ 4. Checkout k8s-manifests repo
         yq -i ".image.tag = \"abc1234\"" environments/production/values.yaml
         git commit + git push

k8s-manifests repo
  │  (now has new image tag in production values)

ArgoCD controller (in-cluster, polling every 3 min)
  │  Detects: image.tag 2.1.0 → abc1234 → OutOfSync

  ├─ Renders Helm chart with new values
  ├─ Runs PreSync hook: db-migrate Job (waits for completion)
  ├─ kubectl apply Deployment (RollingUpdate: maxUnavailable=0)
  │     K8s pulls acme/node-api:abc1234 from ECR
  │     New pods → readinessProbe /health → 200 OK → Ready
  │     Old pods → Terminating
  └─ PostSync hooks: smoke test, Slack notification

Cluster
  └─ 3 running pods of node-api:abc1234
     behind Service → Ingress → ALB → Internet

1. Terraform: Infrastructure as Code

Basic — What is it?

Terraform lets you describe your cloud infrastructure using code instead of clicking through UIs. You write files that say “I want a Kubernetes cluster with 3 worker nodes and a PostgreSQL database” and Terraform makes it happen. If you run it again, it only changes what’s different — it’s idempotent.

Think of it as a blueprint: the blueprint doesn’t change every time you look at it, but you can version it, review it, and roll it back.

Three things Terraform manages:

  • Providers: plugins that talk to cloud APIs (AWS, GCP, Azure)
  • Resources: individual cloud objects (VPC, EC2, RDS, S3 bucket)
  • State: a JSON file recording what Terraform created last time
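To make "idempotent" concrete, here is a minimal declaration (bucket name is illustrative). Running terraform apply once creates the bucket; running it again reports "No changes" because the real world already matches the code:

```hcl
# main.tf — one declarative resource
resource "aws_s3_bucket" "assets" {
  bucket = "acme-assets-example"
}
```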

Medium — Core Structure and Commands

infra/
├── main.tf          # Provider config and resource declarations
├── variables.tf     # Input variable declarations with types and validation
├── outputs.tf       # Values that other configs can consume
├── terraform.tfvars # Actual variable values (gitignore secrets!)
└── modules/
    ├── vpc/         # Reusable VPC component
    ├── eks/         # Reusable EKS component
    └── rds/         # Reusable RDS component

Essential commands:

terraform init          # Download providers and modules
terraform validate      # Check HCL syntax (no API calls)
terraform plan          # Preview changes — always review before applying
terraform apply         # Execute the plan
terraform output        # Print output values
terraform workspace new staging   # Create named state slice
terraform workspace select prod   # Switch environment context
terraform destroy       # Tear down (never run in prod without deliberation)

Advanced — Remote State, Modules, and IRSA

Never use local state in production. Store state in S3 with DynamoDB locking so multiple engineers work safely:

# main.tf
terraform {
  required_version = ">= 1.7"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"
    }
  }

  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "production/eks/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

provider "aws" {
  region = var.region

  default_tags {
    tags = {
      ManagedBy   = "terraform"
      Environment = var.environment
      Project     = "acme-api"
    }
  }
}

Variables with validation prevent typos from reaching terraform apply:

# variables.tf
variable "environment" {
  description = "Deployment environment"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "db_password" {
  description = "Master password for RDS"
  type        = string
  sensitive   = true   # never printed in plan output
}

VPC + EKS using community modules (handles 200+ lines of boilerplate):

locals {
  cluster_name = "acme-${var.environment}"
}

data "aws_availability_zones" "available" { state = "available" }

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.5"

  name = "${local.cluster_name}-vpc"
  cidr = "10.0.0.0/16"
  azs  = slice(data.aws_availability_zones.available.names, 0, 3)

  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = false   # one NAT per AZ for HA
  enable_dns_hostnames = true

  # Required tags for EKS subnet auto-discovery
  private_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/internal-elb"             = "1"
  }
  public_subnet_tags = {
    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
    "kubernetes.io/role/elb"                      = "1"
  }
}

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.8"

  cluster_name    = local.cluster_name
  cluster_version = "1.30"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  enable_irsa = true   # pods assume IAM roles without node-level credentials

  eks_managed_node_groups = {
    general = {
      instance_types = var.node_instance_types
      min_size       = local.node_min
      max_size       = local.node_max
      desired_size   = 3
      disk_size      = 50
    }
  }
}

IRSA (IAM Roles for Service Accounts): pods assume IAM roles directly — no credentials in env vars or Secrets:

module "node_api_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.39"

  role_name = "${local.cluster_name}-node-api"

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["apps:node-api"]   # namespace:serviceaccount
    }
  }

  role_policy_arns = {
    s3 = aws_iam_policy.node_api_s3.arn
  }
}

Expert — Workspace Strategy, Drift Detection, and State Operations

Workspaces create separate state slices within one configuration:

locals {
  is_prod  = terraform.workspace == "prod"
  node_min = local.is_prod ? 3 : 1
  node_max = local.is_prod ? 20 : 5
  db_class = local.is_prod ? "db.r6g.large" : "db.t3.micro"
}
terraform workspace new staging
terraform workspace select staging
terraform apply -var-file=environments/staging.tfvars

Expert pitfall: workspaces share the same provider credentials. If production needs a separate AWS account (recommended for blast-radius isolation), use separate backend configs with separate state files, not just workspaces.
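One way to structure this (a sketch; file names and bucket names are illustrative) is partial backend configuration: leave the backend "s3" block empty of environment-specific values and supply them at init time with terraform init -backend-config=backends/prod.hcl:

```hcl
# backends/prod.hcl — state settings for the production AWS account
bucket         = "acme-terraform-state-prod"
key            = "eks/terraform.tfstate"
region         = "us-east-1"
dynamodb_table = "terraform-state-lock-prod"
```

Each account gets its own state bucket and lock table, so a mistake in staging physically cannot touch production state.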

Importing existing resources into state (when you have manually created infra):

# Import an existing RDS instance into state without recreating it
terraform import aws_db_instance.main acme-prod

State surgery when things go wrong:

terraform state list             # list all managed resources
terraform state show aws_db_instance.main  # inspect a single resource
terraform state mv aws_s3_bucket.old aws_s3_bucket.new  # rename without destroy
terraform state rm aws_db_instance.main    # stop tracking without deleting

Legendary — The Philosophical Design Decision

Terraform’s biggest architectural bet is convergent reconciliation over imperative scripting. Rather than writing shell scripts that say “create X, then create Y, then configure Z,” you declare the target state and let the tool compute the shortest path to it.

This creates an immutable audit trail: terraform plan is a diff between desired and actual state. Every terraform apply is a committed state transition, not a side-effecting procedure. Combine this with remote state and you have a distributed coordination system — multiple engineers can propose plans simultaneously; only one can apply at a time (DynamoDB lock).

The implication for scaling teams: as your infrastructure grows, the right investment is not more HCL but better module composition. Each module should be independently testable (using terratest), versioned separately, and published to a private registry. The root configuration then composes modules the same way application code composes libraries — with version pinning, not copy-paste.

The known scaling ceiling: Terraform’s refresh phase reads every managed resource from cloud APIs on every plan. At 1,000+ resources in a single state file, this becomes a bottleneck (3–5 minute plans). The solution is state file partitioning by domain (networking, compute, databases, IAM) with cross-state data sources.
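A cross-state reference then looks like this (a sketch assuming a separately-managed networking state):

```hcl
# Read outputs exported by the networking state file
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "acme-terraform-state"
    key    = "production/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

# Consumed like any other value, e.g.:
# subnet_ids = data.terraform_remote_state.network.outputs.private_subnets
```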


2. Helm: Packaging the Application

Basic — What is it?

Helm is a package manager for Kubernetes — like npm for Node.js or pip for Python. Instead of maintaining raw YAML files for every environment, you write templates with variables. Helm renders the templates with the right values for each environment and applies everything to the cluster.

A Helm chart is a folder with:

  • Chart.yaml — metadata (name, version)
  • values.yaml — default configuration
  • templates/ — Kubernetes manifests with {{ .Values.something }} placeholders

A Helm release is one deployed instance of a chart. You can have node-api-staging and node-api-prod from the same chart with different values.

Medium — Chart Structure and Core Templates

node-api/
├── Chart.yaml
├── values.yaml
├── .helmignore
└── templates/
    ├── _helpers.tpl        # Named templates (reusable partials)
    ├── deployment.yaml
    ├── service.yaml
    ├── ingress.yaml
    ├── hpa.yaml            # Horizontal Pod Autoscaler
    ├── serviceaccount.yaml
    ├── configmap.yaml
    └── hooks/
        └── db-migrate.yaml # Pre-upgrade migration Job
# Chart.yaml
apiVersion: v2
name: node-api
description: ACME Node.js REST API
type: application
version: 0.5.0       # chart version — bump when chart structure changes
appVersion: "2.1.0"  # application version — CI sets this from image tag
# values.yaml — defaults for all environments
replicaCount: 3

image:
  repository: 123456789.dkr.ecr.us-east-1.amazonaws.com/acme/node-api
  tag: "2.1.0"
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 80
  targetPort: 3000

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "1000m"
    memory: "512Mi"

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: ""   # set per environment via IRSA

env:
  NODE_ENV: production
  PORT: "3000"
  LOG_LEVEL: info

envSecret:
  DB_URL: ""   # injected from a Kubernetes Secret

probes:
  readiness:
    path: /health
    initialDelaySeconds: 10
    periodSeconds: 5
  liveness:
    path: /health
    initialDelaySeconds: 30
    periodSeconds: 10

migration:
  enabled: true   # enables the pre-upgrade db-migrate hook Job

Key install/upgrade commands:

# Install (first time)
helm install node-api ./charts/node-api \
  --namespace apps \
  --create-namespace \
  --values environments/production/values.yaml \
  --set image.tag=2.1.0

# Upgrade (idempotent in CI — use --install flag)
helm upgrade --install node-api ./charts/node-api \
  --namespace apps \
  --values environments/production/values.yaml \
  --set image.tag=2.2.0 \
  --wait \
  --timeout 5m

# Roll back to the previous revision (revision 0 means "previous")
helm rollback node-api 0 -n apps

# View full release history
helm history node-api -n apps

Advanced — Templates, Helpers, and Hooks

_helpers.tpl defines named templates that are reused across all manifests:

{{/*
Base chart name, truncated to the 63-char Kubernetes label limit.
*/}}
{{- define "node-api.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}

{{/*
Fully qualified name (Release.Name-Chart.Name, max 63 chars).
*/}}
{{- define "node-api.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}

{{/* Common labels for all resources */}}
{{- define "node-api.labels" -}}
helm.sh/chart: {{ printf "%s-%s" .Chart.Name .Chart.Version }}
app.kubernetes.io/name: {{ include "node-api.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/version: {{ .Values.image.tag | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}

{{/* Selector labels — used by Service and Deployment */}}
{{- define "node-api.selectorLabels" -}}
app.kubernetes.io/name: {{ include "node-api.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

The Deployment template uses include + nindent for proper indentation, and the checksum/config annotation triggers a rollout whenever a ConfigMap changes:

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "node-api.fullname" . }}
  labels:
    {{- include "node-api.labels" . | nindent 4 }}
  annotations:
    # Force a rollout when ConfigMap content changes
    checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "node-api.selectorLabels" . | nindent 6 }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0   # zero-downtime: full capacity before removing old pods
  template:
    metadata:
      labels:
        {{- include "node-api.selectorLabels" . | nindent 8 }}
    spec:
      serviceAccountName: {{ include "node-api.fullname" . }}
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.targetPort }}
              protocol: TCP
          env:
            {{- range $key, $val := .Values.env }}
            - name: {{ $key }}
              value: {{ $val | quote }}
            {{- end }}
            {{- range $key, $val := .Values.envSecret }}
            - name: {{ $key }}
              valueFrom:
                secretKeyRef:
                  name: {{ include "node-api.fullname" $ }}-secret
                  key: {{ $key }}
            {{- end }}
          readinessProbe:
            httpGet:
              path: {{ .Values.probes.readiness.path }}
              port: http
            initialDelaySeconds: {{ .Values.probes.readiness.initialDelaySeconds }}
            periodSeconds: {{ .Values.probes.readiness.periodSeconds }}
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: {{ .Values.probes.liveness.path }}
              port: http
            initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }}
            periodSeconds: {{ .Values.probes.liveness.periodSeconds }}
            failureThreshold: 3
          resources:
            {{- toYaml .Values.resources | nindent 12 }}

Pre-upgrade database migration hook: runs a Job before new pods start, blocks the rollout until migrations succeed:

# templates/hooks/db-migrate.yaml
{{- if .Values.migration.enabled }}
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "node-api.fullname" . }}-migrate-{{ .Release.Revision }}
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          command: ["node", "dist/db/migrate.js"]
          env:
            - name: DB_URL
              valueFrom:
                secretKeyRef:
                  name: {{ include "node-api.fullname" . }}-secret
                  key: DB_URL
{{- end }}

Expert — Helm Template Rendering Flow

helm upgrade node-api ./charts/node-api --values prod.yaml


      Load Chart.yaml metadata


      Merge values (precedence: --set > --values files > values.yaml)


      Render each template in templates/ via Go text/template
          ├─ Evaluate {{ define }} blocks in _helpers.tpl
          ├─ Resolve {{ include "..." . }} calls
          ├─ Apply | pipeline functions (nindent, quote, sha256sum, toYaml)
          └─ Conditional blocks: {{- if .Values.autoscaling.enabled }}


      Validate rendered YAML against Kubernetes OpenAPI schema


      Execute hooks (pre-upgrade Jobs) — wait for completion


      Apply the rendered manifests via the Kubernetes API (three-way merge)


      Wait for Deployment rollout if --wait flag set

Expert pitfalls:

  • Never generate randAlphaNum (or other Sprig random functions) in templates — they re-evaluate on every upgrade, causing spurious Secret updates on every deploy.
  • Never use latest as an image tag — it breaks reproducibility and prevents rollback.
  • helm lint + helm template in CI catches rendering errors before they reach the cluster.
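A minimal CI guard looks like this (a sketch of a GitHub Actions step; adapt paths to your repo layout):

```yaml
- name: Lint and render Helm chart
  run: |
    helm lint charts/node-api
    helm template node-api charts/node-api \
      --values environments/production/values.yaml > /dev/null
```

Rendering with production values in CI means a typo in a values file fails the pull request, not the deploy.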

Legendary — Helm as a Configuration Interface Contract

The deepest insight about Helm is that values.yaml is not configuration — it is a public API. Every key in values.yaml is a contract between the chart author and the chart consumer. Breaking changes (removing a key, changing its type) require a semver major bump of Chart.version.

This becomes critical at scale: if your platform team maintains a chart used by 20 application teams, each team’s CI pipeline sets --set image.tag=... and --values team-overrides.yaml. Any breaking change to the chart’s values schema must be versioned, communicated, and migrated — exactly like breaking changes in an npm package.

The advanced pattern is library charts: a type: library chart that contains only _helpers.tpl definitions, published separately to a Helm registry. Application charts depend on it (dependencies in Chart.yaml), enabling shared template logic across all your services. This is the Helm equivalent of a design system — one source of truth for labels, probes, resource defaults, and security contexts.
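An application chart consumes a library chart through its dependencies block (chart name and registry URL here are illustrative):

```yaml
# Chart.yaml of an application chart
apiVersion: v2
name: node-api
type: application
version: 0.5.0
dependencies:
  - name: acme-lib        # a type: library chart holding shared _helpers.tpl
    version: "1.2.0"
    repository: "oci://123456789.dkr.ecr.us-east-1.amazonaws.com/charts"
```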


3. ArgoCD: Continuous GitOps Delivery

Basic — What is it?

ArgoCD watches a Git repository and keeps your Kubernetes cluster synchronized with it. You tell ArgoCD “this folder in this Git repo is the desired state of namespace X in cluster Y.” ArgoCD checks every few minutes. If anything drifts — someone manually edits a Deployment, a new config is pushed — ArgoCD corrects it.

The key concept: you never kubectl apply in production. Git is the only way to change production. ArgoCD is the enforcer.

Medium — The Application CRD

The core ArgoCD primitive is the Application custom resource. It’s the link between Git and the cluster:

# apps/node-api.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: node-api
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io  # cascade-delete on removal
spec:
  project: acme

  source:
    repoURL: https://github.com/acme-org/k8s-manifests.git
    targetRevision: main
    path: charts/node-api
    helm:
      releaseName: node-api
      valueFiles:
        - values.yaml
        - environments/production/values.yaml
      parameters:
        - name: image.tag
          value: "2.1.0"   # CI updates this on every build

  destination:
    server: https://kubernetes.default.svc
    namespace: apps

  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert manual cluster edits
      allowEmpty: false # never sync to empty state
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

  revisionHistoryLimit: 10

Installing ArgoCD:

kubectl create namespace argocd
kubectl apply -n argocd --server-side \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Get initial admin password
argocd admin initial-password -n argocd

# Access the UI locally
kubectl port-forward svc/argocd-server -n argocd 8080:443
argocd login localhost:8080

Advanced — App of Apps, Projects, and Sync Waves

AppProject scopes what repos, clusters, and namespaces a team’s apps can touch:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: acme
  namespace: argocd
spec:
  description: ACME production applications
  sourceRepos:
    - https://github.com/acme-org/k8s-manifests.git
  destinations:
    - server: https://kubernetes.default.svc
      namespace: apps
    - server: https://kubernetes.default.svc
      namespace: agents
  clusterResourceWhitelist:
    - group: ""
      kind: Namespace
  roles:
    - name: developer
      policies:
        - p, proj:acme:developer, applications, get,  acme/*, allow
        - p, proj:acme:developer, applications, sync, acme/*, allow
      groups:
        - acme-developers   # SSO group mapping

App of Apps bootstraps an entire environment from Git. One root Application manages all other Application manifests:

k8s-manifests/
└── apps/
    ├── root.yaml          ← applied once manually to bootstrap
    ├── node-api.yaml
    ├── agno-agent.yaml
    ├── monitoring.yaml
    └── ingress-nginx.yaml
# apps/root.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root-apps
  namespace: argocd
spec:
  project: acme
  source:
    repoURL: https://github.com/acme-org/k8s-manifests.git
    targetRevision: main
    path: apps              # directory of Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

After kubectl apply -f apps/root.yaml once, ArgoCD manages itself and every child Application from Git. Adding a new service = adding one YAML file and committing.

Sync Waves enforce ordering within a single sync:

# Wave -1: deploy databases first
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
# Wave 0 + PreSync hook: run migrations after DB is ready
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/sync-wave: "0"
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
# Wave 1: deploy the API after migrations complete
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "1"

Custom health check in Lua (for CRDs without built-in health):

# argocd-cm ConfigMap
data:
  resource.customizations.health.batch_CronJob: |
    hs = {}
    if obj.status ~= nil then
      if obj.status.lastScheduleTime ~= nil then
        hs.status = "Healthy"
        hs.message = "Last scheduled: " .. obj.status.lastScheduleTime
        return hs
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for first schedule"
    return hs

Expert — Reconciliation Loop Internals

ArgoCD Application Controller (reconcile loop, every 3 minutes or webhook-triggered)

         ├─ 1. DESIRED STATE: git clone / sparse-checkout the target path
         │         run: helm template <releaseName> <path> --values ...
         │         output: list of Kubernetes manifests (JSON)

         ├─ 2. LIVE STATE: kubectl get all resources in destination namespace
         │         output: current cluster state (JSON)

         ├─ 3. DIFF: three-way merge
         │         base:    last-applied annotation on live objects
         │         desired: rendered manifests from step 1
         │         live:    current cluster state from step 2
         │         result:  list of changes (add / modify / delete)

         ├─ 4. if diff is empty → Application is Synced, skip

         └─ 5. if diff exists → Application is OutOfSync
                   ├─ if syncPolicy.automated → trigger sync
                   └─ if manual → wait for user action


                   Execute sync:
                   ├─ PreSync hooks (Jobs) → wait for completion
                   ├─ Apply wave -1 resources → wait for health
                   ├─ Apply wave 0 resources → wait for health
                   ├─ Apply wave 1 resources → wait for health
                   └─ PostSync hooks

Expert pitfall: prune: true will delete Kubernetes resources that are no longer in Git — including ConfigMaps you added manually, PVCs, Secrets managed outside ArgoCD. Test pruning behavior in staging with --dry-run before enabling in production.
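Resources that must survive pruning can opt out individually with ArgoCD's documented sync-options annotation:

```yaml
# This PVC is kept even if it disappears from Git
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: node-api-data
  annotations:
    argocd.argoproj.io/sync-options: Prune=false
```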

Legendary — The Philosophical Shift: Operational Events as Commits

ArgoCD’s deepest implication is organizational, not technical. When every production change is a Git commit, on-call incidents change shape. “What changed at 2am?” is a git log command, not an Ops investigation. “Roll back the broken deploy” is git revert + wait 3 minutes, not a Helm rollback command with stale state.

The scaling challenge: GitOps creates pressure to separate the application repo from the manifests repo. If you store Helm values in the same repo as Node.js source code, every git push triggers both a code build and an ArgoCD sync evaluation. At scale you want: application repo for source changes (triggers CI, produces images), manifests repo for deployment changes (ArgoCD watches this). CI’s only job is to update the image tag in the manifests repo. This architecture is called push-to-deploy via image updater and can be partially automated with the Argo CD Image Updater project.


4. Deploying an Agno Agent

Basic — What is Agno?

Agno is a Python framework for building AI agents. You define an agent (what model it uses, what tools it has, how it stores memory), and Agno wraps it in a FastAPI application — a full HTTP API server with streaming, session management, and built-in /health endpoints.

From a deployment perspective, an Agno agent is just a FastAPI app. The same Helm + ArgoCD workflow applies. The differences are: it needs an LLM API key as a Secret, it uses more memory (LLM inference buffers), and it needs a persistent database for session state — not a local SQLite file.

Medium — Agno App and Dockerfile

# agent.py
import os

from agno.agent import Agent
from agno.models.anthropic import Claude
from agno.storage.postgres import PostgresStorage
from agno.run.fastapi import AgentOS
from fastapi import FastAPI

agent = Agent(
    model=Claude(id="claude-sonnet-4-5"),
    description="ACME support agent",
    instructions=["Always be concise", "Use bullet points"],
    storage=PostgresStorage(
        table_name="agent_sessions",
        db_url=os.environ["DATABASE_URL"],
    ),
    add_history_to_messages=True,
    num_history_responses=5,
)

app: FastAPI = AgentOS(agents=[agent]).get_app()
# Dockerfile
FROM python:3.12-slim

WORKDIR /app

# System deps for PostgreSQL driver
RUN apt-get update && apt-get install -y \
    libpq-dev gcc \
    && rm -rf /var/lib/apt/lists/*

# Layer caching: install deps before copying source
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY agent.py .

# Non-root user for security
RUN useradd --create-home appuser
USER appuser

EXPOSE 8000

# uvicorn is already in requirements.txt; the fastapi CLI is a separate package
CMD ["uvicorn", "agent:app", "--host", "0.0.0.0", "--port", "8000"]

# requirements.txt
agno[os]>=1.0.0
anthropic>=0.30.0
fastapi>=0.111.0
uvicorn[standard]>=0.29.0
psycopg2-binary>=2.9.9

Advanced — Helm Values and ArgoCD Application for Agno

The Agno chart is structurally identical to the Node.js chart. Key differences in values.yaml:

# charts/agno-agent/values.yaml
replicaCount: 2

image:
  repository: 123456789.dkr.ecr.us-east-1.amazonaws.com/acme/agno-agent
  tag: "1.0.0"
  pullPolicy: IfNotPresent

service:
  type: ClusterIP
  port: 80
  targetPort: 8000

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2000m"
    memory: "2Gi"   # LLM inference buffers are large

envSecrets:
  ANTHROPIC_API_KEY: ""   # never in Git — injected from K8s Secret
  DATABASE_URL: ""         # postgres://user:pass@rds-endpoint:5432/acme

probes:
  readiness:
    path: /health
    initialDelaySeconds: 15   # model clients need time to initialize
    periodSeconds: 10
  liveness:
    path: /health
    initialDelaySeconds: 60   # longer delay — LLM client startup is slow
    periodSeconds: 30

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60   # scale earlier; LLM calls are bursty

persistence:
  enabled: false   # use PostgreSQL, not local SQLite
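
How envSecrets reaches the container is defined in the chart's templates. A hypothetical deployment.yaml fragment that maps each key to the ESO-managed Secret might read:

```yaml
# charts/agno-agent/templates/deployment.yaml (fragment — illustrative)
env:
  {{- range $name, $_ := .Values.envSecrets }}
  - name: {{ $name }}
    valueFrom:
      secretKeyRef:
        name: agno-agent-secret   # Secret created by External Secrets Operator
        key: {{ $name }}
  {{- end }}
```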

ArgoCD Application (keep AI workloads in a dedicated namespace):

# apps/agno-agent.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: agno-agent
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: acme
  source:
    repoURL: https://github.com/acme-org/k8s-manifests.git
    targetRevision: main
    path: charts/agno-agent
    helm:
      releaseName: agno-agent
      valueFiles:
        - values.yaml
        - environments/production/agno-values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: agents   # dedicated namespace for AI workloads
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    retry:
      limit: 3
      backoff:
        duration: 10s
        factor: 2
        maxDuration: 2m

Production value overrides:

# environments/production/agno-values.yaml
replicaCount: 3

image:
  tag: "1.2.0"

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::123456789:role/acme-prod-agno-agent"

resources:
  requests:
    cpu: "1000m"
    memory: "1Gi"
  limits:
    cpu: "4000m"
    memory: "4Gi"

autoscaling:
  maxReplicas: 20

env:
  DATABASE_POOL_SIZE: "10"

Expert — Secret Management for LLM Keys

The ANTHROPIC_API_KEY must never appear in Git. The production pattern is External Secrets Operator (ESO) pulling from AWS Secrets Manager:

# ExternalSecret — creates a K8s Secret from AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: agno-agent-secrets
  namespace: agents
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: agno-agent-secret   # creates this K8s Secret
    creationPolicy: Owner
  data:
    - secretKey: ANTHROPIC_API_KEY
      remoteRef:
        key: acme/prod/agno-agent
        property: anthropic_api_key
    - secretKey: DATABASE_URL
      remoteRef:
        key: acme/prod/agno-agent
        property: database_url

This ExternalSecret manifest lives in the manifests repo and is applied by ArgoCD. The actual secret values live only in AWS Secrets Manager — never in Git.
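The aws-secrets-manager store it references is defined once per cluster. A sketch, assuming IRSA-based auth (the service account name and namespace are assumptions):

```yaml
# ClusterSecretStore — tells ESO how to reach AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets   # IRSA-annotated service account
            namespace: external-secrets
```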

Legendary — Agno Architecture Implications at Scale

Agno’s stateless FastAPI design (agent logic + HTTP layer, state in PostgreSQL) maps cleanly to Kubernetes horizontal scaling. But LLM-serving workloads break HPA assumptions: CPU utilization during an LLM call is low (the work happens in the API provider), but latency is high and concurrency is limited by rate limits, not compute.

This means CPU-based HPA will under-scale. The advanced pattern is custom metrics HPA using in-flight request count or queue depth:

# HPA based on in-flight requests (via KEDA or a custom metrics adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agno-agent
  namespace: agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agno-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: agno_agent_in_flight_requests
          selector:
            matchLabels:
              app: agno-agent
        target:
          type: AverageValue
          averageValue: "5"   # scale out when pods average more than 5 concurrent requests
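The HPA can only act on this metric if the pods export it. A minimal sketch of the pod side, assuming the prometheus_client library and a FastAPI app object; the metric name matches the HPA example above, and the instrument helper is illustrative:

```python
# Export an in-flight request gauge that a Prometheus custom-metrics
# adapter (or KEDA's prometheus scaler) can feed to the HPA.
from prometheus_client import Gauge, make_asgi_app

IN_FLIGHT = Gauge(
    "agno_agent_in_flight_requests",
    "Requests currently being processed by this pod",
)

def instrument(app):
    """Attach an in-flight counter and a /metrics scrape endpoint to a FastAPI app."""
    @app.middleware("http")
    async def track_in_flight(request, call_next):
        IN_FLIGHT.inc()          # one more request in flight
        try:
            return await call_next(request)
        finally:
            IN_FLIGHT.dec()      # decrement even if the handler raises

    app.mount("/metrics", make_asgi_app())   # Prometheus scrape target
    return app
```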

For multi-agent architectures (Agno supports orchestrating multiple specialized agents), each agent type should have its own Deployment with independent HPA targets tuned to its workload pattern. A research-agent that makes many sequential LLM calls has different scaling behavior than a summary-agent that makes one call per request.


5. CI Pipeline: Tying It All Together

# .github/workflows/deploy.yml
name: Build and Deploy

on:
  push:
    branches: [main]
    paths: ["src/**", "package*.json", "Dockerfile"]

env:
  AWS_REGION: us-east-1
  ECR_REGISTRY: 123456789.dkr.ecr.us-east-1.amazonaws.com
  IMAGE_NAME: acme/node-api

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # required for OIDC auth to AWS (no long-lived keys)
      contents: read
    outputs:
      image_tag: ${{ steps.meta.outputs.version }}

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789:role/github-ecr-push
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to ECR
        uses: aws-actions/amazon-ecr-login@v2

      - name: Docker metadata (generates reproducible image tags)
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.ECR_REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: type=sha,prefix=,format=short

      - name: Build and push (with layer caching)
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

  update-manifests:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Checkout k8s-manifests repo
        uses: actions/checkout@v4
        with:
          repository: acme-org/k8s-manifests
          token: ${{ secrets.MANIFESTS_REPO_TOKEN }}

      - name: Update image tag with yq
        run: |
          yq -i ".image.tag = \"${{ needs.build-and-push.outputs.image_tag }}\"" \
            environments/production/values.yaml

      - name: Commit and push
        run: |
          git config user.name "GitHub Actions"
          git config user.email "actions@github.com"
          git add environments/production/values.yaml
          git commit -m "chore(node-api): bump image to ${{ needs.build-and-push.outputs.image_tag }}"
          git push

ArgoCD detects the manifest commit within its default three-minute polling interval (or immediately, if a Git webhook is configured) and initiates a sync — no kubectl, no direct cluster access from CI.


6. Production Checklist

Terraform

  • Use remote state with DynamoDB locking — never local state in production
  • Mark secrets sensitive = true — keeps them out of plan logs
  • Pin providers with ~> (minor-compatible), not unpinned
  • Run terraform plan in CI on every PR; apply only on merge to main
  • Tag every resource for cost allocation and access control
  • Use separate AWS accounts per environment, not just workspaces, for blast-radius isolation
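
The first point as a backend block sketch (bucket and table names are assumptions):

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"   # hypothetical bucket
    key            = "prod/infra/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"        # enables state locking
    encrypt        = true
  }
}
```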

Helm

  • Never store secrets in values.yaml committed to Git — use External Secrets Operator
  • Never use latest as image tag — it breaks reproducibility and rollback
  • Use --wait on helm upgrade so CI fails if pods do not become ready
  • Bump Chart.version when chart structure changes; appVersion when app changes
  • Use helm lint + helm template in CI to catch rendering errors before cluster apply
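
A hypothetical CI step implementing the last point (paths match the layout used in this article):

```yaml
# additional workflow step — catches chart errors before ArgoCD ever sees them
- name: Lint and render Helm chart
  run: |
    helm lint charts/node-api
    helm template node-api charts/node-api \
      -f environments/production/values.yaml > /dev/null
```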

ArgoCD

  • Enable selfHeal: true to prevent drift from manual edits
  • Enable prune: true carefully — test in staging first; it deletes resources removed from Git
  • Use App of Apps from day one — retrofitting it later is painful
  • Keep ArgoCD Application manifests in a separate k8s-manifests repo
  • Use sync waves for ordering (CRDs before controllers, databases before APIs, migrations before new app versions)
  • Use AppProject to scope team access — never run production in the default project
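
Sync waves are plain annotations; lower waves sync first. A hypothetical migration Job pinned ahead of the default wave:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # applied before wave-0 resources
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/acme/node-api:1.2.0
          command: ["npm", "run", "migrate"]   # hypothetical migration script
```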

Agno Agents

  • Use PostgreSQL storage in production, not SQLite — SQLite on a PVC does not survive pod rescheduling reliably
  • Inject ANTHROPIC_API_KEY via External Secrets Operator from AWS Secrets Manager — never in Git
  • Set HPA targets at 60% CPU (lower than typical) — LLM workloads are bursty
  • Give liveness probes a 60s initialDelaySeconds — model clients take time to initialize connections
  • Use a dedicated agents namespace — keeps LLM workloads isolated from your API workloads for resource management and RBAC

Follow-Up Questions to Go Deeper

On Terraform:

  1. How do you test Terraform modules in isolation using terratest before merging?
  2. When should you break a monolithic state file into multiple smaller state files? What does the cross-state data source pattern look like?
  3. How do you manage secrets (RDS passwords, API keys) in Terraform without storing them in .tfvars files? Compare Vault, AWS Secrets Manager, and SOPS.
  4. What is the “drift detection” problem at scale, and how do tools like driftctl or Terraform’s built-in refresh compare?
  5. How do you implement progressive infrastructure rollouts (blue/green at the infrastructure layer, not the application layer)?

On Helm:

  6. How do you build a shared library chart that provides common templates (_helpers.tpl) across 20 different application charts?
  7. What is the Helm chart repository pattern, and how do you publish charts to a private OCI registry (ECR) versus a traditional index.yaml repo?
  8. How does helm diff (the plugin) enable safer upgrades by showing the exact manifest diff before applying?
  9. When does it make sense to switch from Helm to Kustomize, or to use both together (helm template | kustomize build)?
  10. How do you manage Helm-installed secrets safely using helm-secrets with SOPS and age encryption?

On ArgoCD:

  11. How does Argo Rollouts extend ArgoCD to support canary deployments and blue/green strategies at the Deployment level?
  12. What is the ApplicationSet CRD, and how does it let you generate Application manifests for hundreds of microservices from a single template?
  13. How do you configure ArgoCD to use SSO (Okta, GitHub) and map SSO groups to AppProject roles?
  14. How does the Argo CD Image Updater work, and when should you use it instead of a CI-driven manifest commit?
  15. What are the tradeoffs between running ArgoCD in the same cluster it manages versus a dedicated management cluster?

On Agno and AI Agents:

  16. How do you implement multi-agent orchestration in Agno where a coordinator agent delegates to specialized sub-agents across different Kubernetes Deployments?
  17. What observability stack (Prometheus, Grafana, OpenTelemetry) do you wire to Agno’s FastAPI layer for tracking LLM latency, token counts, and session counts?
  18. How do you handle LLM API rate limits (Anthropic, OpenAI) in a horizontally scaled Agno deployment with multiple pods hitting the same API?
  19. What is the pattern for A/B testing two LLM models (Claude Sonnet vs Haiku) behind the same Agno endpoint using Kubernetes traffic splitting?
  20. How do you implement a cost circuit breaker that pauses the Agno agent HPA when daily LLM spend exceeds a threshold?


Sources