Most teams meet AI infrastructure the easy way: they call a hosted model API and ship. That works until the bill, the latency, or the data-residency rules force a harder question: what does it actually take to run models yourself? The answer is a stack — compute, data, orchestration, training, serving, and operations — and a shift the industry made loud at the start of 2026: the expensive, hard part is no longer training models, it is serving them (NVIDIA AI inference).
This article is a quick look at that stack, leaning hands-on. We’ll self-host an open-weight LLM on Kubernetes with vLLM and KServe, read the metrics that matter, do the GPU cost math, and finish with the skills engineers and PMs need to be useful around this stack.
TL;DR
- AI infrastructure is a layered system, not a single product: GPU compute → data/storage → orchestration → training → inference/serving → MLOps and observability. A common open-source recipe is Kubernetes + Ray + PyTorch + vLLM (Anyscale).
- 2026 is the “inference economy.” NVIDIA states its Blackwell Ultra (GB300 NVL72) delivers up to 35x lower cost per token than Hopper for low-latency agentic workloads, and frames inference — not training — as the dominant cost (NVIDIA AI inference).
- The hands-on default for self-hosting LLMs is vLLM (the engine, with PagedAttention + continuous batching and an OpenAI-compatible API) wrapped by KServe (autoscaling, canaries, routing) on Kubernetes (vLLM docs, KServe LLMInferenceService).
- GPU utilization is the budget. Dedicated GPUs often run at only 25–35% utilization; adding a queueing layer like Kueue plus MIG/time-slicing can push that to 60–85% (KubernetesGuru 2026 stack).
- Watch two latency numbers: time-to-first-token (TTFT) and time-per-output-token (TPOT). vLLM exposes both via Prometheus (KubernetesGuru).
- MLOps maturity is about automation, not tools. Google Cloud defines three levels: 0 (manual), 1 (pipeline automation / continuous training), 2 (full CI/CD) (Google Cloud MLOps).
- The skill core is boring and durable: Python, SQL, Linux, Docker, Kubernetes, one cloud, and the ability to reason about GPUs, latency, and cost.
What You Will Learn Here
- A simple mental model for the ML/AI infrastructure stack, layer by layer
- Why the industry pivoted from a training focus to an inference focus in 2026
- How to self-host an open-weight LLM on Kubernetes with vLLM and KServe (with real config)
- How an inference request actually flows through the stack (prefill, decode, KV cache)
- The GPU and cost metrics that decide whether your deployment is healthy or bleeding money
- A concrete skills map for engineers — and a short “what PMs need to know” lens
What “AI Infrastructure” Actually Means
Strip away the logos and AI infrastructure is just the layers that get a model from “weights on disk” to “answer in production,” reliably and affordably. A useful way to read it bottom-up:
+-------------------------------------------------------------+
| Application layer                                           |
|   apps, internal tools, agent frameworks                    |
+-----------------------------+-------------------------------+
                              |
+-----------------------------+-------------------------------+
| Inference / serving                                         |
|   vLLM (engine) wrapped by KServe; Triton for multi-model   |
+-----------------------------+-------------------------------+
                              |
+-----------------------------+-------------------------------+
| Orchestration / scheduling                                  |
|   Kubernetes + Kueue (GPU queueing) + Karpenter (nodes)     |
+-----------------------------+-------------------------------+
                              |
+-----------------------------+-------------------------------+
| GPU compute layer                                           |
|   NVIDIA GPU Operator: drivers, runtime, MIG, DCGM metrics  |
+-----------------------------+-------------------------------+
                              |
+-----------------------------+-------------------------------+
| Training / experimentation                                  |
|   PyTorch + Ray (KubeRay); MLflow tracking + model registry |
+-----------------------------+-------------------------------+
                              |
+-----------------------------+-------------------------------+
| Data / storage                                              |
|   object storage (S3/GCS), data pipelines, feature stores   |
+-------------------------------------------------------------+
Anyscale describes the common open-source recipe in three reusable roles: a training/inference framework (PyTorch, vLLM) that runs models fast on GPUs, a distributed compute engine (Ray) that schedules work and handles failures across a job, and a container orchestrator (Kubernetes) that allocates nodes and runs jobs (Anyscale). That separation is the mental model to keep: frameworks run the math, an engine coordinates a job, Kubernetes coordinates the cluster.
For PMs: think of it like a restaurant. The GPU is the stove, Kubernetes is the kitchen floor plan, the queueing layer is the expediter deciding which orders cook next, and vLLM/KServe is the line cook turning raw ingredients (tokens) into plated dishes (responses). When the bill spikes, it’s almost always the stove sitting idle, not the recipe.
The 2026 Shift: From Training Era to Inference Economy
For years, AI infrastructure was optimized for training: bigger models, more FLOPs, larger clusters. In 2026 that center of gravity moved. NVIDIA’s own inference page frames it bluntly — every token an agent produces is an inference call, and at scale that demand compounds, so the cost of AI shifts from building models to running them (NVIDIA AI inference).
The hardware claims reflect that priority. NVIDIA states the GB300 NVL72 (72 Blackwell Ultra GPUs in one rack, joined by a 130 TB/s NVLink fabric) delivers up to 50x higher throughput per megawatt and up to 35x lower cost per token than Hopper for low-latency agentic workloads, citing SemiAnalysis InferenceX benchmarks from Q1 2026 (NVIDIA AI inference).
The practical takeaway for builders is not “buy Blackwell.” It is the underlying lesson:
- Inference plays by different rules than training: latency and cost-per-token matter more than peak FLOPs.
- The infrastructure winners are the teams that serve intelligence efficiently, not the ones with the most GPUs.
- A big modern technique is disaggregated serving — splitting the compute-heavy prefill phase and the latency-sensitive decode phase across different GPUs so each can be optimized independently (NVIDIA Blackwell Ultra MLPerf).
That is why the rest of this article is mostly about serving.
A Hands-On Reference Stack: Self-Hosting LLM Inference on Kubernetes
AI/ML on Kubernetes crossed the “production default” line in 2025–2026, consolidating around a specific set of CNCF and vendor-backed tools (KubernetesGuru). Here is the minimum useful set for self-hosting an LLM, and what each piece does.
| Layer | Tool | Job |
|---|---|---|
| Inference engine | vLLM | Loads the model, batches requests, serves an OpenAI-compatible API |
| Serving framework | KServe | Wraps vLLM with autoscaling, canaries, routing, scale-to-zero |
| GPU scheduling | Kueue | Fair-share GPU queueing and quotas across teams |
| Node provisioning | Karpenter | Adds/removes GPU nodes (incl. spot) on demand |
| GPU management | NVIDIA GPU Operator | Drivers, container runtime, MIG, DCGM metrics |
| Observability | DCGM + Prometheus + Grafana | GPU + inference metrics and alerts |
Step 1: The engine — vLLM
vLLM (from UC Berkeley’s Sky Computing Lab) is the 2026 default LLM inference engine. Its two big ideas are PagedAttention (manages the KV cache like virtual memory, so GPU memory isn’t wasted) and continuous batching (new requests join the running batch instead of waiting for it to finish). It also ships an OpenAI-compatible HTTP API, so existing client code mostly just works (KubernetesGuru).
The simplest possible run, on a single GPU box, is one command:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --port 8000
Now it speaks the OpenAI dialect. Any OpenAI client can point at it by swapping the base URL:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-self-hosted")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain KV cache in one sentence."}],
)
print(resp.choices[0].message.content)
That is the payoff of the OpenAI-compatible API: your application code does not change, only where the tokens are produced.
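To make the continuous-batching idea from this step concrete, here is a toy simulation. It is a sketch under simplifying assumptions, not vLLM's scheduler: every request needs a fixed number of decode steps, one step produces one token for each running request, and the function names and request lengths are invented for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Decode steps when each batch must finish before the next one starts."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        batch = output_lengths[i:i + batch_size]
        steps += max(batch)  # the whole batch waits for its longest request
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = deque(output_lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Fill free slots from the queue before every decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token per running request; drop finished ones.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical workload: two long answers mixed with six short ones.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With this made-up workload the static version needs 200 decode steps and the continuous version 110, because short requests no longer hold a slot hostage until the longest one in their batch finishes.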
Step 2: The serving framework — KServe
A bare vLLM pod has no autoscaling, no canary rollouts, no multi-model routing, and no scale-to-zero. KServe (a CNCF-hosted project) adds those Kubernetes-native capabilities and can wrap vLLM, Triton, or custom containers (KubernetesGuru). In KServe v0.17+, generative workloads get a purpose-built CRD, LLMInferenceService, built on the llm-d framework for routing and distributed inference (KServe).
A minimal single-node deployment of Llama-3.1-8B looks like this:
apiVersion: serving.kserve.io/v1alpha1
kind: LLMInferenceService
metadata:
  name: llama-3-8b
  namespace: default
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama/Llama-3.1-8B-Instruct
  replicas: 3
  template:
    containers:
      - name: main
        image: vllm/vllm-openai:latest
        resources:
          limits:
            nvidia.com/gpu: "1"
            cpu: "8"
            memory: 32Gi
  router:
    gateway: {}    # managed Gateway API ingress
    route: {}      # managed HTTPRoute
    scheduler: {}  # managed load-balancing scheduler
Applying that one file creates three replica pods (1 GPU each), a Gateway for external traffic, an HTTPRoute, an intelligent scheduler for load balancing, and a storage initializer that downloads the model (KServe). For models too big for a single GPU or a single node (70B+ is where this usually starts), you add a parallelism block (tensor/data parallel) and a worker spec, which switches KServe to a LeaderWorkerSet for multi-node serving (KServe v0.17 release).
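A rough way to see when a single GPU stops being enough is to divide the weight size by usable GPU memory. The sketch below is a lower bound only. It assumes 16-bit weights (2 bytes per parameter), 80 GB GPUs, and the same 0.90 memory fraction used in the vllm serve command earlier, and it ignores KV cache and activations, so real deployments need more headroom. The function name is illustrative.

```python
import math


def min_gpus_for_weights(params_billion, bytes_per_param=2,
                         gpu_mem_gb=80, usable_fraction=0.90):
    """Lower bound on GPUs needed just to hold the weights.

    Ignores KV cache and activations, so treat the result as a floor.
    """
    weights_gb = params_billion * bytes_per_param  # 1e9 params * bytes = GB
    usable_gb = gpu_mem_gb * usable_fraction
    return math.ceil(weights_gb / usable_gb)


small = min_gpus_for_weights(8)    # 16 GB of 16-bit weights
large = min_gpus_for_weights(70)   # 140 GB of 16-bit weights
huge = min_gpus_for_weights(405)   # 810 GB of 16-bit weights
print(small, large, huge)
```

Under these assumptions an 8B model fits on one GPU, a 70B model needs at least two (tensor parallelism inside one node), and a 405B model needs at least twelve, which is more than a typical 8-GPU node holds and is where multi-node serving comes in. Tensor-parallel sizes are usually rounded up to a value that divides the model's attention heads evenly, so the practical number is often higher than this floor.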
Step 3: Don’t skip GPU scheduling — Kueue
Here is the failure mode nobody warns you about: without a queueing layer, one team’s giant job grabs all the GPUs and everyone else stalls (head-of-line blocking). Kueue (CNCF) adds fair-share scheduling, quotas, and gang scheduling for distributed jobs. The reported impact is large: GPU utilization rising from a 25–35% baseline to 60–85% (KubernetesGuru). On a GPU bill, that difference is the difference between a sustainable service and a runaway cost center.
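Why sharing lifts utilization can be shown with a toy model. This is not Kueue's admission logic; it only compares two extremes, fixed per-team GPU partitions versus one shared pool where idle capacity can be borrowed. The team names, demand figures, and function names are hypothetical.

```python
def dedicated_utilization(demand_by_team, gpus_per_team):
    """Each team owns gpus_per_team GPUs and cannot use anyone else's."""
    hours = len(next(iter(demand_by_team.values())))
    used = sum(min(d, gpus_per_team)
               for demand in demand_by_team.values()
               for d in demand)
    return used / (gpus_per_team * len(demand_by_team) * hours)


def pooled_utilization(demand_by_team, total_gpus):
    """All teams draw from one shared pool, so idle GPUs can be borrowed."""
    hours = len(next(iter(demand_by_team.values())))
    used = 0
    for hour in range(hours):
        total_demand = sum(demand[hour] for demand in demand_by_team.values())
        used += min(total_demand, total_gpus)
    return used / (total_gpus * hours)


# Hypothetical GPU demand per hour for two teams sharing 8 GPUs.
demand = {
    "team-a": [8, 0, 6, 1],
    "team-b": [0, 6, 2, 3],
}
dedicated = dedicated_utilization(demand, gpus_per_team=4)
pooled = pooled_utilization(demand, total_gpus=8)
print(dedicated, pooled)
```

With these made-up numbers the same eight GPUs go from about 56% to about 81% utilization once bursts from one team can use capacity the other team is not using. A real queueing layer adds quotas and fairness on top so that borrowing does not turn into one team starving the other.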
How it fits together
client request (OpenAI-format JSON)
        |
        v
  Gateway API ──► KServe scheduler (load balance / prefix-aware route)
        |
        v
  vLLM pod (replica)   [scheduled by Kueue onto a GPU slot]
        |
        +--► PagedAttention KV cache (GPU memory)
        +--► continuous batching (joins in-flight batch)
        |
        v
  GPU ── NVIDIA GPU Operator (driver, runtime, MIG slice) ──► DCGM metrics
        |
        v
  streamed tokens back to the client
The Inference Request Path
It helps to know what actually happens inside that vLLM pod, because it explains every metric and cost lever later.
PROMPT IN                                         TOKENS OUT
    |                                                  ^
    v                                                  |
[ Prefill ]  read the whole prompt, build KV cache     |
    |        compute-heavy, runs once per request      |
    v                                                  |
[ Decode ]   generate one token, append to KV cache ---+
    |        memory-bandwidth-heavy, loops per output token
    |        continuous batching interleaves many requests here
    v
stop (EOS or max tokens)
Two phases, two personalities:
- Prefill is compute-bound and happens once: it reads the prompt and fills the KV cache. This dominates time-to-first-token (TTFT).
- Decode is memory-bandwidth-bound and loops once per output token. This dominates time-per-output-token (TPOT).
This is exactly why “disaggregated serving” exists — splitting prefill and decode onto different GPUs lets each phase use the parallelism and hardware that suits it (NVIDIA Blackwell Ultra MLPerf). And it is why the KV cache is the most precious resource on the GPU: PagedAttention exists specifically to stop you wasting it.
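The arithmetic behind "most precious resource" is short. The sketch below assumes the published Llama-3.1-8B shape (32 layers, 8 key/value heads via grouped-query attention, head dimension 128) and 16-bit cache values. Other models and cache dtypes give different numbers, and the function name is illustrative.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies.

    Each layer stores one key vector and one value vector per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-3.1-8B shape with 16-bit cache values.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# One request that fills the 8192-token context configured earlier.
per_request = per_token * 8192

print(per_token // 1024, "KiB per token")
print(per_request / 2**30, "GiB per full-length request")
```

Under those assumptions each token costs 128 KiB of GPU memory and one full-length request costs 1 GiB, so the memory left after loading the weights directly caps how many long requests can be in flight at once.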
Observability: The Metrics That Matter
Standard pod CPU/memory dashboards are blind to GPU reality. You need two extra metric families (KubernetesGuru).
From DCGM Exporter (GPU health):
- DCGM_FI_DEV_GPU_UTIL - GPU utilization %
- DCGM_FI_DEV_FB_USED - GPU memory (frame buffer) used
- DCGM_FI_DEV_POWER_USAGE - power draw
- DCGM_FI_DEV_GPU_TEMP - temperature / throttling signal
From vLLM (inference health):
- vllm:time_to_first_token_seconds - TTFT (the prefill story)
- vllm:time_per_output_token_seconds - TPOT (the decode story)
- vllm:gpu_cache_usage_perc - KV cache pressure
- vllm:num_requests_running - live concurrency
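The two latency numbers can also be measured from the client side. The sketch below assumes you have recorded a timestamp when the request was sent and one per streamed token (for example with time.perf_counter() around a streaming client). The function name and the sample timestamps are hypothetical, and streamed chunks do not always map one-to-one to tokens.

```python
def ttft_and_tpot(request_sent_at, token_arrival_times):
    """Return (TTFT, TPOT) in seconds from client-side timestamps.

    TTFT is the wait for the first token; TPOT is the mean gap between
    the tokens that follow it.
    """
    if not token_arrival_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_arrival_times[0] - request_sent_at
    if len(token_arrival_times) == 1:
        return ttft, 0.0
    decode_time = token_arrival_times[-1] - token_arrival_times[0]
    tpot = decode_time / (len(token_arrival_times) - 1)
    return ttft, tpot


# Hypothetical timestamps in seconds: request sent at t=0, five tokens streamed back.
ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft, tpot)
```

Here the first token arrives after 0.5 s and later tokens arrive 0.25 s apart. Client-side numbers include network time, so they will usually read a little higher than the server-side histograms.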
The single highest-value alert is a P95 TTFT regression — it is usually the first sign that a model update or a traffic-shape change broke your latency budget:
- alert: HighTTFT
  expr: |
    histogram_quantile(0.95,
      sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "P95 time-to-first-token > 2s for 10m"
A healthy dashboard puts GPU utilization next to inference latency. If utilization is low and latency is high, you are paying for idle silicon while users wait — the worst of both worlds, and almost always a scheduling or batching problem rather than a model problem.
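It helps to know what a P95 computed from histogram buckets actually is: an estimate interpolated inside a bucket, not an exact percentile. The sketch below mirrors the linear interpolation that PromQL's histogram_quantile applies to cumulative buckets. It is a simplified re-implementation for illustration, the bucket counts are invented, and it skips edge cases the real function handles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The last bucket must have upper_bound == math.inf, as in Prometheus.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Cannot interpolate into +Inf; report the last finite bound.
                return prev_bound
            in_bucket = count - prev_count
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical TTFT histogram: (upper bound in seconds, cumulative request count).
ttft_buckets = [(0.5, 60), (1.0, 85), (2.0, 94), (4.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, ttft_buckets)
print(p95)
```

With these counts the 95th request falls in the 2 s to 4 s bucket and the estimate is about 2.4 s, which would trip a 2 s threshold. The estimate can only be as precise as the bucket boundaries, so a jump in P95 may simply mean requests crossed into the next bucket.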
Cost and GPU Economics
GPUs are the line item that dwarfs everything else, so a little arithmetic goes a long way. The mechanics that move the bill:
- Utilization is the budget. A dedicated H100 running at 30% utilization costs the same per hour as one running at 85%. Closing that gap with Kueue + MIG is often the biggest single cost win available (KubernetesGuru).
- MIG / time-slicing share one physical GPU. MIG partitions an A100/H100/H200 into up to 7 isolated instances; time-slicing shares one GPU across pods with less isolation but works on all generations (KubernetesGuru).
- Scale-to-zero saves money but costs latency. KServe can scale idle models to zero replicas, but the first request after that pays a 30–120s cold start while the model loads. For latency-critical paths, keep minReplicas > 0 (KubernetesGuru).
- Spot + Karpenter cuts node price for interruptible training and batch work.
- Newer hardware changes the cost-per-token curve, which is the metric that actually maps to a per-request bill at scale (NVIDIA AI inference).
A back-of-the-envelope model that makes the utilization point concrete:
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_sec, utilization):
    """USD per million tokens for one GPU at an average utilization between 0 and 1."""
    effective_tps = tokens_per_sec * utilization
    tokens_per_hour = effective_tps * 3600
    return gpu_hourly_usd / (tokens_per_hour / 1_000_000)

# Same GPU, same throughput ceiling; only utilization changes.
idle = cost_per_million_tokens(3.00, 2500, 0.30)   # ~$1.11 / M tokens
tuned = cost_per_million_tokens(3.00, 2500, 0.85)  # ~$0.39 / M tokens
These numbers are illustrative, not a quote — real cost-per-token depends on model, sequence length, batch size, quantization, and hardware. The point is the ratio: lifting utilization from 30% to 85% cut the effective unit cost by roughly 65% without touching the model. That is why scheduling and batching, not model swaps, are usually the first place to optimize.
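The same arithmetic can be turned around to ask the build-vs-buy question: at what utilization does a self-hosted GPU match a hosted API's price per million tokens? The sketch below reuses the illustrative $3.00/hour and 2,500 tokens/sec figures. The $0.60 API price is a made-up placeholder, the function name is hypothetical, and the model ignores engineering time, networking, and storage.

```python
def break_even_utilization(gpu_hourly_usd, tokens_per_sec, api_usd_per_million):
    """Utilization at which self-hosted cost per million tokens equals the API price.

    A result above 1.0 means the GPU cannot match that price at this throughput.
    """
    # Cost per million tokens if the GPU were busy 100% of the time.
    full_load_cost = gpu_hourly_usd / (tokens_per_sec * 3600 / 1_000_000)
    # cost(u) = full_load_cost / u, so cost(u) == api price when u = full_load_cost / price.
    return full_load_cost / api_usd_per_million


needed = break_even_utilization(3.00, 2500, api_usd_per_million=0.60)
print(round(needed, 2))
```

Under these assumptions the GPU has to average roughly 56% utilization before self-hosting beats the hypothetical API price, which is above the 25–35% baseline quoted earlier and is another reason scheduling comes before hardware purchases.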
For PMs: the three numbers worth tracking are cost per million tokens, P95 TTFT (does it feel instant?), and GPU utilization (are we wasting the budget?). Most “we need more GPUs” requests are really “we are using the GPUs we have at 30%.”
Where MLOps Fits
Serving is half the story; the other half is the lifecycle around the model. Google Cloud’s widely used framing measures MLOps by how automated the process is, in three levels (Google Cloud MLOps):
Level 0   Manual          notebooks -> hand-built model -> manual deploy
Level 1   Pipeline auto   automated pipeline retrains on new data (continuous training)
Level 2   CI/CD           build, test, deploy pipeline components automatically
Most teams start at Level 0 — capable data scientists, but a manual, script-driven path to production. Level 1 automates the training pipeline so models continuously retrain on fresh data with automated data/model validation. Level 2 adds a real CI/CD system so you can rapidly and reliably ship new pipeline components, not just new model weights (Google Cloud MLOps). The supporting cast — experiment tracking and a model registry (MLflow), pipeline orchestration (Airflow/Kubeflow), and GitOps (ArgoCD) — is how you climb those levels (KubernetesGuru).
A pragmatic note: if you are only serving open-weight models behind vLLM/KServe and not training your own, you may live happily between Level 0 and Level 1, with most of your rigor going into evaluation, canary rollouts, and observability rather than continuous training.
The Skills You Need
Tools churn; the foundations don’t. Here is the map by layer, ordered roughly from “learn first” to “learn when you need it.”
Foundations (non-negotiable)
- Python — the lingua franca of ML and data tooling (MLOps roadmap)
- SQL — you will always be moving and shaping data
- Linux, the shell, and Git — everything runs on these
- Basic ML literacy — what training, inference, tokens, and context windows are
Data layer
- Data pipelines / ETL thinking
- Distributed processing (Spark / PySpark) for datasets too big for one machine (Dataquest roadmap)
- Object storage (S3/GCS) and data/feature versioning
Containers and orchestration
- Docker — build small, reproducible images
- Kubernetes concepts — pods, deployments, services (you don’t need to run a cluster to be useful, but you must speak the language) (Dataquest roadmap)
- One cloud provider (AWS, GCP, or Azure) end to end
GPU and compute
- How GPU memory works, and why the KV cache dominates LLM serving
- GPU sharing: MIG vs time-slicing, and when each fits (KubernetesGuru)
- Distributed training basics: data, tensor, and pipeline parallelism (PyTorch DDP, etc.)
Training and inference
- PyTorch for model code
- Ray / KubeRay for distributed training and tuning (Ray docs)
- vLLM / KServe / Triton for serving, plus quantization and batching knobs
Operations
- MLOps: experiment tracking, model registry, CI/CD for ML (Google Cloud MLOps)
- Observability: Prometheus/Grafana, DCGM, OpenTelemetry, and SLOs
- Cost/FinOps: cost-per-token, utilization, Kubecost/OpenCost, spot strategy
- Security and governance: secrets, data residency, model inventory
The recurring advice across every roadmap: don’t try to learn every tool at once. Build one end-to-end project — ingest data, train or fine-tune a small model, deploy it behind an API, and monitor it — and prioritize understanding the system architecture, because the specific tools will keep changing (MLOps roadmap, Dataquest roadmap).
What PMs need (the lighter lens)
- The cost driver is GPU utilization and cost-per-token, not “the model is expensive.”
- Latency is two numbers: TTFT (responsiveness) and TPOT (streaming speed).
- Build vs buy hinges on cost at scale, latency control, and data residency/compliance.
- GPU scarcity and cold starts are real constraints that shape roadmaps and SLAs.
Where to Start
A fast, honest path that mirrors how production teams actually onboard:
- Run vllm serve on one GPU (or a rented cloud GPU) and call it with the OpenAI client. Feel TTFT and TPOT yourself.
- Containerize it, then deploy it with the KServe LLMInferenceService above on a managed cluster (EKS/GKE/AKS).
- Add the NVIDIA GPU Operator and a Grafana dashboard with DCGM + vLLM metrics. Watch utilization while you load-test.
- Add Kueue and a second workload; watch utilization climb and head-of-line blocking disappear.
- Only now reach for the advanced toys: MIG partitioning, scale-to-zero, canary rollouts, and (for 70B+) multi-node llm-d serving.
That order front-loads the parts that teach you the system and back-loads the parts that only matter at scale.
Gaps and What to Watch
To keep this a “quick look,” several real topics were compressed or left out — flagged here honestly:
- Training-side depth (data, tensor, pipeline parallelism; checkpointing; multi-node fault tolerance) gets one paragraph, not the chapter it deserves.
- Non-LLM serving (classic ML, vision, recommenders via Triton or the standard KServe InferenceService) is barely touched.
- Networking fabric (NVLink, InfiniBand, RDMA) is the hidden hero of multi-GPU performance and only appears as a footnote.
- Security and compliance (prompt-injection at the infra boundary, tenant isolation, audit trails) deserve their own treatment.
- Vendor numbers are vendor numbers. NVIDIA’s cost-per-token and throughput claims come from its own materials and cited benchmarks; validate against your workload before planning budgets (NVIDIA AI inference).
- The stack is young. llm-d and disaggregated serving are rapidly evolving in 2026; expect APIs and best practices to shift (KServe).
Good follow-up articles from here: a hands-on multi-node 70B deployment with llm-d; a managed-vs-self-hosted (SageMaker/Vertex vs Kubernetes) decision guide; and a deep dive on GPU cost attribution with Kubecost/OpenCost.
Sources
Inference and serving
- vLLM documentation
- KServe: Understanding LLMInferenceService
- KServe v0.17 release notes
- The 2026 AI/ML on Kubernetes Stack (vLLM, Kueue, KServe, llm-d, Ray)