Chapter 7: Deployment & Serving

Production LLM infrastructure: API vs self-hosting decisions, inference fundamentals (TTFT/TPOT), performance engineering, cost modeling, orchestration patterns, and a three-horizon migration roadmap.

Deployment & Serving (from prototype to production LLM infra)

The mental model

Serving an LLM is a cost-and-latency business disguised as an ML problem. Your job is to maximize tokens/sec/$ while meeting TTFT and TPOT SLOs and reliability targets.


1) The first strategic decision: API vs Self-hosting

CTO-grade trade-off table (the only one you need)

| Dimension | API (OpenAI/Anthropic/Bedrock-like) | Self-host (open weights) |
|---|---|---|
| Unit economics | predictable OpEx, linear scaling | lower variable cost at scale, but you pay for idle GPUs |
| Latency control | black box + network | full control: batching, kernels, routing |
| Customization | limited | deep: fine-tuning + serving optimizations |
| Privacy/security | data leaves your boundary (policy dependent) | keep data in VPC / controlled env |
| Ops complexity | low | high: GPUs, autoscaling, rollouts, monitoring |
| Speed to market | fastest | slower initial setup |

Heuristics

  • Start with API to validate value fast.
  • Consider self-hosting when (a) data boundary demands it, (b) you need strict latency/throughput control, or (c) you’re at meaningful scale and can keep GPUs highly utilized.

[Figure: Decision path, API vs self-hosting]


2) Inference fundamentals you must internalize

Two phases, two bottlenecks

  1. Prefill (prompt processing): parallel, typically compute-bound
  2. Decode (token generation): sequential, typically memory-bandwidth-bound

Metrics that matter (production)

  • TTFT (time to first token): does the app feel responsive?
  • TPOT (time per output token): streaming speed
  • Total latency ≈ TTFT + TPOT × (output tokens - 1), since the first token is already covered by TTFT
  • Throughput: tokens/sec across all requests (primary efficiency metric)

Heuristic: Your product may be decode-heavy (chat: short prompts, long outputs) or prefill-heavy (summarization: long prompts, short outputs). Optimize accordingly.
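
To make the latency formula concrete, here is a minimal sketch (all numbers are hypothetical, for illustration, not benchmarks) that turns TTFT/TPOT into total latency and a per-request streaming rate:

```python
# Minimal latency math from the formula above.
# All numbers are hypothetical, for illustration only.

def total_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Total latency ~= TTFT + TPOT * (output_tokens - 1).

    The first token's time is already inside TTFT, so per-token decode
    cost applies only to the remaining tokens.
    """
    return ttft_s + tpot_s * max(output_tokens - 1, 0)

# Example: a chat-style response (decode-heavy).
ttft = 0.4    # 400 ms to first token (queueing + prefill)
tpot = 0.03   # 30 ms per subsequent token (decode)
n_out = 500   # tokens generated

print(f"total latency: {total_latency_s(ttft, tpot, n_out):.2f}s")  # ~15.37s
print(f"streaming rate: {1 / tpot:.0f} tok/s per request")          # ~33 tok/s
```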

[Figure: Request lifecycle]


3) Performance engineering: what actually moves the needle

Primary bottleneck: memory bandwidth (for most real-time serving)

Track:

  • MBU (Model Bandwidth Utilization): how close you are to saturating VRAM bandwidth
  • MFU (Model FLOPs Utilization): more relevant for compute-heavy prefill/large batches

Heuristic: If MBU is low, you likely have a software/serving inefficiency (batching, KV cache, paging), not a “need a bigger GPU” problem.
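
As a back-of-envelope, MBU can be estimated under the usual approximation that each decoded token streams roughly the full set of weights from VRAM (KV-cache traffic ignored for brevity; all numbers hypothetical):

```python
# Rough MBU estimate for decode on a single model replica.
# Assumption: every generated token reads ~all model weights from VRAM;
# real accounting also includes KV-cache reads/writes.

def mbu(model_bytes: float, tokens_per_s: float, peak_bw_bytes_s: float) -> float:
    achieved_bw = model_bytes * tokens_per_s   # bytes/s actually moved
    return achieved_bw / peak_bw_bytes_s

model_bytes = 14e9       # hypothetical 7B model in FP16 (~14 GB)
peak_bw = 2e12           # hypothetical GPU with ~2 TB/s peak bandwidth
observed_tok_s = 60      # measured decode rate

print(f"MBU ~ {mbu(model_bytes, observed_tok_s, peak_bw):.0%}")  # ~42%
```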

The “serving optimization stack” (in order)

  1. Continuous batching (biggest throughput win under concurrency)
  2. KV-cache efficiency (PagedAttention-like approaches)
  3. Quantization (INT8/INT4) to reduce memory traffic
  4. FlashAttention / optimized kernels (long-context wins)
  5. Speculative decoding (latency wins; watch throughput trade-off)
  6. Tensor/Pipeline parallelism (when model doesn’t fit on one GPU)

Heuristic: Most teams get ~70% of the available wins from (1) continuous batching, (2) KV-cache efficiency, and (3) quantization.
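
To illustrate item (1), a minimal vLLM sketch: submit many prompts in one call and let the engine's continuous batching schedule them together on the GPU. The model name and parameters are placeholders, and the exact API may differ across vLLM versions:

```python
# Minimal offline vLLM sketch: the engine interleaves decode steps for
# many requests (continuous batching) instead of serving them serially.
# Model name / parameters are placeholders; check your vLLM version.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.90,               # leave headroom for KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [f"Summarize ticket #{i} in one sentence." for i in range(64)]

for output in llm.generate(prompts, params):   # one call, many requests
    print(output.outputs[0].text[:80])
```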


4) Cost modeling: avoid “success = bankruptcy”

API cost reality

  • Costs scale linearly with usage (great for MVP, scary at scale)
  • Output tokens often cost far more than input tokens
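
To make the linear scaling concrete, a minimal cost sketch with separate input/output token prices (the rates are hypothetical; check your provider's current pricing):

```python
# API cost per request with separate input/output prices.
# Rates are hypothetical placeholders, not any provider's actual pricing.

def api_cost_usd(in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    return in_tokens / 1e6 * in_price_per_m + out_tokens / 1e6 * out_price_per_m

cost = api_cost_usd(in_tokens=2_000, out_tokens=500,
                    in_price_per_m=3.00, out_price_per_m=15.00)
print(f"${cost:.4f} per request")  # $0.0135
# At 1M requests/month that is ~$13,500/month, and it scales linearly.
```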

Self-hosting cost reality

  • GPU is a fixed hourly cost → utilization determines unit economics
  • You must budget VRAM for weights + KV cache (concurrency)

Heuristic: Self-hosting only “wins” financially when you can keep GPUs busy and your stack is optimized for high tokens/sec.
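
Both bullets in one sketch: a KV-cache VRAM budget and a utilization-driven cost per million tokens (the model config, GPU rate, and throughput are all hypothetical):

```python
# (a) VRAM budget: weights + KV cache must both fit.
# Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per
    return per_token * seq_len * batch / 1e9

# Hypothetical 7B-class config (32 layers, 32 KV heads, head_dim 128), FP16:
print(f"KV cache: {kv_cache_gb(32, 32, 128, 4096, 16):.1f} GB")  # ~34 GB

# (b) Break-even: self-hosting wins only if $/token beats the API price.
gpu_cost_per_hr = 4.00   # hypothetical on-demand GPU rate
cluster_tok_s = 2_500    # aggregate batched throughput
utilization = 0.60       # fraction of each hour doing useful work

usd_per_m = gpu_cost_per_hr / (cluster_tok_s * 3600 * utilization) * 1e6
print(f"self-host: ${usd_per_m:.2f} per 1M tokens")  # ~$0.74
```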


5) The modern self-hosted serving stack

Choose an inference engine (core performance layer)

| Engine | Strength | Best for |
|---|---|---|
| vLLM | throughput + KV-cache efficiency (PagedAttention) | high-concurrency serving |
| HF TGI | production robustness + prefix caching + guided output | RAG/tool apps, long prompts |
| TensorRT-LLM (+ Triton) | maximum performance on NVIDIA via compilation/kernel fusion | ultra-low latency / enterprise perf |
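
Whichever engine you pick, most expose an OpenAI-compatible HTTP endpoint (vLLM and TGI both do), so a thin client can double as a latency probe. A minimal sketch; the URL and served-model name are placeholders, and stream chunks only approximate tokens:

```python
# Probe TTFT/TPOT against a self-hosted OpenAI-compatible endpoint.
# URL and model name are placeholders; chunks approximate tokens.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
first_token_at, n_chunks = None, 0

stream = client.chat.completions.create(
    model="served-model",  # placeholder served-model name
    messages=[{"role": "user", "content": "Explain KV caching briefly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        n_chunks += 1

ttft = first_token_at - start
tpot = (time.perf_counter() - first_token_at) / max(n_chunks - 1, 1)
print(f"TTFT={ttft:.3f}s  TPOT={tpot:.4f}s")
```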

Choose an orchestration layer

  • Kubernetes + KServe: best if you already run K8s (autoscaling, canary, model mgmt)
  • Ray Serve: Python-native “model composition” scaling (great for multi-model pipelines)
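
A minimal Ray Serve sketch of model composition: a router deployment in front of two model deployments. The routing rule and the echo "models" are toy stand-ins:

```python
# Toy Ray Serve composition: Router fans out to two deployments.
# The models are stubs; swap in real engines behind the same shape.
from ray import serve

@serve.deployment
class SmallModel:
    async def __call__(self, prompt: str) -> str:
        return f"[small] echo: {prompt}"

@serve.deployment
class LargeModel:
    async def __call__(self, prompt: str) -> str:
        return f"[large] echo: {prompt}"

@serve.deployment
class Router:
    def __init__(self, small, large):
        self.small, self.large = small, large

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        # Toy rule: short prompts go to the small model.
        handle = self.small if len(prompt) < 200 else self.large
        return await handle.remote(prompt)

app = Router.bind(SmallModel.bind(), LargeModel.bind())
serve.run(app)  # HTTP on localhost:8000 by default
```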

[Figure: Serving stack layers]


6) The “cold start” problem (why autoscaling is hard)

Cold starts can take minutes because:

  • provisioning instances
  • pulling images/weights
  • loading weights into GPU memory

Mitigations

  • keep a warm pool of standby replicas
  • optimize image/weight fetching (streaming / caching / P2P)
  • use memory-mapped weight formats where applicable

Heuristic: For interactive apps, treat scale-from-zero as an exception path, not the default.
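
One mitigation sketched in code: safetensors files open via memory mapping, so load cost shifts from one up-front bulk read toward paging tensors in as they are materialized (the path is a placeholder):

```python
# Sketch: memory-mapped weight access with safetensors.
# safe_open maps the file instead of reading it all eagerly, so you
# can time and control when each tensor is actually materialized.
import time
from safetensors import safe_open

t0 = time.perf_counter()
tensors = {}
with safe_open("/models/model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        tensors[name] = f.get_tensor(name)  # pages this tensor in
print(f"loaded {len(tensors)} tensors in {time.perf_counter() - t0:.1f}s")
```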


7) Production rollout strategy (how leaders avoid outages)

Minimum viable reliability controls

  • strict timeouts (TTFT and total)
  • bounded retries (with backoff)
  • circuit breaker on provider/cluster failures
  • graceful fallback (smaller model / API / cached answer)
  • load shedding (reject early vs melt down)
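
A minimal sketch wiring these controls around a generic model call; the flaky stub, thresholds, and fallback are placeholders to tune against your SLOs:

```python
# Timeouts, bounded retries with backoff, a crude circuit breaker,
# and graceful fallback. All thresholds are illustrative.
import random
import time

def call_model(prompt: str, timeout: float) -> str:
    """Placeholder for the real call (API or self-hosted endpoint)."""
    if random.random() < 0.3:               # simulate a flaky upstream
        raise TimeoutError("upstream timed out")
    return f"answer to: {prompt}"

def fallback_answer(prompt: str) -> str:
    """Placeholder: smaller model, other provider, or cached reply."""
    return "degraded answer"

_fails, _open_until = 0, 0.0

def guarded_call(prompt: str, retries: int = 2, timeout_s: float = 10.0) -> str:
    global _fails, _open_until
    if time.time() < _open_until:           # breaker open: shed early
        return fallback_answer(prompt)
    for attempt in range(retries + 1):      # bounded retries
        try:
            reply = call_model(prompt, timeout=timeout_s)
            _fails = 0
            return reply
        except Exception:
            _fails += 1
            if _fails >= 5:                 # trip the circuit breaker
                _open_until = time.time() + 30.0
                break
            time.sleep(0.5 * 2 ** attempt + random.random() * 0.1)  # backoff + jitter
    return fallback_answer(prompt)          # graceful degradation

print(guarded_call("What changed in the last deploy?"))
```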

Rollouts

  • Canary: 1–5% traffic, watch TTFT/TPOT/error/cost
  • Shadow: mirror traffic to new stack, don’t serve results
  • Blue/Green: switch over when stable
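
A toy canary split by stable hash of a user id, so each user consistently lands on one stack while you watch the canary's TTFT/TPOT/error/cost:

```python
# Stable 5% canary split: hash the user id into 100 buckets.
import hashlib

CANARY_PERCENT = 5  # matches the 1-5% guidance above

def route(user_id: str) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < CANARY_PERCENT else "stable"

counts = {"canary": 0, "stable": 0}
for i in range(10_000):
    counts[route(f"user-{i}")] += 1
print(counts)  # roughly 5% / 95%
```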

8) Three-horizon strategy (the sane roadmap)

Horizon 1 (Now): MVP

  • use APIs/managed models
  • instrument TTFT/TPOT + cost per successful task
  • build eval flywheel (Chapter 6)

Horizon 2 (Next 6–12 months)

  • add routing cascades (cheap vs strong)
  • start benchmarking self-hosted baselines
  • harden observability and perf testing

Horizon 3 (12+ months)

  • self-host if total cost of ownership (TCO) and a durable moat justify it
  • invest in deep serving optimization + custom model strategy

[Figure: Migration roadmap, three horizons]


9) AWS-first mapping (default, framework-agnostic)

Fast path (managed-ish):

  • Bedrock / SageMaker endpoints for inference
  • API Gateway + WAF at edge
  • CloudWatch + X-Ray + LangSmith (if LangChain) for tracing
  • Auto scaling + alarms on TTFT/TPOT
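
A minimal fast-path sketch: call a Bedrock model through the Converse API and push the measured latency to CloudWatch, where the alarms above would live (model id, region, and namespace are placeholders):

```python
# Invoke Bedrock via Converse and emit a latency metric to CloudWatch.
# Model id / region / namespace are placeholders for your setup.
import time
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

t0 = time.perf_counter()
resp = bedrock.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model id
    messages=[{"role": "user", "content": [{"text": "Give a one-line status."}]}],
)
latency_ms = (time.perf_counter() - t0) * 1000

print(resp["output"]["message"]["content"][0]["text"])
cloudwatch.put_metric_data(
    Namespace="llm/serving",  # placeholder namespace
    MetricData=[{"MetricName": "TotalLatencyMs",
                 "Value": latency_ms, "Unit": "Milliseconds"}],
)
```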

Self-host path:

  • EKS (KServe) or ECS on GPU instances
  • inference engine pods/services (vLLM/TGI/TRT-LLM)
  • ECR for images, S3/EFS/FSx for weights (with caching strategy)
  • Prometheus/Grafana + CloudWatch integration (or equivalent)

“If you remember 7 things”

  1. Decide API vs self-host using privacy + latency control + utilization economics.
  2. Optimize separately for TTFT (prefill) and TPOT (decode).
  3. Throughput wins come from continuous batching + KV-cache efficiency.
  4. Quantization is often the highest-ROI “GPU unlock”.
  5. Cold start is real—design around it.
  6. Reliability is timeouts + fallbacks + rollouts, not “hope”.
  7. Serving is a business lever: tokens/sec/$ is your north star.