Chapter 9: Inference Optimization
Comprehensive guide to optimizing LLM inference covering latency (TTFT/TPOT), throughput, cost reduction, KV cache fundamentals, quantization, attention optimizations, and service-level optimization strategies.
Inference Optimization (Latency + Throughput + Cost)
The mental model
Inference optimization is a 3-layer game: Model × Service × Hardware. And for autoregressive LLMs you’re optimizing two different systems:
- Prefill (TTFT): compute-bound (parallelizable)
- Decode (TPOT): memory-bandwidth-bound (sequential)
If you don’t separate these, you’ll “optimize” the wrong thing.
1) What to optimize: the metrics that matter to CTOs
The 4 metrics that decide success
- TTFT (Time to First Token) → perceived responsiveness
- TPOT / ITL (Time per Output Token) → streaming speed
- Throughput (Tokens/sec) → cost driver
- Goodput = requests/sec that meet SLOs (prevents “throughput wins” that ruin UX)
Heuristic: Optimize for goodput, not raw TPS. That’s how you avoid “it’s fast but unusable.”
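A minimal sketch of how to compute these from per-request timing data (field names and SLO thresholds are illustrative, not a standard API):

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    t_submit: float        # when the request was sent
    t_first_token: float   # when the first token arrived
    t_done: float          # when the last token arrived
    output_tokens: int     # number of generated tokens

def ttft(r: RequestTrace) -> float:
    return r.t_first_token - r.t_submit

def tpot(r: RequestTrace) -> float:
    # time per output token, excluding the first token (which is prefill)
    decode_tokens = max(r.output_tokens - 1, 1)
    return (r.t_done - r.t_first_token) / decode_tokens

def goodput(traces: list[RequestTrace], window_s: float,
            ttft_slo: float = 1.0, tpot_slo: float = 0.05) -> float:
    # requests per second that meet BOTH latency SLOs (thresholds are examples)
    ok = sum(1 for r in traces if ttft(r) <= ttft_slo and tpot(r) <= tpot_slo)
    return ok / window_s
```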
2) Foundational bottleneck: the KV cache (the silent GPU killer)
Why KV cache dominates
- The KV cache avoids recomputing keys/values for past tokens during decoding, but it grows with:
- batch size
- sequence length
- layers × hidden size
- precision
KV cache size (bytes):
2 * batch_size * seq_len * num_layers * hidden_size * sizeof(dtype)
(the factor 2 accounts for storing both K and V)
Heuristic: Most “why is my 70B slow/expensive?” problems are really KV cache pressure problems.
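To make the formula concrete, a back-of-envelope calculator; the 7B-class shape (32 layers, hidden size 4096) and FP16 assumption are illustrative:

```python
def kv_cache_bytes(batch_size: int, seq_len: int, num_layers: int,
                   hidden_size: int, dtype_bytes: int = 2) -> int:
    # 2 = one K tensor and one V tensor per layer; dtype_bytes=2 for FP16/BF16
    return 2 * batch_size * seq_len * num_layers * hidden_size * dtype_bytes

# Example: a 7B-class model (32 layers, hidden size 4096), FP16,
# batch size 32, 4k-token sequences
gb = kv_cache_bytes(batch_size=32, seq_len=4096, num_layers=32,
                    hidden_size=4096) / 1e9
print(f"{gb:.1f} GB of KV cache")  # ~68.7 GB, before counting the weights
```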
3) The optimization ladder (highest ROI first)
Start here (most teams get 70% of wins)
- Use a modern inference server (vLLM/TGI/TRT-LLM)
- Continuous batching (keep GPU busy under concurrency)
- KV cache efficiency (PagedAttention-like memory management)
- Quantization (INT8/INT4) to reduce bandwidth pressure
Then escalate to:
- Fused attention kernels (FlashAttention), especially for long context
- Prompt/prefix caching (reduce TTFT for repeated prefixes)
- Speculative decoding (lower TPOT/latency; adds complexity)
- Parallelism (tensor/pipeline) if the model doesn't fit on one GPU
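A minimal sketch of the "Start here" stack using vLLM's offline Python API (continuous batching and PagedAttention are on by default there; argument names can shift between vLLM versions, and the model name is a placeholder):

```python
from vllm import LLM, SamplingParams

# Continuous batching + PagedAttention come for free with the engine;
# quantization and prefix caching are opt-in.
llm = LLM(
    model="your-org/your-model-awq",   # placeholder; must be an AWQ checkpoint here
    quantization="awq",                # INT4 weights -> less bandwidth pressure
    gpu_memory_utilization=0.90,       # most of this headroom goes to KV cache
    enable_prefix_caching=True,        # reuse KV for shared prompt prefixes
)

params = SamplingParams(max_tokens=256, temperature=0.2)
outputs = llm.generate(["Summarize our refund policy in one sentence."], params)
print(outputs[0].outputs[0].text)
```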
4) Map symptoms → fixes (the table you’ll actually use)
| Symptom | Likely root cause | Primary fixes | Trade-off |
|---|---|---|---|
| High TTFT (slow to start) | long prompt prefill | Prompt caching, reduce prompt length, FlashAttention | cache memory + infra complexity |
| High TPOT (slow streaming) | decode bandwidth bound | Quantization, speculative decoding, MQA/GQA models | quality risk + complexity |
| High GPU bill / low TPS | poor utilization | Continuous batching, PagedAttention, raise batch size carefully | latency vs throughput tradeoff |
| Model doesn’t fit | weights + KV exceed VRAM | quantize, tensor parallelism | comm overhead, engineering time |
| RAG slow + costly | long context & repeated prefixes | prompt caching + compression + batching | caching correctness + invalidation effort |
Heuristic: If TTFT is bad, don’t waste time on decode tricks. If TPOT is bad, caching won’t save you.
5) Model-level optimizations (change the model)
Quantization (low-hanging fruit)
- PTQ (post-training quantization): fast, no retraining (risk: accuracy loss with naïve PTQ on LLMs)
- QAT (quantization-aware training): more accurate but requires training
- Transformer quantization is tricky due to activation outliers; common mitigations:
  - SmoothQuant (shift difficulty from activations to weights)
  - GPTQ (layer-wise PTQ using second-order info)
  - Mixed-precision approaches (keep outliers in FP16)
CTO heuristic: Quantize first, then re-run your app eval suite (Chapter 6). If business metrics don’t degrade, it’s free money.
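A hedged sketch of "quantize, then re-run evals" using Hugging Face transformers with bitsandbytes 4-bit loading (post-training, no retraining); the model id is a placeholder, and run_eval_suite / BASELINE_SCORE are stand-ins for your Chapter 6 harness:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "your-org/your-model"  # placeholder

# PTQ at load time: 4-bit NF4 weights, FP16 compute (a common mixed-precision setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# The non-negotiable step: re-run the application eval suite (Chapter 6).
# run_eval_suite and BASELINE_SCORE are hypothetical stand-ins for your own harness.
score = run_eval_suite(model, tokenizer)
assert score >= 0.98 * BASELINE_SCORE, "quantization regressed business metrics"
```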
Pruning & sparsity
- Unstructured sparsity often doesn’t speed up without specialized kernels.
- Structured / N:M sparsity can be hardware-accelerated (e.g., 2:4) but increases complexity.
Distillation
- Train a smaller model to mimic a teacher’s outputs (“dark knowledge”).
- High ROI when you need to serve huge volume at low cost.
6) Attention/KV architecture optimizations (reduce cache pain)
MHA vs MQA vs GQA
- MQA: shared K/V across heads → smallest KV cache, fastest decode (possible quality drop)
- GQA: compromise: groups share K/V (near-MHA quality, smaller cache)
- Many large models adopt GQA (e.g., Llama 2 70B)
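Why this shrinks the cache: K and V are stored per KV head, so the cache scales with num_kv_heads rather than num_heads. A quick back-of-envelope using a Llama-2-70B-like shape (80 layers, head dim 128, 64 query heads vs 8 KV heads); batch and sequence length are illustrative:

```python
def kv_cache_gb(batch, seq_len, layers, num_kv_heads, head_dim, dtype_bytes=2):
    # per-token K/V is num_kv_heads * head_dim (not the full hidden size)
    return 2 * batch * seq_len * layers * num_kv_heads * head_dim * dtype_bytes / 1e9

# Llama-2-70B-like shape: 80 layers, head_dim 128
mha = kv_cache_gb(batch=8, seq_len=4096, layers=80, num_kv_heads=64, head_dim=128)
gqa = kv_cache_gb(batch=8, seq_len=4096, layers=80, num_kv_heads=8, head_dim=128)
print(f"MHA: {mha:.0f} GB vs GQA: {gqa:.0f} GB")  # ~8x smaller KV cache with GQA
```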
PagedAttention (vLLM-style KV memory paging)
- KV cache stored in fixed-size blocks → reduces fragmentation
- Enables:
  - much larger effective batch sizes
  - prefix sharing (copy-on-write)
Heuristic: If you serve high concurrency, PagedAttention + continuous batching is the “default modern stack.”
FlashAttention (fused kernel)
- Minimizes HBM reads/writes by fusing attention operations and using SRAM tiling.
- Big wins for long context.
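In practice the serving framework enables this for you, but as a hedged PyTorch sketch: torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel on supported GPUs (backend selection details vary by PyTorch version; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in FP16 on GPU -- the shapes fused kernels like
q = torch.randn(1, 32, 8192, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 8192, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 8192, 128, device="cuda", dtype=torch.float16)

# PyTorch picks a fused kernel (FlashAttention / memory-efficient) when eligible,
# instead of materializing the full 8192x8192 attention matrix in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 32, 8192, 128])
```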
7) Break the autoregressive bottleneck (advanced)
Speculative decoding
- Draft model proposes k tokens; target model verifies in parallel.
- Speedup depends on acceptance rate.
When to use: you’ve already implemented batching + KV optimizations, and TPOT is still limiting UX.
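A minimal greedy sketch of the draft-then-verify loop (real implementations use a probabilistic accept/reject rule, KV cache reuse, and sampling; draft_model and target_model are assumed to be HF-style causal LMs):

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids, k=4):
    """One speculative decoding step: draft k tokens, verify with one target pass."""
    # 1) Draft model proposes k tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_tok = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_tok], dim=-1)

    # 2) Target model scores all k proposals in ONE forward pass (parallel verify).
    target_logits = target_model(draft_ids).logits
    n_prompt = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        target_tok = target_logits[:, n_prompt + i - 1, :].argmax(dim=-1, keepdim=True)
        proposed = draft_ids[:, n_prompt + i].unsqueeze(-1)
        accepted = torch.cat([accepted, target_tok], dim=-1)
        if not torch.equal(target_tok, proposed):
            break  # first mismatch: keep the target's token, discard the rest
    return accepted
```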
Parallel decoding (e.g., Medusa-like)
- Multiple decoding heads predict future tokens; verify via tree-style selection.
- Higher complexity; emerging but powerful.
8) Service-level optimization (how you schedule + serve)
Continuous batching (state-of-the-art)
- Keeps the batch full by swapping in new requests as old ones finish.
- Delivers major throughput gains under real traffic.
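A toy scheduler loop to show the idea; engine.add_request and engine.step are hypothetical, and real servers (vLLM, TGI) also handle KV-block allocation, preemption, and prefill/decode interleaving:

```python
from collections import deque

def continuous_batching_loop(engine, request_queue: deque, max_batch: int = 32):
    """Toy illustration: refill the running batch on every decode step."""
    running = []
    while running or request_queue:
        # Admit new requests the moment slots free up -- no waiting for the
        # whole batch to finish (that would be static batching).
        while request_queue and len(running) < max_batch:
            running.append(engine.add_request(request_queue.popleft()))

        finished = engine.step(running)   # one decode step for every running request
        running = [r for r in running if r not in finished]
```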
Parallelism (when you must scale across GPUs)
- Tensor parallelism: split matrices within a layer (good latency; needs fast interconnect)
- Pipeline parallelism: split layers across GPUs (fits big models; more latency/bubbles)
- Sequence parallelism: partition along sequence dim to reduce activation memory
Prompt/prefix caching
- Cache KV for shared prefixes (system prompts, long instructions, repeated contexts).
- Reduces TTFT and cost.
Heuristic: In enterprise assistants, prompt caching is often the single biggest TTFT win.
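Prefix caching only pays off when shared content is a literal prefix, so keep stable text (system prompt, policies, few-shot examples) first and per-request content last. A tiny sketch with illustrative strings:

```python
# Static, shared across all requests -> its KV can be computed once and reused
STATIC_PREFIX = (
    "You are the ACME support assistant.\n"
    "Follow the refund policy below.\n"
    "<policy>...</policy>\n"
)

def build_prompt(user_question: str, user_context: str) -> str:
    # Per-request content goes LAST so it never invalidates the cached prefix
    return f"{STATIC_PREFIX}\nContext: {user_context}\nUser: {user_question}\nAssistant:"
```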
9) A practical decision flowchart (use this in architecture reviews)
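Condensed into code form for readability, a rough mapping of the symptom table and optimization ladder above (a sketch to structure the conversation, not a substitute for measuring):

```python
def next_optimization(symptom: str, model_fits_on_one_gpu: bool = True) -> list[str]:
    """Rough decision flow condensed from the symptom -> fix table above."""
    if not model_fits_on_one_gpu:
        return ["quantize", "tensor parallelism (accept comm overhead)"]
    if symptom == "high_ttft":
        return ["prompt/prefix caching", "shorten prompts", "FlashAttention"]
    if symptom == "high_tpot":
        return ["quantization", "MQA/GQA model", "speculative decoding"]
    if symptom == "low_throughput_high_cost":
        return ["continuous batching", "PagedAttention", "raise batch size (watch latency)"]
    return ["measure TTFT/TPOT/goodput first -- don't guess"]
```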
10) AWS-first “default stack” for optimized serving (practical)
MVP (fastest to ship)
- Managed model API (e.g., Bedrock/SageMaker endpoint)
- Focus on routing + caching + context reduction
- Track TTFT/TPOT + cost per successful task
Self-hosted (performance-first)
- EKS or ECS GPU + vLLM (or TGI / TRT-LLM)
- Prometheus/Grafana + CloudWatch for infra metrics
- Explicit SLOs: TTFT, TPOT, goodput, error budget
- Canary rollouts + fallbacks
Heuristic: Self-hosting only wins if you can keep GPUs highly utilized and you’ve done the above optimizations.
“If you remember 8 things”
- Prefill ≠ decode. Measure TTFT and TPOT separately.
- KV cache drives memory pressure and cost.
- Optimize for goodput, not raw TPS.
- Biggest wins: continuous batching + KV paging + quantization.
- FlashAttention helps long contexts; caching helps repeated prefixes.
- Speculative decoding is an “advanced lever” after the basics.
- Parallelism is a tax—pay it only when the model doesn’t fit.
- Every optimization must be validated with your app eval suite (quality regressions are common).