INFRASTRUCTURE · RESEARCH

LLM Inference Engineering

A practical series on how large language model inference works in production: tokens, decode loops, caches, attention, batching, memory management, and serving economics.

4 modules · 14 chapters

Scope

Infrastructure Engineer

Audience

Book a complete training session

Contact to book complete training sessions including training slides, hands-on exercises, mini-projects, and capstone projects.

Contact to book →

Curriculum

Foundations

How tokens, the first forward pass, and step-by-step generation set up the rest of inference.

  • You Hit Enter
  • Tokens - The LLM's Alphabet
  • One Word at a Time
  • Prefill vs Decode

Memory, Attention, and the Bottleneck

KV cache behavior, attention mechanics, and the hardware bottleneck that shapes decode.

  • The KV Cache
  • Attention at Inference Time
  • The Memory Wall

Throughput, Batching, and Memory Management

Serving mechanics that raise utilization without breaking latency or memory limits.

  • Continuous Batching
  • PagedAttention
  • FlashAttention

Efficiency, Compression, and Economics

Faster decode, smaller caches, cheaper tokens, and the trade-offs that shape serving cost.

  • Speculative Decoding
  • Quantization
  • The Economics
  • Epilogue - The Full Journey

Sample Slides

Slide 1
Slide 2
Slide 3
Slide 4
Slide 5
Slide 6
Slide 7
Slide 8