LLM Inference
An opinionated and incomplete survey of LLM inference and serving runtimes, viewed through a systems and infrastructure lens.
- LLMs and Transformers: Introduction, embeddings, transformers and attention mechanisms
- Inference and the KV Cache: Inference execution and the KV cache
- Sharding a Model: Pipeline, tensor, and expert parallelism
- Batching, Scheduling, and Paging: Continuous batching, Orca, and PagedAttention
- I/O-Aware Kernels: FlashAttention and FlashInfer
- Speculative Decoding: Speculative decoding, EAGLE, Medusa trees, and multi-token prediction
- Prefill-Decode Scheduling and Disaggregation: Chunked prefill and prefill-decode disaggregation
- KV Cache Management and Offload: Prefix caching and KV offload
- Appendix: Overview of Training: Fine-tuning, RLHF, RLAIF, quantization, and alignment techniques
- Appendix: GPU Hardware: Architecture, CUDA and ROCm, kernels and Triton, memory hierarchy
- Appendix: Inference Runtimes: LLM serving stacks, TensorRT, Triton, vLLM, and SGLang