LLM Inference
An opinionated and incomplete survey of LLM inference and serving runtimes, viewed through a systems and infrastructure lens.
- LLMs and Transformers: Introduction, embeddings, transformers and attention mechanisms
- Inference and the KV Cache: Inference execution and the KV cache
- Sharding a Model: Pipeline, tensor, and expert parallelism
- Batching, Scheduling, and Paging: Continuous batching, Orca, and PagedAttention
- I/O-Aware Kernels: FlashAttention and FlashInfer
- Speculative Decoding: Speculative decoding, EAGLE, Medusa trees, and multi-token prediction
- Prefill-Decode Scheduling and Disaggregation: Chunked prefill and prefill-decode disaggregation
- KV Cache Management and Offload: Prefix caching and KV offload
- Appendix: Overview of Training: Fine-tuning, RLHF, RLAIF, quantization, and alignment techniques
- Appendix: GPU Hardware: Architecture, CUDA and ROCm, kernels and Triton, memory hierarchy
- Appendix: Inference Runtimes: LLM serving stacks, TensorRT, Triton, vLLM, and SGLang