On April 11, 2026, developer HJCheng0602 released nanoPD, a from-scratch Python inference engine implementing Prefill/Decode disaggregation for large language models. The educational project has accumulated 143 GitHub stars and serves as a reference implementation for understanding a technique that, as of 2026, is supported across every major open-source inference framework.
Prefill/Decode Disaggregation Separates LLM Inference Stages
Prefill/Decode disaggregation separates two key stages of LLM inference into different hardware resource pools. The prefill stage processes the entire input sequence in parallel, storing key-value vectors from attention layers in a KV cache—a compute-intensive operation that benefits from parallel processing. The decode stage generates tokens one by one using the cached KV states, a memory-intensive sequential operation.
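The split between the two stages can be illustrated with a minimal sketch (illustrative only, not nanoPD's actual API): attention is reduced to caching a per-token placeholder, where a real engine would store per-layer key/value tensors, but the control flow mirrors the description above: prefill processes the whole prompt at once, decode reads the cache and appends one entry per generated token.

```python
# Conceptual sketch of prefill vs. decode (hypothetical names; real engines
# store per-layer key/value tensors rather than strings).

def prefill(prompt_tokens):
    """Process the entire prompt in parallel, filling the KV cache."""
    return [f"kv({t})" for t in prompt_tokens]  # one cache entry per token

def decode_step(kv_cache, next_token):
    """Generate one token: attend over the full cache, then append to it."""
    context_len = len(kv_cache)           # decode reads the whole cached context
    kv_cache.append(f"kv({next_token})")  # new token's KV joins the cache
    return f"token_{context_len}"         # placeholder for the sampled token

cache = prefill(["The", "quick", "brown"])
generated = [decode_step(cache, t) for t in ["fox", "jumps"]]
print(generated)   # ['token_3', 'token_4']
print(len(cache))  # 5: three prompt entries plus two decode entries
```

The asymmetry is visible even in this toy: `prefill` touches all prompt tokens in one pass (compute-bound in practice), while each `decode_step` depends on the previous one and rereads the entire cache (memory-bound in practice).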
By separating these stages onto different GPUs, disaggregated serving can tailor hardware and parallelism to each stage, improving overall resource utilization. Together AI published research in March 2026 showing cache-aware prefill-decode disaggregation delivering up to 40% higher sustainable throughput for long-context inference.
Custom CUDA Kernels and Adaptive Routing
nanoPD implements several advanced features:
- Custom paged KV cache similar to vLLM's PagedAttention
- Chunked prefill that breaks down long sequences for better resource utilization
- CUDA paged attention kernel for efficient attention computation with paged memory
- Multi-GPU KV transfer between prefill and decode stages
- Adaptive router that dynamically chooses between collocated and disaggregated execution based on workload characteristics
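The paged KV cache in the list above can be sketched as a small block allocator in the spirit of vLLM's PagedAttention (the class, block size, and pool size here are illustrative assumptions, not nanoPD's actual values): logical token positions within a sequence are mapped to slots in fixed-size physical blocks drawn from a shared pool.

```python
# Minimal paged KV cache allocator sketch (hypothetical; inspired by vLLM's
# PagedAttention). Blocks hold a fixed number of tokens' KV entries.

BLOCK_SIZE = 4  # tokens per physical block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id, token_index):
        """Map a sequence's next token to a physical (block, slot) pair,
        grabbing a fresh block from the pool on each block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if token_index % BLOCK_SIZE == 0:           # first slot of a new block
            table.append(self.free_blocks.pop())
        block = table[token_index // BLOCK_SIZE]
        slot = token_index % BLOCK_SIZE
        return block, slot                          # where this token's KV lives

cache = PagedKVCache(num_blocks=8)
positions = [cache.append_token("seq0", i) for i in range(6)]
print(positions)  # six tokens packed into two physical blocks
```

Because blocks are allocated on demand rather than reserved contiguously per sequence, the pool suffers no external fragmentation, which is the core benefit paging brings to KV memory.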
The project supports three operational modes: collocated (prefill and decode on same GPU), disaggregated (separate GPUs), and adaptive (dynamic switching).
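A routing decision between these modes might look like the following sketch; the thresholds and workload signals are hypothetical assumptions for illustration, not nanoPD's actual policy.

```python
# Hypothetical mode-selection heuristic for the three modes described above.
# Thresholds and inputs are illustrative, not nanoPD's actual policy.

def choose_mode(prompt_len: int, pending_decodes: int) -> str:
    """Pick a serving mode from simple workload signals."""
    if prompt_len > 2048 and pending_decodes > 8:
        # Long prefills would stall many in-flight decodes: split the stages.
        return "disaggregated"
    if prompt_len <= 256:
        # Short prompts barely interfere with decoding: share one GPU.
        return "collocated"
    # In between, let the engine switch dynamically per request.
    return "adaptive"

print(choose_mode(4096, 16))  # disaggregated
print(choose_mode(128, 2))    # collocated
```

The trade-off the heuristic encodes is the one described above: disaggregation pays a KV-transfer cost between GPUs, so it is only worthwhile when long prefills would otherwise block latency-sensitive decode steps.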
Educational Value in Production-Ready Landscape
As of early 2026, disaggregation is supported across every major open-source inference framework, with Meta, LinkedIn, Mistral, and Hugging Face among the organizations running disaggregated serving in production. Although production systems typically rely on more heavily optimized implementations, nanoPD's from-scratch Python codebase provides an accessible reference for understanding the technique's internals.
Key Takeaways
- nanoPD is a from-scratch Python implementation of Prefill/Decode disaggregation for LLMs, released on April 11, 2026, with 143 GitHub stars
- The engine separates compute-intensive prefill (parallel input processing) from memory-intensive decode (sequential token generation) onto different GPUs
- Custom features include paged KV cache, chunked prefill, CUDA paged attention kernels, multi-GPU KV transfer, and an adaptive router
- The project supports three modes: collocated (traditional), disaggregated (separate GPUs), and adaptive (dynamic switching based on workload)
- While disaggregation is now production-ready across major frameworks (Together AI reported up to 40% throughput gains for long-context inference in March 2026), nanoPD serves as an educational reference implementation