On April 11, 2026, developer HJCheng0602 released nanoPD, a from-scratch Python inference engine implementing Prefill/Decode disaggregation for large language models. The educational project has accumulated 143 GitHub stars and serves as a reference implementation for understanding a technique that, as of 2026, is supported across every major open-source inference framework.
Prefill/Decode Disaggregation Separates LLM Inference Stages
Prefill/Decode disaggregation separates two key stages of LLM inference into different hardware resource pools. The prefill stage processes the entire input sequence in parallel, storing key-value vectors from attention layers in a KV cache—a compute-intensive operation that benefits from parallel processing. The decode stage generates tokens one by one using the cached KV states, a memory-intensive sequential operation.
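The split between the two stages can be illustrated with a minimal sketch (illustrative only, not nanoPD's actual API): attention is reduced to caching a per-token placeholder, where a real engine would store per-layer key/value tensors, but the control flow mirrors the description above: prefill processes the whole prompt at once, decode reads the cache and appends one entry per generated token.

```python
# Conceptual sketch of prefill vs. decode (hypothetical names; real engines
# store per-layer key/value tensors rather than strings).

def prefill(prompt_tokens):
    """Process the entire prompt in parallel, filling the KV cache."""
    return [f"kv({t})" for t in prompt_tokens]  # one cache entry per token

def decode_step(kv_cache, next_token):
    """Generate one token: attend over the full cache, then append to it."""
    context_len = len(kv_cache)           # decode reads the whole cached context
    kv_cache.append(f"kv({next_token})")  # new token's KV joins the cache
    return f"token_{context_len}"         # placeholder for the sampled token

cache = prefill(["The", "quick", "brown"])
generated = [decode_step(cache, t) for t in ["fox", "jumps"]]
print(generated)   # ['token_3', 'token_4']
print(len(cache))  # 5: three prompt entries plus two decode entries
```

The asymmetry is visible even in this toy: `prefill` touches all prompt tokens in one pass (compute-bound in practice), while each `decode_step` depends on the previous one and rereads the entire cache (memory-bound in practice).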
By separating these stages onto different GPUs, disaggregated serving can tailor hardware and parallelism to each stage, improving overall resource utilization. Together AI published research in March 2026 showing cache-aware prefill-decode disaggregation delivering up to 40% higher sustainable throughput for long-context inference.
Custom CUDA Kernels and Adaptive Routing
nanoPD implements several advanced features:
- Custom paged KV cache similar to vLLM's PagedAttention
- Chunked prefill that breaks down long sequences for better resource utilization
- CUDA paged attention kernel for efficient attention computation with paged memory
- Multi-GPU KV transfer between prefill and decode stages
- Adaptive router that dynamically chooses between collocated and disaggregated execution based on workload characteristics
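The paged KV cache in the list above can be sketched as a small block allocator in the spirit of vLLM's PagedAttention (the class, block size, and pool size here are illustrative assumptions, not nanoPD's actual values): logical token positions within a sequence are mapped to slots in fixed-size physical blocks drawn from a shared pool.

```python
# Minimal paged KV cache allocator sketch (hypothetical; inspired by vLLM's
# PagedAttention). Blocks hold a fixed number of tokens' KV entries.

BLOCK_SIZE = 4  # tokens per physical block (illustrative)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # shared physical pool
        self.block_tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id, token_index):
        """Map a sequence's next token to a physical (block, slot) pair,
        grabbing a fresh block from the pool on each block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if token_index % BLOCK_SIZE == 0:           # first slot of a new block
            table.append(self.free_blocks.pop())
        block = table[token_index // BLOCK_SIZE]
        slot = token_index % BLOCK_SIZE
        return block, slot                          # where this token's KV lives

cache = PagedKVCache(num_blocks=8)
positions = [cache.append_token("seq0", i) for i in range(6)]
print(positions)  # six tokens packed into two physical blocks
```

Because blocks are allocated on demand rather than reserved contiguously per sequence, the pool suffers no external fragmentation, which is the core benefit paging brings to KV memory.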
The project supports three operational modes: collocated (prefill and decode on same GPU), disaggregated (separate GPUs), and adaptive (dynamic switching).
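A routing decision between these modes might look like the following sketch; the thresholds and workload signals are hypothetical assumptions for illustration, not nanoPD's actual policy.

```python
# Hypothetical mode-selection heuristic for the three modes described above.
# Thresholds and inputs are illustrative, not nanoPD's actual policy.

def choose_mode(prompt_len: int, pending_decodes: int) -> str:
    """Pick a serving mode from simple workload signals."""
    if prompt_len > 2048 and pending_decodes > 8:
        # Long prefills would stall many in-flight decodes: split the stages.
        return "disaggregated"
    if prompt_len <= 256:
        # Short prompts barely interfere with decoding: share one GPU.
        return "collocated"
    # In between, let the engine switch dynamically per request.
    return "adaptive"

print(choose_mode(4096, 16))  # disaggregated
print(choose_mode(128, 2))    # collocated
```

The trade-off the heuristic encodes is the one described above: disaggregation pays a KV-transfer cost between GPUs, so it is only worthwhile when long prefills would otherwise block latency-sensitive decode steps.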
Educational Value in Production-Ready Landscape
As of early 2026, disaggregation is supported across every major open-source inference framework, with Meta, LinkedIn, Mistral, and Hugging Face among the organizations running disaggregated serving in production. Although production systems typically rely on more heavily optimized implementations, nanoPD's from-scratch Python codebase provides an accessible reference for understanding the technique's internals.
Key Takeaways
- nanoPD is a from-scratch Python implementation of Prefill/Decode disaggregation for LLMs, released on April 11, 2026, with 143 GitHub stars
- The engine separates compute-intensive prefill (parallel input processing) from memory-intensive decode (sequential token generation) onto different GPUs
- Custom features include paged KV cache, chunked prefill, CUDA paged attention kernels, multi-GPU KV transfer, and an adaptive router
- The project supports three modes: collocated (traditional), disaggregated (separate GPUs), and adaptive (dynamic switching based on workload)
- While disaggregation is now production-ready across major frameworks (Together AI reported up to 40% throughput gains for long-context inference in March 2026), nanoPD serves as an educational reference implementation