LightSeek Foundation has released TokenSpeed, an open-source LLM inference engine that matches TensorRT-LLM performance while maintaining vLLM-level usability. Developed in just two months starting mid-March 2026, the MIT-licensed project demonstrates significant performance gains on NVIDIA B200 GPUs, with its Multi-head Latent Attention (MLA) kernel nearly halving decode latency compared to TensorRT-LLM on speculative decoding workloads.
Performance Benchmarks Show Double-Digit Improvements
Testing on NVIDIA B200 hardware revealed that TokenSpeed outperforms TensorRT-LLM by approximately 9% in minimum latency and 11% in throughput at 100 tokens per second per user on Kimi K2.5. The project's MLA kernel has already been adopted by vLLM, demonstrating its value to the broader inference ecosystem. TokenSpeed is specifically optimized for agentic workloads, a growing use case as AI systems become more autonomous.
Five-Pillar Architecture Enables High Performance
TokenSpeed's architecture is built around five core design principles:
- Compiler-backed modeling mechanism for parallelism optimization
- High-performance scheduler with a C++ control plane and a Python execution plane, using FSM-based request lifecycle management (see the scheduler sketch after this list)
- Restrictions on KV resource reuse that keep sharing safe and prevent memory conflicts
- Pluggable layered kernel system supporting heterogeneous accelerators (see the registry sketch after this list)
- SMG integration for low-overhead CPU-side request entrypoint
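To make the FSM-based request lifecycle idea concrete, the following is a minimal Python sketch of a scheduler that drives each request through explicit states. It illustrates the general pattern only: the state names, the `Request` and `Scheduler` classes, and the transition table are assumptions rather than TokenSpeed's actual API, and TokenSpeed's real control plane is written in C++, with Python used here purely for readability.

```python
# Hypothetical sketch of FSM-based request lifecycle management.
# All names and transitions are illustrative assumptions, not TokenSpeed's API.
from dataclasses import dataclass, field
from enum import Enum, auto


class RequestState(Enum):
    """Illustrative lifecycle states for an inference request."""
    QUEUED = auto()      # accepted by the entrypoint, waiting to be scheduled
    PREFILLING = auto()  # prompt tokens being processed
    DECODING = auto()    # autoregressive token generation
    FINISHED = auto()    # completed or cancelled; resources reclaimable


# Allowed transitions; anything outside this table is a scheduling bug.
_TRANSITIONS = {
    RequestState.QUEUED: {RequestState.PREFILLING, RequestState.FINISHED},
    RequestState.PREFILLING: {RequestState.DECODING, RequestState.FINISHED},
    RequestState.DECODING: {RequestState.DECODING, RequestState.FINISHED},
    RequestState.FINISHED: set(),
}


@dataclass
class Request:
    request_id: str
    prompt_tokens: list[int]
    state: RequestState = RequestState.QUEUED
    generated: list[int] = field(default_factory=list)

    def transition(self, new_state: RequestState) -> None:
        """Enforce the FSM: reject any transition the state machine forbids."""
        if new_state not in _TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state


class Scheduler:
    """Toy control-plane loop: advance every live request's FSM each step."""

    def __init__(self) -> None:
        self.requests: dict[str, Request] = {}

    def submit(self, req: Request) -> None:
        self.requests[req.request_id] = req

    def step(self) -> list[Request]:
        """Return the batch for this iteration after advancing each request."""
        batch = []
        for req in self.requests.values():
            if req.state is RequestState.QUEUED:
                req.transition(RequestState.PREFILLING)
                batch.append(req)
            elif req.state in (RequestState.PREFILLING, RequestState.DECODING):
                req.transition(RequestState.DECODING)
                batch.append(req)
        return batch
```

The value of the explicit state machine is that every lifecycle change is validated in one place, so scheduling bugs surface as rejected transitions rather than silent KV-cache corruption.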
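The pluggable layered kernel system can be pictured as an ordered registry of backends behind a common interface, where the most specialized available kernel is chosen and a portable fallback always exists. The sketch below is a hypothetical illustration of that pattern in Python with PyTorch; the `register` and `select_attention` names and the backend ordering are assumptions, not TokenSpeed's actual kernel API.

```python
# Hypothetical sketch of a pluggable, layered kernel registry.
# Backend names and fallback order are illustrative assumptions.
from typing import Callable, Protocol

import torch


class AttentionKernel(Protocol):
    def __call__(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor: ...


# Ordered layers: try the most specialized backend first, fall back otherwise.
_REGISTRY: list[tuple[str, Callable[[], bool], AttentionKernel]] = []


def register(name: str, is_available: Callable[[], bool]):
    """Register a kernel under a backend name together with an availability probe."""
    def wrap(fn: AttentionKernel) -> AttentionKernel:
        _REGISTRY.append((name, is_available, fn))
        return fn
    return wrap


@register("cuda-fused", lambda: torch.cuda.is_available())
def fused_attention(q, k, v):
    # Stand-in for a hand-tuned GPU kernel.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)


@register("reference", lambda: True)
def reference_attention(q, k, v):
    # Pure-PyTorch fallback that runs on any accelerator or on CPU.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return scores.softmax(dim=-1) @ v


def select_attention() -> AttentionKernel:
    """Pick the first registered kernel whose backend is available."""
    for name, is_available, fn in _REGISTRY:
        if is_available():
            return fn
    raise RuntimeError("no attention backend available")
```

A caller would resolve `select_attention()` once at engine start-up and dispatch through the returned kernel on the hot path, which is how a layered registry can support heterogeneous accelerators without branching inside the model code.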
The project was developed in collaboration with NVIDIA DevTech, AMD Triton, Qwen Inference, Together AI, Mooncake, LongCat, FluentLLM, EvalScope, and NVIDIA Dynamo.
Production Hardening Underway
The GitHub repository has accumulated 837 stars since its release. The project is currently under heavy development, with production hardening planned over the next month. As an MIT-licensed project, TokenSpeed provides an accessible alternative to proprietary inference solutions while delivering competitive performance for demanding AI workloads.
Key Takeaways
- TokenSpeed outperforms TensorRT-LLM by ~9% in min-latency and ~11% in throughput on NVIDIA B200 GPUs
- The MLA kernel nearly halves decode latency versus TensorRT-LLM on speculative decoding workloads
- Developed in just two months starting mid-March 2026 by LightSeek Foundation under MIT license
- The MLA kernel has already been adopted by vLLM, demonstrating cross-project value
- Production hardening is planned over the next month, with the project currently under heavy development