LightSeek Foundation has released TokenSpeed, an open-source LLM inference engine that matches TensorRT-LLM performance while maintaining vLLM-level usability. Developed in just two months starting mid-March 2026, the MIT-licensed project demonstrates significant performance gains on NVIDIA B200 GPUs, with its Multi-head Latent Attention (MLA) kernel nearly halving decode latency compared to TensorRT-LLM on speculative decoding workloads.
Performance Benchmarks Show Double-Digit Improvements
Testing on NVIDIA B200 hardware revealed that TokenSpeed outperforms TensorRT-LLM by approximately 9% in minimum latency and 11% in throughput at 100 tokens per second per user on Kimi K2.5. The project's MLA kernel has already been adopted by vLLM, demonstrating its value to the broader inference ecosystem. TokenSpeed is specifically optimized for agentic workloads, a growing use case as AI systems become more autonomous.
Five-Pillar Architecture Enables High Performance
TokenSpeed's architecture is built around five core design principles:
- Compiler-backed modeling mechanism for parallelism optimization
- High-performance scheduler with a C++ control plane and a Python execution plane, using FSM-based request lifecycle management (see the scheduler sketch after this list)
- Restrictions on KV resource reuse that keep sharing safe and prevent memory conflicts
- Pluggable layered kernel system supporting heterogeneous accelerators (see the registry sketch after this list)
- SMG integration for low-overhead CPU-side request entrypoint
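To make the FSM-based request lifecycle idea concrete, the following is a minimal Python sketch of a scheduler that drives each request through explicit states. It illustrates the general pattern only: the state names, the `Request` and `Scheduler` classes, and the transition table are assumptions rather than TokenSpeed's actual API, and TokenSpeed's real control plane is written in C++, with Python used here purely for readability.

```python
# Hypothetical sketch of FSM-based request lifecycle management.
# All names and transitions are illustrative assumptions, not TokenSpeed's API.
from dataclasses import dataclass, field
from enum import Enum, auto


class RequestState(Enum):
    """Illustrative lifecycle states for an inference request."""
    QUEUED = auto()      # accepted by the entrypoint, waiting to be scheduled
    PREFILLING = auto()  # prompt tokens being processed
    DECODING = auto()    # autoregressive token generation
    FINISHED = auto()    # completed or cancelled; resources reclaimable


# Allowed transitions; anything outside this table is a scheduling bug.
_TRANSITIONS = {
    RequestState.QUEUED: {RequestState.PREFILLING, RequestState.FINISHED},
    RequestState.PREFILLING: {RequestState.DECODING, RequestState.FINISHED},
    RequestState.DECODING: {RequestState.DECODING, RequestState.FINISHED},
    RequestState.FINISHED: set(),
}


@dataclass
class Request:
    request_id: str
    prompt_tokens: list[int]
    state: RequestState = RequestState.QUEUED
    generated: list[int] = field(default_factory=list)

    def transition(self, new_state: RequestState) -> None:
        """Enforce the FSM: reject any transition the state machine forbids."""
        if new_state not in _TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state


class Scheduler:
    """Toy control-plane loop: advance every live request's FSM each step."""

    def __init__(self) -> None:
        self.requests: dict[str, Request] = {}

    def submit(self, req: Request) -> None:
        self.requests[req.request_id] = req

    def step(self) -> list[Request]:
        """Return the batch for this iteration after advancing each request."""
        batch = []
        for req in self.requests.values():
            if req.state is RequestState.QUEUED:
                req.transition(RequestState.PREFILLING)
                batch.append(req)
            elif req.state in (RequestState.PREFILLING, RequestState.DECODING):
                req.transition(RequestState.DECODING)
                batch.append(req)
        return batch
```

The value of the explicit state machine is that every lifecycle change is validated in one place, so scheduling bugs surface as rejected transitions rather than silent KV-cache corruption.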
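The pluggable layered kernel system can be pictured as an ordered registry of backends behind a common interface, where the most specialized available kernel is chosen and a portable fallback always exists. The sketch below is a hypothetical illustration of that pattern in Python with PyTorch; the `register` and `select_attention` names and the backend ordering are assumptions, not TokenSpeed's actual kernel API.

```python
# Hypothetical sketch of a pluggable, layered kernel registry.
# Backend names and fallback order are illustrative assumptions.
from typing import Callable, Protocol

import torch


class AttentionKernel(Protocol):
    def __call__(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor: ...


# Ordered layers: try the most specialized backend first, fall back otherwise.
_REGISTRY: list[tuple[str, Callable[[], bool], AttentionKernel]] = []


def register(name: str, is_available: Callable[[], bool]):
    """Register a kernel under a backend name together with an availability probe."""
    def wrap(fn: AttentionKernel) -> AttentionKernel:
        _REGISTRY.append((name, is_available, fn))
        return fn
    return wrap


@register("cuda-fused", lambda: torch.cuda.is_available())
def fused_attention(q, k, v):
    # Stand-in for a hand-tuned GPU kernel.
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)


@register("reference", lambda: True)
def reference_attention(q, k, v):
    # Pure-PyTorch fallback that runs on any accelerator or on CPU.
    scores = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return scores.softmax(dim=-1) @ v


def select_attention() -> AttentionKernel:
    """Pick the first registered kernel whose backend is available."""
    for name, is_available, fn in _REGISTRY:
        if is_available():
            return fn
    raise RuntimeError("no attention backend available")
```

A caller would resolve `select_attention()` once at engine start-up and dispatch through the returned kernel on the hot path, which is how a layered registry can support heterogeneous accelerators without branching inside the model code.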
The project was developed in collaboration with NVIDIA DevTech, AMD Triton, Qwen Inference, Together AI, Mooncake, LongCat, FluentLLM, EvalScope, and NVIDIA Dynamo.
Production Hardening Underway
The GitHub repository has accumulated 837 stars since its release. The project is currently under heavy development, with production hardening planned over the next month. As an MIT-licensed project, TokenSpeed provides an accessible alternative to proprietary inference solutions while delivering competitive performance for demanding AI workloads.
Key Takeaways
- TokenSpeed outperforms TensorRT-LLM by ~9% in min-latency and ~11% in throughput on NVIDIA B200 GPUs
- The MLA kernel nearly halves decode latency versus TensorRT-LLM on speculative decoding workloads
- Developed in just two months starting mid-March 2026 by LightSeek Foundation under MIT license
- The MLA kernel has already been adopted by vLLM, demonstrating cross-project value
- Production hardening is planned over the next month, with the project currently under heavy development