TokenSpeed, a high-performance LLM inference engine designed specifically for agentic workloads, was released by LightSeek Foundation on May 6, 2026, after just two months of development starting mid-March. The project gained 945 GitHub stars within a week, demonstrating strong community interest in its performance claims and open-source approach.
TokenSpeed Achieves 9-11% Performance Gains Over TensorRT-LLM
On NVIDIA B200 GPUs running Kimi K2.5 models, TokenSpeed outperforms TensorRT-LLM by approximately 9% in minimum latency and by 11% in throughput at an operating point of 100 tokens per second per user. The TokenSpeed MLA (Multi-head Latent Attention) kernel nearly halves decode latency relative to TensorRT-LLM on speculative decoding workloads, and the 9-11% gains hold across various coding-agent scenarios, particularly in configurations that prioritize per-user latency targets of 70+ tokens per second (a quick budget calculation after the results list makes this target concrete).
Benchmark results:
- 9% improvement in minimum latency on NVIDIA B200
- 11% improvement in throughput at 100 tokens per second per user
- Nearly 50% reduction in decode latency for speculative decoding workloads
- Consistent performance gains across coding-agent scenarios
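To make the 70 tokens-per-second target concrete, here is a quick back-of-the-envelope calculation; the arithmetic is ours, not a figure from the TokenSpeed benchmarks:

```python
# A per-user target of 70 tokens/s leaves at most 1/70 s ≈ 14.3 ms of
# end-to-end time per decoded token, so halving decode-kernel latency
# frees a large share of that budget (our arithmetic, for illustration).
per_user_tps = 70
budget_ms = 1000 / per_user_tps
print(f"per-token budget at {per_user_tps} tok/s: {budget_ms:.1f} ms")  # ~14.3 ms
```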
Five-Pillar Technical Architecture Optimizes for Agentic AI
TokenSpeed's architecture is built on five core design pillars:
- Compiler-backed modeling mechanism: Uses a local SPMD (single-program, multiple-data) design with automatic generation of collective operations for distributed inference
- High-performance scheduler: Separates the control and execution planes, with a C++ finite-state machine managing resource allocation and KV cache state
- Safe KV resource reuse: Enforces correct cache management through compile-time verification rather than runtime conventions (a toy Python sketch of this lifecycle appears after this list)
- Pluggable kernel system: Supports heterogeneous accelerators through a modular, plugin-based architecture, including optimized MLA kernels for NVIDIA Blackwell (a plugin-registry sketch also appears after this list)
- SMG integration: Provides a low-overhead CPU-side request entrypoint
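To illustrate the scheduler and KV-reuse pillars, here is a minimal, hypothetical sketch of a request lifecycle state machine in Python. TokenSpeed's actual scheduler is a C++ finite-state machine and its KV verification happens at compile time; the names and transitions below are our illustrative assumptions, not TokenSpeed's API:

```python
from enum import Enum, auto

class RequestState(Enum):
    QUEUED = auto()
    PREFILLING = auto()
    DECODING = auto()
    FINISHED = auto()

# Legal lifecycle transitions; anything else is rejected before it can
# corrupt KV-cache bookkeeping (hypothetical, for illustration only).
_TRANSITIONS = {
    RequestState.QUEUED: {RequestState.PREFILLING},
    RequestState.PREFILLING: {RequestState.DECODING, RequestState.FINISHED},
    RequestState.DECODING: {RequestState.DECODING, RequestState.FINISHED},
    RequestState.FINISHED: set(),
}

class Request:
    def __init__(self, request_id: str) -> None:
        self.request_id = request_id
        self.state = RequestState.QUEUED
        self.kv_blocks: list[int] = []  # block IDs held in the KV cache

    def transition(self, new_state: RequestState) -> None:
        if new_state not in _TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state
        if new_state is RequestState.FINISHED:
            self.kv_blocks.clear()  # blocks become reusable only here
```

The point of structuring it this way is that KV blocks can only be released on one well-defined transition, so safe reuse is enforced by the state machine itself rather than by callers remembering a convention.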
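Similarly, the pluggable kernel pillar can be pictured as a registry that maps an operation and a backend to an implementation. Again, this is a minimal sketch under our own assumptions; the names (KernelRegistry, mla_decode, "blackwell") are hypothetical, not TokenSpeed's actual interfaces:

```python
from typing import Callable

class KernelRegistry:
    """Maps (op, backend) pairs to kernel implementations."""

    def __init__(self) -> None:
        self._kernels: dict[tuple[str, str], Callable] = {}

    def register(self, op: str, backend: str):
        # Decorator factory: registers the decorated function as the
        # implementation of `op` on `backend`.
        def decorator(fn: Callable) -> Callable:
            self._kernels[(op, backend)] = fn
            return fn
        return decorator

    def dispatch(self, op: str, backend: str) -> Callable:
        try:
            return self._kernels[(op, backend)]
        except KeyError:
            raise NotImplementedError(f"no {op!r} kernel for backend {backend!r}")

registry = KernelRegistry()

@registry.register(op="mla_decode", backend="blackwell")
def mla_decode_blackwell(q, kv_latent):
    # Placeholder body standing in for an optimized MLA decode kernel.
    return None

# A new accelerator only needs to register its own implementations;
# callers resolve kernels by (op, backend) at dispatch time.
kernel = registry.dispatch("mla_decode", "blackwell")
```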
vLLM Adopts TokenSpeed as Day-0 Launch Partner
The vLLM project announced TokenSpeed as an exclusive day-0 launch partner and has already adopted the TokenSpeed MLA kernel. Released under the MIT license, TokenSpeed combines TensorRT-LLM-level performance with vLLM-level usability, making state-of-the-art inference performance accessible to the open-source community.
A lean, mission-driven team built the engine in just two months, and both the engine and its kernels remain under active development. Production hardening is planned over the next month as the project moves toward stability.
Rapid Development Timeline Demonstrates Technical Excellence
The rapid two-month development timeline from mid-March to May 6, 2026, combined with immediate vLLM adoption and 945 GitHub stars in the first week, demonstrates both technical excellence and strong product-market fit for agentic AI workloads. TokenSpeed's focus on per-user latency requirements and agentic scenarios differentiates it from general-purpose inference engines, addressing a growing need as AI applications shift toward autonomous agent architectures.
Key Takeaways
- TokenSpeed outperforms TensorRT-LLM by 9% in minimum latency and 11% in throughput on NVIDIA B200 for agentic workloads
- The project gained 945 GitHub stars within one week of release on May 6, 2026
- Development took just two months from mid-March 2026, demonstrating rapid execution by a lean team
- vLLM adopted TokenSpeed as an exclusive day-0 launch partner, with the MLA kernel already integrated
- Released under the MIT license, combining TensorRT-LLM-level performance with vLLM-level usability for the open-source community