TokenSpeed, a high-performance LLM inference engine designed specifically for agentic workloads, was released by LightSeek Foundation on May 6, 2026, after just two months of development starting mid-March. The project gained 945 GitHub stars within a week, demonstrating strong community interest in its performance claims and open-source approach.
TokenSpeed Achieves 9-11% Performance Gains Over TensorRT-LLM
On NVIDIA B200 GPUs running Kimi K2.5 models, TokenSpeed outperforms TensorRT-LLM by approximately 9% in minimum latency and by 11% in throughput at an operating point of 100 tokens per second per user. The TokenSpeed MLA (Multi-head Latent Attention) kernel nearly halves decode latency relative to TensorRT-LLM on speculative decoding workloads, and the 9-11% gains hold across various coding-agent scenarios, particularly in configurations that prioritize per-user latency targets of 70+ tokens per second (a quick budget calculation after the results list makes this target concrete).
Benchmark results:
- 9% improvement in minimum latency on NVIDIA B200
- 11% improvement in throughput at 100 tokens per second per user
- Nearly 50% reduction in decode latency for speculative decoding workloads
- Consistent performance gains across coding-agent scenarios
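To make the 70 tokens-per-second target concrete, here is a quick back-of-the-envelope calculation; the arithmetic is ours, not a figure from the TokenSpeed benchmarks:

```python
# A per-user target of 70 tokens/s leaves at most 1/70 s ≈ 14.3 ms of
# end-to-end time per decoded token, so halving decode-kernel latency
# frees a large share of that budget (our arithmetic, for illustration).
per_user_tps = 70
budget_ms = 1000 / per_user_tps
print(f"per-token budget at {per_user_tps} tok/s: {budget_ms:.1f} ms")  # ~14.3 ms
```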
Five-Pillar Technical Architecture Optimizes for Agentic AI
TokenSpeed's architecture is built on five core design pillars:
- Compiler-backed modeling mechanism: Uses a local SPMD (single-program, multiple-data) design with automatic generation of collective operations for distributed inference
- High-performance scheduler: Separates the control and execution planes, with a C++ finite-state machine managing resource allocation and KV cache state
- Safe KV resource reuse: Enforces correct cache management through compile-time verification rather than runtime conventions (a toy Python sketch of this lifecycle appears after this list)
- Pluggable kernel system: Supports heterogeneous accelerators through a modular, plugin-based architecture, including optimized MLA kernels for NVIDIA Blackwell (a plugin-registry sketch also appears after this list)
- SMG integration: Provides a low-overhead CPU-side request entrypoint
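To illustrate the scheduler and KV-reuse pillars, here is a minimal, hypothetical sketch of a request lifecycle state machine in Python. TokenSpeed's actual scheduler is a C++ finite-state machine and its KV verification happens at compile time; the names and transitions below are our illustrative assumptions, not TokenSpeed's API:

```python
from enum import Enum, auto

class RequestState(Enum):
    QUEUED = auto()
    PREFILLING = auto()
    DECODING = auto()
    FINISHED = auto()

# Legal lifecycle transitions; anything else is rejected before it can
# corrupt KV-cache bookkeeping (hypothetical, for illustration only).
_TRANSITIONS = {
    RequestState.QUEUED: {RequestState.PREFILLING},
    RequestState.PREFILLING: {RequestState.DECODING, RequestState.FINISHED},
    RequestState.DECODING: {RequestState.DECODING, RequestState.FINISHED},
    RequestState.FINISHED: set(),
}

class Request:
    def __init__(self, request_id: str) -> None:
        self.request_id = request_id
        self.state = RequestState.QUEUED
        self.kv_blocks: list[int] = []  # block IDs held in the KV cache

    def transition(self, new_state: RequestState) -> None:
        if new_state not in _TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state
        if new_state is RequestState.FINISHED:
            self.kv_blocks.clear()  # blocks become reusable only here
```

The point of structuring it this way is that KV blocks can only be released on one well-defined transition, so safe reuse is enforced by the state machine itself rather than by callers remembering a convention.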
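Similarly, the pluggable kernel pillar can be pictured as a registry that maps an operation and a backend to an implementation. Again, this is a minimal sketch under our own assumptions; the names (KernelRegistry, mla_decode, "blackwell") are hypothetical, not TokenSpeed's actual interfaces:

```python
from typing import Callable

class KernelRegistry:
    """Maps (op, backend) pairs to kernel implementations."""

    def __init__(self) -> None:
        self._kernels: dict[tuple[str, str], Callable] = {}

    def register(self, op: str, backend: str):
        # Decorator factory: registers the decorated function as the
        # implementation of `op` on `backend`.
        def decorator(fn: Callable) -> Callable:
            self._kernels[(op, backend)] = fn
            return fn
        return decorator

    def dispatch(self, op: str, backend: str) -> Callable:
        try:
            return self._kernels[(op, backend)]
        except KeyError:
            raise NotImplementedError(f"no {op!r} kernel for backend {backend!r}")

registry = KernelRegistry()

@registry.register(op="mla_decode", backend="blackwell")
def mla_decode_blackwell(q, kv_latent):
    # Placeholder body standing in for an optimized MLA decode kernel.
    return None

# A new accelerator only needs to register its own implementations;
# callers resolve kernels by (op, backend) at dispatch time.
kernel = registry.dispatch("mla_decode", "blackwell")
```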
vLLM Adopts TokenSpeed as Day-0 Launch Partner
The vLLM project announced TokenSpeed as an exclusive day-0 launch partner and has already adopted the TokenSpeed MLA kernel. Released under the MIT license, TokenSpeed combines TensorRT-LLM-level performance with vLLM-level usability, making state-of-the-art inference performance accessible to the open-source community.
A lean, mission-driven team built the engine in just two months, and both the engine and its kernels remain under active development. Production hardening is planned over the next month as the project moves toward stability.
Rapid Development Timeline Demonstrates Technical Excellence
The rapid two-month development timeline from mid-March to May 6, 2026, combined with immediate vLLM adoption and 945 GitHub stars in the first week, demonstrates both technical excellence and strong product-market fit for agentic AI workloads. TokenSpeed's focus on per-user latency requirements and agentic scenarios differentiates it from general-purpose inference engines, addressing a growing need as AI applications shift toward autonomous agent architectures.
Key Takeaways
- TokenSpeed outperforms TensorRT-LLM by 9% in minimum latency and 11% in throughput on NVIDIA B200 for agentic workloads
- The project gained 945 GitHub stars within one week of release on May 6, 2026
- Development took just two months from mid-March 2026, demonstrating rapid execution by a lean team
- vLLM adopted TokenSpeed as an exclusive day-0 launch partner, with the MLA kernel already integrated
- Released under the MIT license, combining TensorRT-LLM-level performance with vLLM-level usability for the open-source community