On May 6, 2026, Lightseek released TokenSpeed, a new LLM inference engine that claims to deliver TensorRT-LLM-level performance with vLLM-level usability. The preview release quickly gained 244 stars on GitHub, though the creators explicitly warn against production deployments at this stage.
Preview Release Targets High-Performance Inference Market
TokenSpeed positions itself with three core value propositions: TensorRT-LLM-level performance, vLLM-level ease of use, and rapid development by a lean team in just two months. The "speed-of-light" branding signals aggressive optimization for inference throughput, targeting a market currently dominated by established tools like vLLM and TensorRT-LLM.
The project is released under the MIT open source license with a Python-dominant codebase (92.7%) and C++ components (6.8%). The architecture is organized into modular subsystems, including kernel, multi-head latent attention (MLA), and scheduler components.
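TokenSpeed's public interfaces are not yet documented, but a kernel/MLA/scheduler split typically separates request batching from the attention math and the GPU kernels that execute it. The sketch below is a hypothetical illustration of that layering in Python; every class and method name here is an assumption, not TokenSpeed's actual API.

```python
# Hypothetical sketch only: TokenSpeed's real interfaces are undocumented,
# so these names and signatures are assumptions about how a
# kernel/MLA/scheduler split is commonly layered in inference engines.
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_ids: list[int]                                  # tokenized prompt
    generated_ids: list[int] = field(default_factory=list)
    max_new_tokens: int = 128


class Scheduler:
    """Selects which requests join the next decode step (continuous batching)."""

    def __init__(self) -> None:
        self.pending: list[Request] = []

    def add(self, request: Request) -> None:
        self.pending.append(request)

    def next_batch(self, max_batch_size: int = 32) -> list[Request]:
        # Requests that still need tokens, up to the batch limit.
        active = [r for r in self.pending
                  if len(r.generated_ids) < r.max_new_tokens]
        return active[:max_batch_size]


class MLADecoder:
    """Stand-in for multi-head latent attention over a compressed KV cache."""

    def step(self, batch: list[Request]) -> None:
        # A real engine dispatches fused GPU kernels here; this stub appends a
        # dummy next-token id per request so the sketch stays self-contained.
        for request in batch:
            request.generated_ids.append(0)
```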
Current Capabilities and Development Roadmap
The preview release currently demonstrates performance with the Kimi K2.5 model running on Blackwell B200 hardware. The development roadmap includes several in-progress features:
- Model coverage: Qwen 3.6, DeepSeek V4, and MiniMax M2.7
- Runtime features: Pipelined decoding, EPLB, KV stores, Mamba caching, VLM support, and metrics
- Hardware optimization: Improvements for Hopper and MI350 platforms
Comprehensive documentation is available at lightseek.org/tokenspeed/, covering getting-started guides, server launch instructions, model recipes, and configuration parameters.
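The server launch and configuration details live in those docs and are not reproduced here. As a rough orientation, the snippet below assumes TokenSpeed exposes an OpenAI-compatible endpoint, as most modern inference engines do; the base URL, port, and model id are placeholders, not values confirmed by the project.

```python
# Hypothetical usage sketch: assumes an OpenAI-compatible TokenSpeed server
# is already running locally. The base_url and model id are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local TokenSpeed endpoint
    api_key="not-needed-for-local",       # local servers often ignore the key
)

response = client.chat.completions.create(
    model="kimi-k2.5",  # placeholder model id; check the server's model list
    messages=[{"role": "user", "content": "Summarize continuous batching."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```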
Positioning in Competitive Inference Landscape
As of April 2026, the LLM inference market shows Cerebras leading on open-source model speed at 2,100 tokens per second for Llama 3.1 70B, with SambaNova at 580 tokens per second. TokenSpeed's entry into this space is a potential disruption: a smaller team challenging established players.
Lightseek also develops the Shepherd Model Gateway (SMG), a high-performance load balancer for LLM inference, indicating the organization's broader focus on inference infrastructure optimization.
Key Takeaways
- TokenSpeed was released as a preview on May 6, 2026; it quickly gained 244 GitHub stars, and its creators explicitly warn against production use
- The engine claims to deliver TensorRT-LLM-level performance with vLLM-level usability, built by a lean team in two months
- Currently demonstrated with Kimi K2.5 on Blackwell B200 hardware, with roadmap support for Qwen 3.6, DeepSeek V4, and MiniMax M2.7
- The codebase is 92.7% Python and 6.8% C++, released under the MIT license with a modular architecture spanning kernel, MLA, and scheduler components
- Lightseek positions TokenSpeed in a competitive market where Cerebras leads at 2,100 tokens per second for Llama 3.1 70B