SimpleNews.ai

Saguaro Algorithm Achieves 2x Speedup Over Standard Speculative Decoding

Wednesday, March 4, 2026

Researchers Tanishq Kumar, Tri Dao, and Avner May have published a new paper introducing Speculative Speculative Decoding (SSD), an optimization technique that overlaps speculation and verification in LLM inference. Their implementation, called Saguaro, runs up to 2x faster than optimized speculative decoding baselines and up to 5x faster than autoregressive decoding.

How Speculative Speculative Decoding Works

Standard speculative decoding uses a fast draft model to propose several upcoming tokens, which the slower target model then verifies in parallel with a single forward pass. SSD goes further by removing the sequential dependency between drafting and verification: while verification is still in flight, the draft model predicts the likely verification outcomes and prepares a speculation for each of them pre-emptively. If the actual verification outcome matches one of the predicted ones, the corresponding speculation can be returned immediately, eliminating drafting overhead entirely.
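The idea can be sketched with a toy example. This is an illustrative simplification, not the paper's Saguaro algorithm: the "models" here are trivial deterministic rules, and the names `draft`, `verify`, and `ssd_decode` are invented for the sketch. It shows the core mechanism described above: pre-drafting one continuation per predicted verification outcome, so a matching outcome skips the drafting step.

```python
K = 4  # tokens drafted per verification step

def true_next(tok):
    """Toy stand-in for the target model's next-token choice."""
    return (3 * tok + 1) % 97

def draft_next(tok):
    """Toy drafter: agrees with the target except when tok is a multiple of 5."""
    return true_next(tok) if tok % 5 else (true_next(tok) + 1) % 97

def draft(context):
    """Autoregressively draft K tokens with the cheap model."""
    out, cur = [], context[-1]
    for _ in range(K):
        cur = draft_next(cur)
        out.append(cur)
    return out

def verify(context, proposal):
    """One target 'forward pass': accept the longest correct prefix and emit
    the target's own token at the first mismatch, guaranteeing progress."""
    cur, accepted = context[-1], []
    for tok in proposal:
        expected = true_next(cur)
        if tok != expected:
            accepted.append(expected)  # correction token from the target
            break
        accepted.append(tok)
        cur = tok
    return accepted

def ssd_decode(prompt, steps):
    """Decode with pre-emptive speculation: predict verification outcomes and
    pre-draft a continuation for each, so a matching outcome skips drafting."""
    context, hits = list(prompt), 0
    proposal = draft(context)
    for _ in range(steps):
        # While verification is conceptually in flight, prepare one
        # speculation per predicted outcome (here: each accepted-prefix length).
        predicted = {tuple(proposal[:n]): draft(context + proposal[:n])
                     for n in range(len(proposal) + 1)}
        accepted = verify(context, proposal)
        context += accepted
        if tuple(accepted) in predicted:
            proposal = predicted[tuple(accepted)]  # ready immediately
            hits += 1
        else:
            proposal = draft(context)  # unpredicted outcome: draft from scratch
    return context, hits

ctx, hits = ssd_decode([7], steps=10)
```

In this toy, a cache hit occurs only when the verification outcome exactly matches a pre-drafted prediction; the real system must also decide which outcomes are worth pre-drafting, one of the challenges the paper addresses.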

The authors identify three key challenges with this approach and present principled methods to solve each, resulting in the Saguaro algorithm.

Performance and Availability

The research demonstrates significant performance improvements:

  • Up to 2x faster than optimized speculative decoding baselines
  • Up to 5x faster than autoregressive decoding
  • Compatible with open source inference engines

The technique breaks the sequential dependency inherent in traditional speculative decoding (draft → verify → draft again) by overlapping speculation with verification.
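The wall-clock effect of that overlap can be sketched with stand-in latencies (the 50 ms and 20 ms figures below are assumptions for illustration, not measurements from the paper): a sequential step costs roughly verify + draft, while an overlapped step costs roughly max(verify, draft).

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Assumed latencies for illustration only.
VERIFY_S, DRAFT_S = 0.05, 0.02

def verify_stub():
    time.sleep(VERIFY_S)   # pretend: slow target-model forward pass
    return 3               # pretend: 3 drafted tokens were accepted

def draft_stub():
    time.sleep(DRAFT_S)    # pretend: cheap draft-model pass
    return ["tok"] * 4

def sequential_step():
    """draft -> verify -> draft again: the next draft waits on verification."""
    verify_stub()
    return draft_stub()

def overlapped_step(pool):
    """SSD-style: draft pre-emptively while verification is still in flight."""
    in_flight = pool.submit(verify_stub)
    nxt = draft_stub()       # runs concurrently with verification
    in_flight.result()
    return nxt

with ThreadPoolExecutor(max_workers=1) as pool:
    t0 = time.perf_counter(); sequential_step(); seq_s = time.perf_counter() - t0
    t0 = time.perf_counter(); overlapped_step(pool); ovl_s = time.perf_counter() - t0
# Per step: about VERIFY_S + DRAFT_S sequentially vs. max(VERIFY_S, DRAFT_S) overlapped.
```

With these assumed numbers the overlapped step takes about 50 ms instead of 70 ms; the real speedup depends on the actual draft/verify latency ratio and on how often pre-drafted speculations are usable.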

Why This Matters for LLM Deployment

Inference cost remains a major bottleneck for LLM deployment at scale. While speculative decoding has become a standard optimization technique, it still contains sequential dependencies that limit performance. SSD represents a fundamental architectural improvement rather than just an implementation optimization.

As model sizes continue growing and inference costs become increasingly important for production deployments, a 2x improvement over already-optimized speculative decoding could translate to significant cost savings. The technique follows recent trends in inference optimization focused on latency reduction without quality degradation, including parallel decoding, multi-token prediction, and other speculative methods.

Key Takeaways

  • Speculative Speculative Decoding (SSD) overlaps speculation with verification, eliminating the sequential dependency in traditional speculative decoding
  • The Saguaro algorithm achieves up to 2x speedup over optimized speculative decoding baselines and up to 5x speedup over autoregressive decoding
  • The technique works with open source inference engines, making it accessible for production deployments
  • SSD represents a fundamental architectural improvement in LLM inference optimization, addressing the growing importance of inference costs as model sizes increase
  • The research was published by Tanishq Kumar, Tri Dao, and Avner May on arXiv on March 3, 2026