SpecKV Delivers 56% Speed Boost for Compressed Language Models
Researcher Shikhar Shukla of the University of Kentucky published SpecKV on arXiv on May 4, 2026, introducing an adaptive approach to speculative decoding that delivers a 56% speedup over fixed-parameter baselines. The research addresses a critical gap in how speculative decoding interacts with model compression techniques such as quantization.
Speculative decoding accelerates LLM inference by using a small draft model to propose candidate tokens that a larger target model verifies in parallel. Nearly all existing systems use a fixed speculation length γ, the number of tokens the draft model proposes per step, commonly set to γ = 4. SpecKV demonstrates that this one-size-fits-all approach leaves significant performance on the table.
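For readers unfamiliar with the mechanism, the sketch below shows one speculative decoding step with greedy verification, a simplified variant of the standard accept/reject scheme. The `draft_model` and `target_model` interfaces are hypothetical callables mapping a token tensor of shape (1, seq_len) to logits of shape (1, seq_len, vocab); this is not the paper's implementation.

```python
import torch

def speculative_step(draft_model, target_model, ctx, gamma=4):
    """One speculative decoding step with greedy verification."""
    # Draft model proposes `gamma` candidate tokens autoregressively.
    draft_ctx = ctx
    proposals = []
    for _ in range(gamma):
        logits = draft_model(draft_ctx)[:, -1, :]
        tok = logits.argmax(dim=-1, keepdim=True)  # greedy draft token
        proposals.append(tok)
        draft_ctx = torch.cat([draft_ctx, tok], dim=-1)

    # Target model scores the context plus all proposals in ONE forward
    # pass -- this parallel verification is where the speedup comes from.
    target_logits = target_model(draft_ctx)
    ctx_len = ctx.shape[1]

    # Accept proposals until the first disagreement with the target's
    # greedy choice; on a mismatch, substitute the target's token.
    accepted = []
    for i, tok in enumerate(proposals):
        target_tok = target_logits[:, ctx_len + i - 1, :].argmax(dim=-1, keepdim=True)
        if target_tok.item() != tok.item():
            accepted.append(target_tok)
            break
        accepted.append(tok)
    else:
        # All proposals accepted: take one bonus token from the target.
        bonus = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
        accepted.append(bonus)

    return torch.cat([ctx] + accepted, dim=-1)
```

Each step therefore emits between one and γ + 1 tokens at the cost of a single target forward pass, which is why the choice of γ directly shapes throughput.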
Optimal Parameters Vary Across Compression Levels
Shukla's methodology profiled speculative decoding across 4 task categories, 4 speculation lengths, and 3 compression levels: FP16 (full precision), INT8 (8-bit quantization), and NF4 (4-bit NormalFloat). The research collected 5,112 step-level records tracking per-step acceptance rates, draft entropy, and draft confidence.
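To make the dataset concrete, here is a hypothetical schema for one step-level record; the field names and types are illustrative assumptions, not the released dataset's actual format.

```python
from dataclasses import dataclass

@dataclass
class StepRecord:
    """One step-level profiling record (field names are illustrative)."""
    task: str                # one of the 4 task categories
    compression: str         # "fp16", "int8", or "nf4"
    gamma: int               # speculation length used for this step
    accepted: int            # tokens accepted by the target model
    acceptance_rate: float   # accepted / gamma
    draft_entropy: float     # entropy of the draft distributions
    draft_confidence: float  # top-1 probability of the draft
    step_time_ms: float      # wall-clock time of the step
```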
The data revealed that optimal speculation length shifts significantly across compression regimes. More compressed models benefit from different speculation strategies than full-precision models. Draft model confidence and entropy emerged as strong predictors of acceptance rate, with correlations around 0.56.
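Both signals fall out of the draft model's own output distribution, so they cost essentially nothing to collect during decoding. A minimal sketch of how they could be computed per step follows; aggregating across proposed tokens by taking the mean is an assumption, not necessarily the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def draft_signals(draft_logits: torch.Tensor):
    """Confidence/entropy signals from draft logits of shape (num_proposed, vocab).

    Confidence is the mean top-1 probability; entropy is the mean Shannon
    entropy of the draft's per-token distributions.
    """
    probs = F.softmax(draft_logits, dim=-1)
    confidence = probs.max(dim=-1).values.mean()
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()
    return confidence.item(), entropy.item()
```

Intuitively, a confident, low-entropy draft is more likely to match the target model, so these signals are natural inputs for predicting acceptance.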
Lightweight MLP Enables Real-Time Adaptation
SpecKV uses a small multilayer perceptron trained on draft confidence and entropy signals to dynamically select speculation length. This adaptive approach maximizes expected tokens per speculation step while adding only 0.34 ms of overhead per decision, less than 0.5% of total step time.
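A minimal sketch of such a controller appears below. It assumes a tiny two-input MLP that predicts a per-token acceptance probability, then picks the γ maximizing expected tokens per unit step time under the standard i.i.d. acceptance analysis from the speculative decoding literature, E[tokens] = (1 − α^(γ+1)) / (1 − α). The layer sizes, candidate γ values, and relative cost model are all assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AcceptancePredictor(nn.Module):
    """Tiny MLP mapping (draft confidence, draft entropy) to a predicted
    per-token acceptance probability. Architecture is illustrative."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected accepted tokens per step under an i.i.d. acceptance model:
    (1 - alpha^(gamma + 1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def select_gamma(predictor, confidence, entropy,
                 candidates=(1, 2, 4, 8),
                 t_draft=1.0, t_verify=5.0):
    """Pick gamma maximizing expected tokens per unit of step time.
    t_draft and t_verify are illustrative relative costs of one draft
    token and one target verification pass."""
    with torch.no_grad():
        x = torch.tensor([[confidence, entropy]], dtype=torch.float32)
        alpha = predictor(x).item()
    return max(candidates,
               key=lambda g: expected_tokens(alpha, g) / (g * t_draft + t_verify))
```

The cost-normalized objective matters: expected accepted tokens alone grows monotonically with γ, so a useful controller must weigh longer drafts against the extra draft-model time they consume.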
The 56% improvement over the fixed γ = 4 baseline was statistically significant (p < 0.001, paired bootstrap test). All profiling data, trained models, and analysis notebooks have been released as open-source artifacts.
Key Takeaways
- SpecKV achieves 56% speed improvement over fixed speculation length baselines with only 0.34ms overhead per decision
- Optimal speculation length varies significantly across model compression levels (FP16, INT8, NF4)
- Draft model confidence and entropy predict acceptance rates with correlations around 0.56
- The research profiled 5,112 step-level records across 4 task categories and 3 compression levels
- All profiling data, trained models, and analysis notebooks released as open-source artifacts alongside the paper