SpecKV Delivers 56% Speed Boost for Compressed Language Models
Researcher Shikhar Shukla of the University of Kentucky published SpecKV on arXiv on May 4, 2026, introducing an adaptive approach to speculative decoding that delivers a 56% speedup over fixed-parameter baselines. The research addresses a critical gap in how speculative decoding interacts with model compression techniques such as quantization.
Speculative decoding accelerates LLM inference by using a small draft model to propose candidate tokens that a larger target model verifies in parallel. Nearly all existing systems use a fixed speculation length γ, the number of tokens the draft model proposes per step, commonly set to γ = 4. SpecKV demonstrates that this one-size-fits-all approach leaves significant performance on the table.
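For readers unfamiliar with the mechanism, the sketch below shows one speculative decoding step with greedy verification, a simplified variant of the standard accept/reject scheme. The `draft_model` and `target_model` interfaces are hypothetical callables mapping a token tensor of shape (1, seq_len) to logits of shape (1, seq_len, vocab); this is not the paper's implementation.

```python
import torch

def speculative_step(draft_model, target_model, ctx, gamma=4):
    """One speculative decoding step with greedy verification."""
    # Draft model proposes `gamma` candidate tokens autoregressively.
    draft_ctx = ctx
    proposals = []
    for _ in range(gamma):
        logits = draft_model(draft_ctx)[:, -1, :]
        tok = logits.argmax(dim=-1, keepdim=True)  # greedy draft token
        proposals.append(tok)
        draft_ctx = torch.cat([draft_ctx, tok], dim=-1)

    # Target model scores the context plus all proposals in ONE forward
    # pass -- this parallel verification is where the speedup comes from.
    target_logits = target_model(draft_ctx)
    ctx_len = ctx.shape[1]

    # Accept proposals until the first disagreement with the target's
    # greedy choice; on a mismatch, substitute the target's token.
    accepted = []
    for i, tok in enumerate(proposals):
        target_tok = target_logits[:, ctx_len + i - 1, :].argmax(dim=-1, keepdim=True)
        if target_tok.item() != tok.item():
            accepted.append(target_tok)
            break
        accepted.append(tok)
    else:
        # All proposals accepted: take one bonus token from the target.
        bonus = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
        accepted.append(bonus)

    return torch.cat([ctx] + accepted, dim=-1)
```

Each step therefore emits between one and γ + 1 tokens at the cost of a single target forward pass, which is why the choice of γ directly shapes throughput.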
Optimal Parameters Vary Across Compression Levels
Shukla's methodology profiled speculative decoding across 4 task categories, 4 speculation lengths, and 3 compression levels: FP16 (full precision), INT8 (8-bit quantization), and NF4 (4-bit NormalFloat). The research collected 5,112 step-level records tracking per-step acceptance rates, draft entropy, and draft confidence.
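To make the dataset concrete, here is a hypothetical schema for one step-level record; the field names and types are illustrative assumptions, not the released dataset's actual format.

```python
from dataclasses import dataclass

@dataclass
class StepRecord:
    """One step-level profiling record (field names are illustrative)."""
    task: str                # one of the 4 task categories
    compression: str         # "fp16", "int8", or "nf4"
    gamma: int               # speculation length used for this step
    accepted: int            # tokens accepted by the target model
    acceptance_rate: float   # accepted / gamma
    draft_entropy: float     # entropy of the draft distributions
    draft_confidence: float  # top-1 probability of the draft
    step_time_ms: float      # wall-clock time of the step
```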
The data revealed that optimal speculation length shifts significantly across compression regimes. More compressed models benefit from different speculation strategies than full-precision models. Draft model confidence and entropy emerged as strong predictors of acceptance rate, with correlations around 0.56.
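Both signals fall out of the draft model's own output distribution, so they cost essentially nothing to collect during decoding. A minimal sketch of how they could be computed per step follows; aggregating across proposed tokens by taking the mean is an assumption, not necessarily the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def draft_signals(draft_logits: torch.Tensor):
    """Confidence/entropy signals from draft logits of shape (num_proposed, vocab).

    Confidence is the mean top-1 probability; entropy is the mean Shannon
    entropy of the draft's per-token distributions.
    """
    probs = F.softmax(draft_logits, dim=-1)
    confidence = probs.max(dim=-1).values.mean()
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()
    return confidence.item(), entropy.item()
```

Intuitively, a confident, low-entropy draft is more likely to match the target model, so these signals are natural inputs for predicting acceptance.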
Lightweight MLP Enables Real-Time Adaptation
SpecKV uses a small multilayer perceptron trained on draft confidence and entropy signals to dynamically select speculation length. This adaptive approach maximizes expected tokens per speculation step while adding only 0.34 ms of overhead per decision, less than 0.5% of total step time.
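A minimal sketch of such a controller appears below. It assumes a tiny two-input MLP that predicts a per-token acceptance probability, then picks the γ maximizing expected tokens per unit step time under the standard i.i.d. acceptance analysis from the speculative decoding literature, E[tokens] = (1 − α^(γ+1)) / (1 − α). The layer sizes, candidate γ values, and relative cost model are all assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AcceptancePredictor(nn.Module):
    """Tiny MLP mapping (draft confidence, draft entropy) to a predicted
    per-token acceptance probability. Architecture is illustrative."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)

def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected accepted tokens per step under an i.i.d. acceptance model:
    (1 - alpha^(gamma + 1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def select_gamma(predictor, confidence, entropy,
                 candidates=(1, 2, 4, 8),
                 t_draft=1.0, t_verify=5.0):
    """Pick gamma maximizing expected tokens per unit of step time.
    t_draft and t_verify are illustrative relative costs of one draft
    token and one target verification pass."""
    with torch.no_grad():
        x = torch.tensor([[confidence, entropy]], dtype=torch.float32)
        alpha = predictor(x).item()
    return max(candidates,
               key=lambda g: expected_tokens(alpha, g) / (g * t_draft + t_verify))
```

The cost-normalized objective matters: expected accepted tokens alone grows monotonically with γ, so a useful controller must weigh longer drafts against the extra draft-model time they consume.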
The 56% improvement over the fixed γ = 4 baseline was statistically significant (p < 0.001, paired bootstrap test). All profiling data, trained models, and analysis notebooks have been released as open-source artifacts.
Key Takeaways
- SpecKV achieves 56% speed improvement over fixed speculation length baselines with only 0.34ms overhead per decision
- Optimal speculation length varies significantly across model compression levels (FP16, INT8, NF4)
- Draft model confidence and entropy predict acceptance rates with correlations around 0.56
- The research profiled 5,112 step-level records across 4 task categories and 3 compression levels
- All profiling data, trained models, and analysis notebooks released as open-source artifacts alongside the paper