Researchers have developed AutoTTS, a framework that automatically discovers test-time scaling strategies for large language models, completing the entire discovery process for just $39.90 in 160 minutes. The discovered strategies improve accuracy-cost tradeoffs over manually designed baselines and generalize to new benchmarks and model scales.
Manual Design of Test-Time Scaling Leaves Performance on the Table
Test-time scaling (TTS) improves large language model performance by allocating additional computation during inference rather than relying solely on training-time capabilities. However, existing TTS strategies are largely hand-crafted, with researchers manually designing reasoning patterns and tuning heuristics by intuition. This approach leaves much of the computation-allocation space unexplored.
AutoTTS Shifts Design from Heuristics to Discovery Environments
The researchers propose an environment-driven framework that changes what researchers design: instead of individual TTS heuristics, they build environments in which TTS strategies can be discovered automatically. The key lies in environment construction. The discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search.
The team formulates width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals. Controllers decide when to branch, continue, probe, prune, or stop. Because they operate on recorded trajectories, candidate controllers can be evaluated cheaply without repeated LLM calls. The team also introduces beta parameterization to make the search tractable, and fine-grained execution-trace feedback to help the agent diagnose why a TTS program fails.
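To make the idea of cheap offline evaluation concrete, here is a minimal Python sketch of a toy controller replayed over recorded traces. It is not the AutoTTS implementation: the `Trajectory` record, the `replay` function, and the `width`, `probe_every`, and `agree_needed` parameters are all hypothetical names chosen for illustration, and the stopping and pruning rules are simple majority-vote assumptions. Probe cost is ignored, and each trace is assumed to have at least one step.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one pre-collected reasoning trace."""
    step_tokens: list[int]    # tokens consumed by each reasoning step
    probe_answers: list[str]  # answer a probe would report after each step


def replay(trajectories, width, probe_every, agree_needed):
    """Replay a toy controller offline and return (answer, tokens_spent)."""
    # branch: open up to `width` recorded traces in parallel
    active = list(range(min(width, len(trajectories))))
    latest = {}  # trace index -> most recent probed answer
    spent = 0
    depth = 0
    while active:
        # continue: advance every active trace by one recorded step
        for i in active:
            spent += trajectories[i].step_tokens[depth]
        depth += 1
        done = {i for i in active if depth == len(trajectories[i].step_tokens)}
        # probe: read recorded answers instead of calling a model.
        # All active traces are probed on schedule; finished traces always are.
        to_probe = active if depth % probe_every == 0 else done
        for i in to_probe:
            latest[i] = trajectories[i].probe_answers[depth - 1]
        if latest:
            answer, votes = Counter(latest.values()).most_common(1)[0]
            if votes >= agree_needed:
                return answer, spent  # stop: enough traces agree
            if votes >= 2:
                # prune: drop probed traces that disagree with the leader
                done |= {i for i in active if latest.get(i, answer) != answer}
        active = [i for i in active if i not in done]
    if not latest:
        return None, spent
    return Counter(latest.values()).most_common(1)[0][0], spent


# Synthetic traces for illustration only (not data from the paper).
traces = [
    Trajectory([10, 10, 10, 10], ["?", "42", "42", "42"]),
    Trajectory([10, 10, 10, 10], ["?", "42", "42", "42"]),
    Trajectory([10, 10, 10, 10], ["?", "7", "7", "7"]),
]
print(replay(traces, width=3, probe_every=2, agree_needed=2))
```

In this toy setup, running all three traces to completion would cost 120 tokens, while the controller stops after 60 once two traces agree. Since every decision is answered from recorded data, many candidate controllers can be scored this way without new model calls.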
Discovered Strategies Outperform Manual Baselines and Generalize
Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy-cost tradeoff over strong manually designed baselines. The discovered strategies also generalize to held-out benchmarks and different model scales, which suggests they capture general principles rather than overfitting to specific test conditions.
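One common way to reason about an accuracy-cost tradeoff is Pareto dominance: one strategy dominates another if it is no more expensive, no less accurate, and strictly better on at least one of the two. The sketch below illustrates that comparison in Python. The `pareto_front` helper is a hypothetical name, the numbers are invented placeholders rather than results from the paper, and the article does not state that the authors use this exact criterion.

```python
def pareto_front(points):
    """Return names of strategies not dominated in (cost, accuracy)."""
    front = []
    for name, cost, acc in points:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for _, c, a in points
        )
        if not dominated:
            front.append(name)
    return front


# Invented placeholder values: (name, cost in tokens, accuracy).
points = [
    ("s1", 100.0, 0.70),
    ("s2", 200.0, 0.74),
    ("s3", 90.0, 0.73),
]
print(pareto_front(points))
```

Here `s3` dominates `s1` because it is both cheaper and more accurate, while `s2` stays on the front because it is still the most accurate option.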
The entire discovery process cost only $39.90 and completed in 160 minutes, making the approach accessible to researchers without extensive computational resources. The code is publicly available, enabling other researchers to apply the framework to different domains and tasks.
Key Takeaways
- AutoTTS automatically discovers test-time scaling strategies for LLMs, shifting design focus from individual heuristics to discovery environments
- The framework formulates width-depth TTS as controller synthesis over pre-collected reasoning trajectories with cheap evaluation
- Discovered strategies improve accuracy-cost tradeoffs over manually designed baselines on mathematical reasoning benchmarks
- The entire discovery process cost only $39.90 and completed in 160 minutes, making it accessible to researchers without extensive computational resources
- Discovered strategies generalize to held-out benchmarks and different model scales, suggesting they capture general principles rather than overfitting