FlexSQL Achieves State-of-the-Art Text-to-SQL Performance
Researchers from StringNLPLAB published FlexSQL on arXiv on May 4, 2026, introducing a text-to-SQL agent that achieves 65.4% accuracy on the Spider2-Snow benchmark using gpt-oss-120b. The system outperforms strong baselines built on larger models by letting the agent interact with the database flexibly throughout the reasoning process.
Text-to-SQL over large analytical databases requires navigating complex schemas, resolving ambiguous queries, and grounding decisions in actual data. Most current systems retrieve schema elements once upfront and only revisit the database for post-hoc error correction, limiting their ability to recover from early mistakes.
Dynamic Database Exploration Enables Better Reasoning
FlexSQL's core innovation is allowing the agent to explore schema structure, inspect data values, and run verification queries at any point during reasoning. The system generates diverse execution plans covering multiple query interpretations and implements each plan in either SQL or Python depending on task requirements.
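To make the idea of exploring on demand concrete, here is a minimal sketch of the three kinds of database interaction described above: listing schema structure, inspecting data values, and running a verification query. It is not FlexSQL's actual tooling. The helper names (`list_tables`, `describe_table`, `distinct_values`), the `orders` table, and the use of an in-memory SQLite database in place of Snowflake are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical toy database standing in for a large analytical warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "shipped", 10.0), (2, "shipped", 20.0), (3, "cancelled", 5.0)],
)

def list_tables(conn):
    """Schema exploration: return the names of all tables."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [r[0] for r in rows]

def describe_table(conn, table):
    """Schema exploration: return (column name, declared type) pairs."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [(r[1], r[2]) for r in rows]

def distinct_values(conn, table, column, limit=5):
    """Value inspection: return a few distinct values from one column."""
    rows = conn.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" ORDER BY 1 LIMIT ?',
        (limit,),
    ).fetchall()
    return [r[0] for r in rows]

# An agent could call any of these at any reasoning step, not only upfront.
tables = list_tables(conn)
columns = describe_table(conn, "orders")
statuses = distinct_values(conn, "orders", "status")

# Verification query: confirm the literal 'shipped' really matches rows
# before committing to a final query that filters on it.
shipped_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status = ?", ("shipped",)
).fetchone()[0]
```

The point of the sketch is that each helper is cheap and independent, so a plan that filters on a status value can first check which values actually occur instead of guessing from column names alone.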
Python's flexible control structures prove especially useful for problems requiring iterative refinement. A two-tiered repair mechanism lets the agent backtrack from code-level errors to plan-level revisions instead of locking it into a predetermined interaction pattern.
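The two-tiered repair idea can be sketched as a pair of nested loops: the inner loop retries at the code level, and the outer loop falls back to a different plan when code-level repair runs out. This is an illustrative sketch only, not the authors' implementation. The function `run_with_two_tier_repair`, the `max_code_repairs` parameter, and the hard-coded SQL attempts (which stand in for model-generated repairs) are assumptions.

```python
import sqlite3

def run_with_two_tier_repair(conn, plans, max_code_repairs=1):
    """Try each plan in order; within a plan, try successive code attempts.

    plans: list of (plan_name, [sql_attempt, ...]) pairs. The attempts stand
    in for repairs a model would generate after seeing the error message.
    Returns (plan_name, rows, log), or (None, [], log) if every plan fails.
    """
    log = []
    for plan_name, attempts in plans:                  # tier 2: plan level
        for sql in attempts[: max_code_repairs + 1]:   # tier 1: code level
            try:
                rows = conn.execute(sql).fetchall()
            except sqlite3.Error as exc:
                # Code-level failure: record it and try the next repair.
                log.append((plan_name, "code_error", str(exc)))
                continue
            log.append((plan_name, "ok", sql))
            return plan_name, rows, log
        # Code-level repairs exhausted: backtrack and revise the plan.
        log.append((plan_name, "plan_abandoned", ""))
    return None, [], log

# Hypothetical toy data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "shipped", 10.0), (2, "shipped", 20.0), (3, "cancelled", 5.0)],
)

plans = [
    # plan_a wrongly assumes a column named "state"; fixing the syntax
    # error alone cannot rescue it, so the plan itself must change.
    ("plan_a", [
        "SELEC SUM(amount) FROM orders WHERE state = 'shipped'",
        "SELECT SUM(amount) FROM orders WHERE state = 'shipped'",
    ]),
    # plan_b uses the column that actually exists.
    ("plan_b", ["SELECT SUM(amount) FROM orders WHERE status = 'shipped'"]),
]

chosen_plan, result_rows, repair_log = run_with_two_tier_repair(conn, plans)
code_errors = [entry for entry in repair_log if entry[1] == "code_error"]
```

In this toy run, both attempts under `plan_a` fail, the plan is abandoned, and `plan_b` succeeds. A real agent would presumably also check results for plausibility, not just for successful execution.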
Outperforms Larger Models on Benchmark Tests
On the Spider2-Snow benchmark, FlexSQL using gpt-oss-120b outperformed strong baselines using substantially larger models, including DSR-SQL with DeepSeek-R1 and ReFoRCE with gpt-o3. When integrated into Claude Code as skills, FlexSQL yields over 10% relative improvement.
Ablation studies confirmed that flexible exploration and flexible execution jointly contribute to effectiveness.
Key Takeaways
- FlexSQL achieves 65.4% accuracy on the Spider2-Snow benchmark using gpt-oss-120b, outperforming baselines built on larger models
- System enables flexible database exploration at any point during reasoning, rather than fixed upfront schema retrieval
- Two-tiered repair mechanism allows backtracking from code-level to plan-level revisions
- Integration with Claude Code produces over 10% relative improvement