FlexSQL Agent Achieves 65.4% on Spider2-Snow, Outperforming Larger Models Through Flexible Database Interaction

FlexSQL Achieves State-of-the-Art Text-to-SQL Performance

Researchers from StringNLPLAB published FlexSQL on arXiv on May 4, 2026, introducing a text-to-SQL agent that achieves 65.4% accuracy on the Spider2-Snow benchmark using gpt-oss-120b. The system outperforms strong baselines built on larger models by letting the agent interact with the database flexibly throughout the reasoning process.

Text-to-SQL over large analytical databases requires navigating complex schemas, resolving ambiguous queries, and grounding decisions in actual data. Most current systems retrieve schema elements once upfront and only revisit the database for post-hoc error correction, limiting their ability to recover from early mistakes.

Dynamic Database Exploration Enables Better Reasoning

FlexSQL's core innovation is allowing the agent to explore schema structure, inspect data values, and run verification queries at any point during reasoning. The system generates diverse execution plans covering multiple query interpretations and implements each plan in either SQL or Python depending on task requirements.
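The interaction style described above can be pictured as a small set of database tools the agent may call at any step, not only once at the start. The sketch below is illustrative and is not FlexSQL's actual interface: the class name DatabaseTools, its method names, and the demo orders table are all hypothetical, and an in-memory SQLite database stands in for the Snowflake backend that Spider2-Snow targets.

```python
import sqlite3


class DatabaseTools:
    """Hypothetical tool set an agent could call at any point while reasoning."""

    def __init__(self, conn):
        self.conn = conn

    def list_tables(self):
        # Schema exploration: which tables exist?
        rows = self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [row[0] for row in rows]

    def describe_table(self, table):
        # Schema exploration: column names and declared types.
        if table not in self.list_tables():
            raise ValueError(f"unknown table: {table}")
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        rows = self.conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        return [(row[1], row[2]) for row in rows]

    def sample_values(self, table, column, limit=5):
        # Data inspection: look at real values before writing a filter.
        if column not in [name for name, _ in self.describe_table(table)]:
            raise ValueError(f"unknown column: {table}.{column}")
        rows = self.conn.execute(
            f'SELECT DISTINCT "{column}" FROM "{table}" ORDER BY 1 LIMIT ?',
            (limit,),
        ).fetchall()
        return [row[0] for row in rows]

    def run_query(self, sql):
        # Verification: run a candidate or sanity-check query.
        return self.conn.execute(sql).fetchall()


def build_demo_db():
    # Hypothetical demo data, not taken from the benchmark.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (status, amount) VALUES (?, ?)",
        [("SHIPPED", 10.0), ("SHIPPED", 25.5), ("PENDING", 7.0)],
    )
    conn.commit()
    return conn


tools = DatabaseTools(build_demo_db())
tables = tools.list_tables()
columns = tools.describe_table("orders")
# Inspecting values shows the status is stored upper-case, which a
# question phrased as "shipped orders" would not reveal on its own.
statuses = tools.sample_values("orders", "status")
shipped_count = tools.run_query(
    "SELECT COUNT(*) FROM orders WHERE status = 'SHIPPED'"
)[0][0]
```

In this toy run the agent learns from sample_values that the filter literal must be 'SHIPPED', which is the kind of data-grounded decision that a single upfront schema retrieval cannot support.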

Python's flexible control structures prove especially useful for problems requiring iterative refinement. A two-tiered repair mechanism lets the agent backtrack from code-level errors to plan-level revisions instead of following a predetermined interaction pattern.
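One way to picture the two-tiered repair idea is a nested loop: an inner tier that tries to patch failing code, and an outer tier that abandons the plan once code-level repairs are exhausted. The sketch below is a simplification under stated assumptions, not the paper's algorithm. The function names, the max_code_repairs budget, and the stub fix_known_typo (standing in for a model-driven repair step) are hypothetical, and "plan-level revision" is reduced here to switching to the next pre-generated plan.

```python
import sqlite3


def solve_with_two_tier_repair(conn, plans, repair_code, max_code_repairs=1):
    """Run plans in order; repair code first, change plan when repairs run out.

    plans is a list of (plan_name, sql) pairs. repair_code(sql, error_message)
    returns a revised SQL string. Returns (rows or None, trace).
    """
    trace = []
    for plan_name, sql in plans:
        candidate = sql
        repairs_used = 0
        while True:
            try:
                rows = conn.execute(candidate).fetchall()
            except sqlite3.Error as exc:
                trace.append((plan_name, "error"))
                if repairs_used >= max_code_repairs:
                    break  # Tier 2: give up on this plan, move to the next one.
                # Tier 1: code-level repair within the same plan.
                candidate = repair_code(candidate, str(exc))
                repairs_used += 1
            else:
                trace.append((plan_name, "ok"))
                return rows, trace
    return None, trace


def fix_known_typo(sql, error_message):
    # Hypothetical stand-in for a model-driven repair: fixes one misspelling.
    return sql.replace("amout", "amount")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 10.0), ("south", 20.0)],
)

plans = [
    # Plan A targets a table that does not exist; no code fix can save it.
    ("plan_a", "SELECT SUM(amout) FROM revenue"),
    # Plan B has the right table but a misspelled column; code repair suffices.
    ("plan_b", "SELECT SUM(amout) FROM sales"),
]
result, trace = solve_with_two_tier_repair(conn, plans, fix_known_typo)
```

Here plan_a fails twice and is dropped, because the typo fix cannot repair a wrong table choice, while plan_b succeeds after a single code-level repair. A real agent would presumably generate a revised plan rather than pick from a fixed list.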

Outperforms Larger Models on Benchmark Tests

On the Spider2-Snow benchmark, FlexSQL with gpt-oss-120b outperformed strong baselines built on substantially larger models, including DSR-SQL with DeepSeek-R1 and ReFoRCE with o3. When integrated into Claude Code as skills, FlexSQL yields over 10% relative improvement.

Ablation studies confirmed that flexible exploration and flexible execution both contribute to the system's effectiveness.

Key Takeaways

FlexSQL achieves 65.4% accuracy on Spider2-Snow benchmark using gpt-oss-120b, outperforming larger model baselines
System enables flexible database exploration at any point during reasoning, rather than fixed upfront schema retrieval
Two-tiered repair mechanism allows backtracking from code-level to plan-level revisions
Integration with Claude Code produces over 10% relative improvement
