On May 5, 2026, Facebook Research released ProgramBench, a benchmark that reveals a critical limitation in current AI coding capabilities: not a single evaluated model can fully reconstruct a complete working codebase from minimal specifications. The benchmark challenges AI agents to rebuild programs using only compiled binaries and documentation, with no access to source code, the internet, or a prescribed architecture.
Zero Models Achieve Full Program Reconstruction
ProgramBench consists of 200 rigorous, whole-repository generation tasks that turn open-source projects into cleanroom reconstruction challenges. Each task provides an execute-only binary and usage documentation, then evaluates whether the AI can recreate a functionally equivalent program through hidden behavioral tests generated via agent-driven fuzzing.
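Those hidden tests are behavioral: in spirit, the rebuilt program's observable behavior is compared against the original execute-only binary on fuzz-generated inputs. The sketch below illustrates that differential idea; the input generator, flag names, and pass criterion are placeholder assumptions, not ProgramBench's actual harness.

```python
import random
import string
import subprocess

def fuzzed_args(rng: random.Random) -> list[str]:
    # Hypothetical input generator; real tasks would fuzz the interface
    # described in the program's usage documentation.
    word = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(1, 8)))
    return [rng.choice(["--count", "3", word])]

def behavior_matches(reference_bin: str, candidate_bin: str, trials: int = 100) -> bool:
    """Differential check: the rebuilt program must mirror the reference binary."""
    rng = random.Random(0)
    for _ in range(trials):
        args = fuzzed_args(rng)
        ref = subprocess.run([reference_bin, *args], capture_output=True, text=True)
        cand = subprocess.run([candidate_bin, *args], capture_output=True, text=True)
        if (ref.returncode, ref.stdout) != (cand.returncode, cand.stdout):
            return False  # Any behavioral divergence fails the hidden test.
    return True
```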
The results expose a stark gap between code completion and architectural design:
- Fully Resolved: 0% across all evaluated models
- Almost Resolved (≥95% of hidden tests passed): Claude Opus 4.7 leads at just 3.0%
- All other tested models scored below 3% on near-complete solutions (how these thresholds aggregate is sketched below)
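For concreteness, both headline numbers can be read as simple aggregates over per-task pass rates on the hidden test suite. The following is a minimal sketch of that aggregation, assuming per-task pass rates are available as fractions; the function name and input shape are illustrative, not ProgramBench's reporting API.

```python
def headline_metrics(per_task_pass_rates: list[float]) -> dict[str, float]:
    """Aggregate per-task pass rates (0.0 to 1.0) into the two reported metrics.

    "Fully resolved" means every hidden test passed; "almost resolved" means
    at least 95% of them passed.
    """
    n = len(per_task_pass_rates)
    fully = sum(r == 1.0 for r in per_task_pass_rates) / n
    almost = sum(r >= 0.95 for r in per_task_pass_rates) / n
    return {"fully_resolved": fully, "almost_resolved": almost}

# Example: a model that clears >=95% of tests on 6 of 200 tasks but never
# passes all of them scores fully_resolved = 0.0 and almost_resolved = 0.03.
```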
Research Team and Technical Implementation
The benchmark was developed by a 12-person research team: John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, and Ofir Press. The associated research paper (arXiv:2605.03546) details the methodology.
ProgramBench is available as a Python package via PyPI (pip install programbench), with the test dataset hosted on HuggingFace and a public leaderboard at programbench.com. The GitHub repository has accumulated 204 stars and 9 forks since release.
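A typical way to pull such a dataset is through the Hugging Face datasets library. The snippet below is a sketch under that assumption; the dataset identifier and record fields are guesses, so check the project's README or the leaderboard page for the actual names.

```python
# Assumes the Hugging Face `datasets` library is installed; the dataset
# identifier below is a hypothetical placeholder, not confirmed by the project.
from datasets import load_dataset

tasks = load_dataset("facebook/programbench", split="test")  # hypothetical ID
print(len(tasks))       # expected: 200 tasks
print(tasks[0].keys())  # inspect what each task record actually provides
```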
Why Whole-Program Generation Remains Unsolved
The benchmark uses mini-SWE-agent for testing and evaluation, so solutions must pass comprehensive behavioral tests rather than simple compilation checks (a rough sketch of this scoring step follows the list below). The 0% fully-resolved rate indicates that while frontier LLMs excel at code completion, modification, and even complex algorithm implementation, they struggle dramatically with:
- Designing complete system architecture from minimal specifications
- Making consistent implementation decisions across an entire codebase
- Creating working programs that satisfy behavioral requirements without reference implementations
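As a rough illustration of why compilation alone earns nothing, the evaluation can be pictured as: build the candidate repository, then run the hidden behavioral suite against the resulting program, with a task counting as fully resolved only if every test passes. The build command, artifact path, and function signature below are placeholders, not ProgramBench's actual harness.

```python
import subprocess
from pathlib import Path
from typing import Callable

def score_candidate(repo_dir: Path,
                    hidden_tests: list[Callable[[Path], bool]]) -> float:
    """Build a candidate repository, then score it on hidden behavioral tests.

    Returns the fraction of tests passed; 1.0 corresponds to "fully resolved".
    """
    build = subprocess.run(["make", "-C", str(repo_dir)], capture_output=True)
    if build.returncode != 0:
        return 0.0  # A repo that does not even build passes nothing.
    binary = repo_dir / "bin" / "app"  # hypothetical build output
    passed = sum(test(binary) for test in hidden_tests)
    return passed / len(hidden_tests)
```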
These results have significant implications for claims about autonomous software development. The gap between assisting programmers and replacing them remains substantial when measured by whole-program competence rather than isolated coding tasks.
Key Takeaways
- Facebook Research's ProgramBench reveals that none of the evaluated AI models can fully reconstruct complete programs from binaries and documentation alone
- Even the leading model, Claude Opus 4.7, achieves only a 3% almost-resolved rate (≥95% of hidden tests passed)
- The benchmark consists of 200 whole-repository generation tasks evaluated through hidden behavioral tests
- Results expose a critical limitation: LLMs excel at code completion but struggle with architectural design and whole-program implementation
- ProgramBench is open-source (MIT License) and available via PyPI, with public leaderboard at programbench.com