On May 5, 2026, Facebook Research released ProgramBench, a benchmark that reveals a critical limitation in current AI coding capabilities: not a single evaluated model can fully reconstruct a complete working codebase from minimal specifications. The benchmark challenges AI agents to rebuild programs using only compiled binaries and documentation, with no access to source code, the internet, or a prescribed architecture.
Zero Models Achieve Full Program Reconstruction
ProgramBench consists of 200 rigorous, whole-repository generation tasks that turn open-source projects into cleanroom reconstruction challenges. Each task provides an execute-only binary and usage documentation, then evaluates whether the AI can recreate a functionally equivalent program through hidden behavioral tests generated via agent-driven fuzzing.
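Those hidden tests are behavioral: in spirit, the rebuilt program's observable behavior is compared against the original execute-only binary on fuzz-generated inputs. The sketch below illustrates that differential idea; the input generator, flag names, and pass criterion are placeholder assumptions, not ProgramBench's actual harness.

```python
import random
import string
import subprocess

def fuzzed_args(rng: random.Random) -> list[str]:
    # Hypothetical input generator; real tasks would fuzz the interface
    # described in the program's usage documentation.
    word = "".join(rng.choices(string.ascii_lowercase, k=rng.randint(1, 8)))
    return [rng.choice(["--count", "3", word])]

def behavior_matches(reference_bin: str, candidate_bin: str, trials: int = 100) -> bool:
    """Differential check: the rebuilt program must mirror the reference binary."""
    rng = random.Random(0)
    for _ in range(trials):
        args = fuzzed_args(rng)
        ref = subprocess.run([reference_bin, *args], capture_output=True, text=True)
        cand = subprocess.run([candidate_bin, *args], capture_output=True, text=True)
        if (ref.returncode, ref.stdout) != (cand.returncode, cand.stdout):
            return False  # Any behavioral divergence fails the hidden test.
    return True
```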
The results expose a stark gap between code completion and architectural design:
- Fully Resolved: 0% across all evaluated models
- Almost Resolved (≥95% of hidden tests passed): Claude Opus 4.7 leads at just 3.0%
- All other tested models scored below 3% on near-complete solutions (how these thresholds aggregate is sketched below)
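For concreteness, both headline numbers can be read as simple aggregates over per-task pass rates on the hidden test suite. The following is a minimal sketch of that aggregation, assuming per-task pass rates are available as fractions; the function name and input shape are illustrative, not ProgramBench's reporting API.

```python
def headline_metrics(per_task_pass_rates: list[float]) -> dict[str, float]:
    """Aggregate per-task pass rates (0.0 to 1.0) into the two reported metrics.

    "Fully resolved" means every hidden test passed; "almost resolved" means
    at least 95% of them passed.
    """
    n = len(per_task_pass_rates)
    fully = sum(r == 1.0 for r in per_task_pass_rates) / n
    almost = sum(r >= 0.95 for r in per_task_pass_rates) / n
    return {"fully_resolved": fully, "almost_resolved": almost}

# Example: a model that clears >=95% of tests on 6 of 200 tasks but never
# passes all of them scores fully_resolved = 0.0 and almost_resolved = 0.03.
```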
Research Team and Technical Implementation
The benchmark was developed by a 12-person research team: John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, and Ofir Press. The associated research paper (arXiv:2605.03546) details the methodology.
ProgramBench is available as a Python package via PyPI (pip install programbench), with the test dataset hosted on HuggingFace and a public leaderboard at programbench.com. The GitHub repository has accumulated 204 stars and 9 forks since release.
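A typical way to pull such a dataset is through the Hugging Face datasets library. The snippet below is a sketch under that assumption; the dataset identifier and record fields are guesses, so check the project's README or the leaderboard page for the actual names.

```python
# Assumes the Hugging Face `datasets` library is installed; the dataset
# identifier below is a hypothetical placeholder, not confirmed by the project.
from datasets import load_dataset

tasks = load_dataset("facebook/programbench", split="test")  # hypothetical ID
print(len(tasks))       # expected: 200 tasks
print(tasks[0].keys())  # inspect what each task record actually provides
```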
Why Whole-Program Generation Remains Unsolved
The benchmark uses mini-SWE-agent for testing and evaluation, so solutions must pass comprehensive behavioral tests rather than simple compilation checks (a rough sketch of this scoring step follows the list below). The 0% fully-resolved rate indicates that while frontier LLMs excel at code completion, modification, and even complex algorithm implementation, they struggle dramatically with:
- Designing complete system architecture from minimal specifications
- Making consistent implementation decisions across an entire codebase
- Creating working programs that satisfy behavioral requirements without reference implementations
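As a rough illustration of why compilation alone earns nothing, the evaluation can be pictured as: build the candidate repository, then run the hidden behavioral suite against the resulting program, with a task counting as fully resolved only if every test passes. The build command, artifact path, and function signature below are placeholders, not ProgramBench's actual harness.

```python
import subprocess
from pathlib import Path
from typing import Callable

def score_candidate(repo_dir: Path,
                    hidden_tests: list[Callable[[Path], bool]]) -> float:
    """Build a candidate repository, then score it on hidden behavioral tests.

    Returns the fraction of tests passed; 1.0 corresponds to "fully resolved".
    """
    build = subprocess.run(["make", "-C", str(repo_dir)], capture_output=True)
    if build.returncode != 0:
        return 0.0  # A repo that does not even build passes nothing.
    binary = repo_dir / "bin" / "app"  # hypothetical build output
    passed = sum(test(binary) for test in hidden_tests)
    return passed / len(hidden_tests)
```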
These results have significant implications for claims about autonomous software development. The gap between assisting programmers and replacing them remains substantial when measured by whole-program competence rather than isolated coding tasks.
Key Takeaways
- Facebook Research's ProgramBench reveals that none of the evaluated AI models can fully reconstruct complete programs from binaries and documentation alone
- Even the leading model, Claude Opus 4.7, achieves only a 3% almost-resolved rate (≥95% of hidden tests passed)
- The benchmark consists of 200 whole-repository generation tasks evaluated through hidden behavioral tests
- Results expose a critical limitation: LLMs excel at code completion but struggle with architectural design and whole-program implementation
- ProgramBench is open-source (MIT License) and available via PyPI, with public leaderboard at programbench.com