Shepherd Meta-Agent Framework Achieves 54.7% on CooperBench with Git-Like Execution Traces

Researchers from Stanford have introduced Shepherd, a runtime substrate for meta-agent operations that records every agent-environment interaction as a typed event in a Git-like execution trace. According to the paper, posted to arXiv on May 11, 2026, the system reaches a 54.7% pair coding pass rate on CooperBench through runtime intervention, nearly double the 28.8% baseline.

Formal Verification Meets Agent Infrastructure

Shepherd formalizes meta-agent operations on target agents as functions, with core operations mechanized in Lean for formal verification. Mechanization lets the stated properties of those core operations be machine-checked, which addresses reliability concerns in agentic systems.

The researchers explain that Shepherd "records every agent-environment interaction as a typed event in a Git-like execution trace, enabling any past state to be forked and replayed." They report the following performance characteristics:

Agent process and filesystem forking 5× faster than Docker
Greater than 95% prompt-cache reuse on replay
Efficient exploration of counterfactual execution paths without full process recreation
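
To make the fork-and-replay idea concrete, here is a minimal sketch of a typed event trace. It is not Shepherd's API: the names Event, Trace, record, fork, and replay are hypothetical, the trace lives in memory only, and a list index stands in for a Git-style commit id. Real process and filesystem forking is outside its scope.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical event types; the source only says events are "typed".
EventKind = Literal["action", "observation"]


@dataclass(frozen=True)
class Event:
    kind: EventKind  # type tag of the event
    payload: str     # what the agent did or what the environment returned


@dataclass
class Trace:
    events: list[Event] = field(default_factory=list)

    def record(self, kind: EventKind, payload: str) -> int:
        """Append an event and return its index (a stand-in for a commit id)."""
        self.events.append(Event(kind, payload))
        return len(self.events) - 1

    def fork(self, at: int) -> "Trace":
        """Return a new trace that copies history up to and including index `at`."""
        return Trace(list(self.events[: at + 1]))

    def replay(self) -> list[str]:
        """Read back recorded observations instead of re-running the environment."""
        return [e.payload for e in self.events if e.kind == "observation"]


main = Trace()
main.record("action", "ls")
checkpoint = main.record("observation", "a.py b.py")
main.record("action", "rm a.py")

# Fork from the checkpoint and try a different action on the branch.
branch = main.fork(checkpoint)
branch.record("action", "cat a.py")
```

The branch shares the first two events with the main trace and diverges at the third, which is the property that makes counterfactual exploration possible without rebuilding the earlier history.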

Three Demonstrated Applications Show Practical Impact

The research team demonstrated three applications of the Shepherd framework:

Runtime Intervention: A live supervisor raises pair coding pass rates on CooperBench from 28.8% to 54.7% by monitoring and correcting agent behavior in real time.
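
The monitor-and-correct pattern can be sketched as a loop in which a supervisor reviews each proposed action before it runs. This is an illustration under assumptions, not the paper's supervisor: the function names supervise and no_force_push are hypothetical, and actions are modeled as plain strings.

```python
from typing import Callable, Iterable, Optional

# A reviewer returns a replacement action, or None to let the action through.
Reviewer = Callable[[str], Optional[str]]


def supervise(actions: Iterable[str], review: Reviewer) -> list[str]:
    """Pass each proposed action through the reviewer before it is executed."""
    executed = []
    for proposed in actions:
        correction = review(proposed)
        executed.append(correction if correction is not None else proposed)
    return executed


def no_force_push(action: str) -> Optional[str]:
    """Hypothetical policy: replace an unconditional force push with a safer one."""
    if action == "git push --force":
        return "git push --force-with-lease"
    return None


result = supervise(["git add .", "git push --force"], no_force_push)
```

Here the first action passes through unchanged and the second is rewritten before execution.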

Counterfactual Meta-Optimization: Branching exploration outperforms baselines across four benchmarks by up to 11 points while reducing wall-clock time by up to 58%. This approach enables agents to explore alternative decision paths efficiently.
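
A toy model shows why branching from a shared prefix can save time. The sketch below assumes every step costs the same and that a fork reuses the prefix at no extra cost. The function names and the scoring rule are hypothetical, and the step counts are illustrative only; they are not the source of the 58% figure.

```python
from typing import Callable, Sequence


def explore(
    prefix: Sequence[str],
    candidates: Sequence[str],
    score: Callable[[Sequence[str]], float],
) -> list[str]:
    """Extend one shared prefix with each candidate step and keep the best branch."""
    branches = [list(prefix) + [step] for step in candidates]
    return max(branches, key=score)


def steps_with_replay(prefix_len: int, num_branches: int) -> int:
    """Steps executed if every branch re-runs the prefix from scratch."""
    return num_branches * (prefix_len + 1)


def steps_with_forking(prefix_len: int, num_branches: int) -> int:
    """Steps executed if the prefix runs once and each branch adds one step."""
    return prefix_len + num_branches


prefix = ["read spec", "open editor"]
best = explore(
    prefix,
    ["write tests", "skip tests"],
    lambda branch: 1.0 if branch[-1] == "write tests" else 0.0,
)
```

With a 10-step prefix and 4 branches, this model counts 44 steps when every branch starts over and 14 steps when the branches fork from the shared prefix.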

Tree-RL Training: Forking rollouts at selected turns improves TerminalBench-2 performance from 34.2% to 39.4%, demonstrating that lightweight branching enables more effective reinforcement learning for agent systems.
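
One way forked rollouts can feed a learning signal is to compare sibling branches that share a fork point. The sketch below uses a sibling-mean baseline, which is an assumption chosen for illustration; the article does not say how Shepherd's Tree-RL computes advantages, and the function name is hypothetical.

```python
from statistics import mean


def sibling_advantages(rewards: list[float]) -> list[float]:
    """Score each branch relative to the mean reward of branches from the same fork."""
    baseline = mean(rewards)
    return [reward - baseline for reward in rewards]


# Four rollouts forked at the same turn; two succeed (reward 1.0), two fail (0.0).
advantages = sibling_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the siblings share everything before the fork, a difference in reward can be attributed to what happened after it. That is the intuition behind branching at selected turns.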

Open-Source Release Supports Future Research

The team has open-sourced the Shepherd system to support future research in meta-agent infrastructure. The researchers state: "These results establish Shepherd as an efficient infrastructure for programming meta-agents."

The Git-like execution model gives developers a familiar mental model while enabling meta-agent capabilities such as time-travel debugging, counterfactual reasoning, and efficient state exploration. The Lean mechanization of the core operations is intended to keep those operations correct as the systems built on them grow more complex.

Key Takeaways

Shepherd is a meta-agent runtime that records agent interactions as typed events in a Git-like execution trace, with core operations mechanized in Lean for formal verification
Runtime intervention with a live supervisor increases pair coding success from 28.8% to 54.7% on CooperBench
The system forks agent processes and filesystems 5× faster than Docker, with greater than 95% prompt-cache reuse on replay
Counterfactual meta-optimization outperforms baselines by up to 11 points while reducing wall-clock time by up to 58%
The Shepherd system has been open-sourced to support future meta-agent research
