Developer jkool702 released forkrun on GitHub on March 27, 2026, describing it as the culmination of a 10-year journey to optimize shell parallelization. The lock-free, SIMD-accelerated tool achieves 200,000+ batch dispatches per second on a 14-core system—400x faster than GNU Parallel—while maintaining 95-99% CPU utilization across all cores even with near-zero workloads.
Revolutionary NUMA-Aware Architecture Eliminates Performance Bottlenecks
Forkrun combines several techniques that traditional parallelization tools lack:
Born-local NUMA: Input data is splice()'d into a shared memfd, and its pages are placed on their target NUMA nodes via set_mempolicy(MPOL_BIND) before any worker touches them. Workers on each node claim only work whose pages were born local to that node, which eliminates cross-node memory traffic.
SIMD scanning: Per-node indexers use AVX2/NEON instructions to find line boundaries at speeds approaching memory bandwidth, dramatically faster than byte-by-byte scanning.
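As a rough illustration of the SIMD scanning idea, the C sketch below records newline offsets 32 bytes at a time with AVX2 when the compiler targets it (for example with -mavx2) and falls back to a plain byte loop otherwise. The function name find_newlines and its signature are hypothetical and are not taken from forkrun's source; the NEON path is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not forkrun's API): store the offset of every '\n'
 * in buf[0..len) into out[], writing at most cap entries.
 * Returns the number of offsets written. */
size_t find_newlines(const char *buf, size_t len, size_t *out, size_t cap)
{
    assert(buf != NULL && out != NULL);
    size_t n = 0, i = 0;

#if defined(__AVX2__)
    /* Vector path: compare 32 bytes against '\n' in one instruction. */
    const __m256i nl = _mm256_set1_epi8('\n');
    for (; i + 32 <= len; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        /* One bit per byte: bit k is set when buf[i + k] == '\n'. */
        unsigned mask =
            (unsigned)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, nl));
        while (mask) {
            if (n == cap)
                return n;
            out[n++] = i + (size_t)__builtin_ctz(mask); /* lowest set bit */
            mask &= mask - 1;                           /* clear that bit */
        }
    }
#endif

    /* Scalar path: handles the tail, or the whole buffer without AVX2. */
    for (; i < len; i++) {
        if (buf[i] == '\n') {
            if (n == cap)
                return n;
            out[n++] = i;
        }
    }
    return n;
}
```

The vector loop does one comparison per 32 input bytes and spends per-byte work only on the matches, which is why this style of scan can get much closer to memory bandwidth than a byte-by-byte loop.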
Lock-free claiming: Workers claim batches with a single atomic_fetch_add operation—no locks, no compare-and-swap retry loops that cause contention.
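The claiming step can be sketched with C11 atomics. This is a minimal illustration under assumed names (work_queue, queue_init, claim_batch), not forkrun's actual data structure: a shared counter is advanced with one atomic_fetch_add, and the value it returns identifies a range that no other caller can receive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state: a cursor over `total` work items. */
typedef struct {
    atomic_size_t next; /* index of the first item nobody has claimed yet */
    size_t total;       /* number of items available */
} work_queue;

void queue_init(work_queue *q, size_t total)
{
    atomic_init(&q->next, 0);
    q->total = total;
}

/* Claim up to `batch` items. On success, [*start, *end) belongs exclusively
 * to the caller. Returns false once all items have been handed out.
 * There is no lock and no retry loop: the single fetch-add both reads the
 * cursor and advances it, so concurrent callers get disjoint ranges. */
bool claim_batch(work_queue *q, size_t batch, size_t *start, size_t *end)
{
    assert(batch > 0);
    size_t s = atomic_fetch_add(&q->next, batch);
    if (s >= q->total)
        return false; /* cursor already past the end: nothing left */
    *start = s;
    /* The last batch may be short. */
    *end = (q->total - s < batch) ? q->total : s + batch;
    return true;
}
```

Each worker would call claim_batch in a loop until it returns false. Because the counter only moves forward, a thread that arrives second never retries; it simply receives the next range.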
Memory management: A background thread calls fallocate() with FALLOC_FL_PUNCH_HOLE to reclaim space without shifting the logical offsets that the rest of the system depends on.
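To show why hole punching leaves offsets intact, here is a small Linux-only C sketch. The helper name reclaim_consumed and the idea of freeing a consumed prefix are assumptions made for illustration, not details confirmed by the forkrun source. It relies on FALLOC_FL_KEEP_SIZE, which the kernel requires alongside FALLOC_FL_PUNCH_HOLE, so the file length and every offset beyond the hole stay the same.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not forkrun's API): release the backing pages for
 * bytes [0, consumed) of fd, rounded down to a whole number of pages so
 * that no unconsumed byte is zeroed. Returns 0 on success, -1 on error
 * with errno set by fallocate(). */
int reclaim_consumed(int fd, off_t consumed)
{
    assert(consumed >= 0);
    long page = sysconf(_SC_PAGESIZE);
    off_t len = consumed - (consumed % page);
    if (len == 0)
        return 0; /* less than one full page consumed: nothing to free */

    /* PUNCH_HOLE deallocates the range; KEEP_SIZE preserves the file
     * length, so data after the hole is still found at its old offset. */
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, len);
}
```

After a successful call, reads inside the punched range return zeros, while data past the hole is still read from the same offset as before.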
Benchmarks Show 50-400x Performance Improvements
On the developer's 14-core/28-thread i9-7940X system, forkrun demonstrates:
- 200,000+ batch dispatches per second versus ~500 for GNU Parallel
- 95-99% CPU utilization across all 28 logical cores with bash no-ops, compared to ~6% for GNU Parallel
- Typically 50-400x faster on high-frequency, low-latency workloads
- In its fastest mode (-b), throughput can exceed 1 billion lines per second
The benchmarks intentionally use near-zero work per task to measure the parallelization framework's overhead rather than external tool performance.
Drop-In Replacement Ships as Single Self-Extracting Bash File
Forkrun ships as a single bash file with an embedded, self-extracting C extension, so it needs no Perl, no Python, and no complex installation. The tool serves as a mostly drop-in replacement for xargs -P and GNU Parallel, with full native support for parallelizing arbitrary shell functions. Binaries are built in public GitHub Actions for transparency. Installation requires just two commands to source the script and begin using frun.
Key Takeaways
- Forkrun achieves 200,000+ batch dispatches per second on a 14-core system, approximately 400x faster than GNU Parallel's ~500 dispatches per second
- The tool maintains 95-99% CPU utilization across all cores even with near-zero workloads, compared to GNU Parallel's ~6% utilization in the same scenario
- NUMA-aware architecture with born-local memory placement, SIMD scanning, and lock-free claiming eliminates traditional parallelization bottlenecks
- Ships as a single self-extracting bash file with embedded C extension, requiring no Perl, Python, or complex installation
- Represents the culmination of a 10-year optimization journey and can exceed 1 billion lines per second in fastest mode