Forkrun: NUMA-Aware Shell Parallelizer Achieves 200,000+ Batch Dispatches Per Second

Developer jkool702 released forkrun on GitHub on March 27, 2026, describing it as the culmination of a 10-year journey to optimize shell parallelization. The lock-free, SIMD-accelerated tool reaches 200,000+ batch dispatches per second on a 14-core system, roughly 400x the ~500 per second measured for GNU Parallel, while keeping all cores at 95-99% CPU utilization even when each task does almost no work.

Revolutionary NUMA-Aware Architecture Eliminates Performance Bottlenecks

Forkrun combines four techniques that traditional parallelization tools lack:

Born-local NUMA: Input data is splice()'d into a shared memfd, and its pages are placed on their target NUMA nodes via set_mempolicy(MPOL_BIND) before any worker touches them. Workers on each NUMA node claim only work whose data is already local to that node, eliminating cross-node memory traffic.
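The placement idea can be sketched in C as follows. This is a minimal illustration under stated assumptions, not forkrun's actual code: the helper names make_input_memfd and place_chunk_on_node are hypothetical, the sketch fills the memfd with pwrite() rather than splice() to stay short, and it calls set_mempolicy through the raw syscall so it does not depend on libnuma. Binding is treated as best-effort because some sandboxes and single-node machines reject the call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/mempolicy.h>   /* MPOL_BIND, MPOL_DEFAULT */
#include <sys/mman.h>          /* memfd_create, MFD_CLOEXEC */
#include <sys/syscall.h>       /* SYS_set_mempolicy */
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper over the raw syscall so the sketch does not need libnuma. */
static long set_policy(int mode, const unsigned long *mask, unsigned long maxnode)
{
    return syscall(SYS_set_mempolicy, mode, mask, maxnode);
}

/* Hypothetical helper: create an anonymous shared buffer of `size` bytes.
 * Returns a file descriptor, or -1 on failure. */
int make_input_memfd(off_t size)
{
    int fd = memfd_create("input", MFD_CLOEXEC);
    if (fd >= 0 && ftruncate(fd, size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: write `len` bytes at offset `off` while the calling
 * thread's memory policy is bound to `node`, so that pages first allocated
 * by this write are requested from that node.
 * Returns 1 if the data was written with the binding in place, 0 if it was
 * written but the binding was unavailable, and -1 on failure. */
int place_chunk_on_node(int memfd, off_t off, const void *src, size_t len, int node)
{
    assert(src != NULL);
    if (node < 0 || (size_t)node >= sizeof(unsigned long) * 8)
        return -1;

    unsigned long mask = 1UL << node;
    /* The kernel reads maxnode - 1 bits, hence the + 1. */
    int bound = set_policy(MPOL_BIND, &mask, sizeof(mask) * 8 + 1) == 0;

    ssize_t written = pwrite(memfd, src, len, off);

    /* Restore the default policy so later allocations are not pinned. */
    if (bound)
        set_policy(MPOL_DEFAULT, NULL, 0);

    if (written != (ssize_t)len)
        return -1;
    return bound;
}
```

A caller would create the buffer once, then place each chunk on the node whose workers are meant to process it. Whether the pages actually land on the requested node depends on kernel and cpuset configuration, which this sketch does not verify.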

SIMD scanning: Per-node indexers use AVX2/NEON instructions to find line boundaries at speeds approaching memory bandwidth, dramatically faster than byte-by-byte scanning.
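As a rough illustration of SIMD line scanning, the sketch below compares 32 bytes at a time against '\n' with AVX2 and turns the comparison result into a bitmask of newline positions. It is an assumption-laden example rather than forkrun's indexer: the function names are hypothetical, only the x86-64 AVX2 path is shown (no NEON), and a scalar loop serves as the fallback when AVX2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar fallback: record the offset of every '\n' in buf[0..len). */
static size_t find_newlines_scalar(const char *buf, size_t len,
                                   size_t *out, size_t cap)
{
    size_t n = 0;
    for (size_t i = 0; i < len && n < cap; i++)
        if (buf[i] == '\n')
            out[n++] = i;
    return n;
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* AVX2 path: test 32 bytes per iteration, then walk the set bits. */
__attribute__((target("avx2")))
static size_t find_newlines_avx2(const char *buf, size_t len,
                                 size_t *out, size_t cap)
{
    const __m256i nl = _mm256_set1_epi8('\n');
    size_t n = 0, i = 0;

    for (; i + 32 <= len; i += 32) {
        __m256i chunk = _mm256_loadu_si256((const __m256i *)(buf + i));
        uint32_t mask =
            (uint32_t)_mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, nl));
        while (mask != 0) {
            if (n == cap)
                return n;
            out[n++] = i + (size_t)__builtin_ctz(mask);
            mask &= mask - 1;   /* clear the lowest set bit */
        }
    }
    /* Tail of fewer than 32 bytes: finish with the scalar loop. */
    for (; i < len && n < cap; i++)
        if (buf[i] == '\n')
            out[n++] = i;
    return n;
}
#endif

/* Hypothetical entry point: writes up to `cap` newline offsets into `out`
 * and returns how many were written. */
size_t find_newlines(const char *buf, size_t len, size_t *out, size_t cap)
{
    assert(buf != NULL || len == 0);
    assert(out != NULL || cap == 0);
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx2"))
        return find_newlines_avx2(buf, len, out, cap);
#endif
    return find_newlines_scalar(buf, len, out, cap);
}
```

The vector loop does one load, one compare, and one movemask per 32 bytes, which is why scanners of this shape tend to be limited by memory bandwidth rather than by instruction count.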

Lock-free claiming: Workers claim batches with a single atomic_fetch_add operation, with no locks and no compare-and-swap retry loops to cause contention.
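A minimal C11 sketch of claiming by fetch-and-add is shown below. The claim_queue type and its field names are hypothetical and chosen for illustration; the only part taken from the description above is that a worker obtains its batch with one atomic_fetch_add and never retries.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared state: `next` is the index of the first unclaimed
 * item, `total` is the number of items, `batch` is the claim size. */
typedef struct {
    atomic_size_t next;
    size_t total;
    size_t batch;
} claim_queue;

void claim_queue_init(claim_queue *q, size_t total, size_t batch)
{
    assert(batch > 0);
    atomic_init(&q->next, 0);
    q->total = total;
    q->batch = batch;
}

/* Claim the next batch. One fetch-add reserves [start, start + batch) for
 * the caller; no other worker can receive the same range, so there is no
 * lock and no compare-and-swap retry loop.
 * Returns 1 and fills start/count on success, 0 when no work is left. */
int claim_batch(claim_queue *q, size_t *start, size_t *count)
{
    size_t s = atomic_fetch_add_explicit(&q->next, q->batch,
                                         memory_order_relaxed);
    if (s >= q->total)
        return 0;

    size_t left = q->total - s;
    *start = s;
    *count = left < q->batch ? left : q->batch;
    return 1;
}
```

Each worker thread would call claim_batch in a loop until it returns 0. Because every call advances the counter unconditionally, contention only costs the latency of one atomic instruction per batch. The relaxed ordering used here is enough to hand out disjoint ranges; publishing the data behind those ranges to other threads would need its own synchronization.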

Memory management: A background thread calls fallocate() with FALLOC_FL_PUNCH_HOLE to reclaim space without breaking the logical offset system.
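The reason hole punching leaves offsets intact can be shown with a short C sketch. It assumes, for illustration only, that the space being reclaimed is a consumed prefix of a memfd; the helper names are hypothetical. Combining FALLOC_FL_PUNCH_HOLE with FALLOC_FL_KEEP_SIZE frees the backing pages while the file length, and therefore every later offset, stays the same.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* fallocate, FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <sys/mman.h>   /* memfd_create */
#include <sys/stat.h>   /* fstat */
#include <sys/types.h>
#include <unistd.h>     /* sysconf */

/* Hypothetical helper: an anonymous in-memory file to experiment with. */
int new_scratch_buffer(void)
{
    return memfd_create("scratch", MFD_CLOEXEC);
}

/* Bytes of storage currently backing fd (st_blocks counts 512-byte units). */
off_t allocated_bytes(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return (off_t)st.st_blocks * 512;
}

/* Hypothetical helper: release the storage behind bytes [0, consumed).
 * Only whole pages are released, so the range is rounded down to a page
 * boundary. Returns the number of bytes punched, or -1 on error. */
off_t reclaim_consumed(int fd, off_t consumed)
{
    assert(consumed >= 0);
    off_t page = (off_t)sysconf(_SC_PAGESIZE);
    off_t aligned = consumed - (consumed % page);
    if (aligned == 0)
        return 0;

    /* KEEP_SIZE leaves the file length untouched, so offsets held by
     * readers of the remaining data stay valid. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  0, aligned) != 0)
        return -1;
    return aligned;
}
```

After a successful call, reads from the punched range return zeros and data past it is unchanged at its original offset. Not every filesystem supports hole punching; tmpfs-backed memfds on Linux generally do.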

Benchmarks Show 50-400x Performance Improvements

On the developer's 14-core/28-thread i9-7940X system, forkrun demonstrates:

200,000+ batch dispatches per second versus ~500 for GNU Parallel
95-99% CPU utilization across all 28 logical cores with bash no-ops, compared to ~6% for GNU Parallel
Typically 50-400x faster on high-frequency, low-latency workloads
In fastest mode (-b), can exceed 1 billion lines per second

The benchmarks intentionally use near-zero work per task to measure the parallelization framework's overhead rather than external tool performance.
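To put the dispatch rates in per-task terms, the small C sketch below converts a rate into the wall-clock time available per batch and computes the ratio between two rates. The function names are hypothetical, and the inputs are the rounded figures quoted above ("200,000+" and "~500"), so the results are approximations rather than measurements.

```c
#include <assert.h>

/* Wall-clock microseconds available per batch at a given dispatch rate
 * (batches per second, summed over all workers). */
double dispatch_budget_us(double dispatches_per_sec)
{
    assert(dispatches_per_sec > 0.0);
    return 1e6 / dispatches_per_sec;
}

/* Ratio between two dispatch rates. */
double rate_ratio(double faster, double slower)
{
    assert(slower > 0.0);
    return faster / slower;
}
```

With the quoted figures, dispatch_budget_us(200000.0) gives 5 microseconds per batch and dispatch_budget_us(500.0) gives 2,000 microseconds, and rate_ratio(200000.0, 500.0) gives the 400x at the top of the reported 50-400x range.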

Drop-In Replacement Ships as Single Self-Extracting Bash File

Forkrun ships as a single bash file with an embedded, self-extracting C extension, so it needs no Perl, no Python, and no complex installation. The tool serves as a mostly drop-in replacement for xargs -P and GNU Parallel, with native support for parallelizing arbitrary shell functions. Binaries are built in public GitHub Actions for transparency. Installation requires just two commands to source the script and begin using frun.

Key Takeaways

Forkrun achieves 200,000+ batch dispatches per second on a 14-core system, approximately 400x faster than GNU Parallel's ~500 dispatches per second
The tool maintains 95-99% CPU utilization across all cores even with near-zero workloads, compared to GNU Parallel's ~6% utilization in the same scenario
NUMA-aware architecture with born-local memory placement, SIMD scanning, and lock-free claiming eliminates traditional parallelization bottlenecks
Ships as a single self-extracting bash file with embedded C extension, requiring no Perl, Python, or complex installation
Represents the culmination of a 10-year optimization journey and can exceed 1 billion lines per second in fastest mode
