Language models demonstrate strong generalization to novel environments but consistently fail when tasks require longer sequential reasoning, according to research published on arXiv on April 16, 2026. The study, by Yao Tong, Jiayuan Ye, Anastasia Borovykh, and Reza Shokri, reveals a striking asymmetry in how LLMs generalize across different problem dimensions.
Controlled Environment Isolates Generalization Factors
The researchers created a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. This setup enabled testing along two distinct axes:
- Spatial transfer: Performance on unseen maps and environments
- Length scaling: Handling longer-horizon planning problems
By isolating these variables, the study separates the aspects of model performance that reflect fundamental capability from those that are artifacts of training coverage.
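To make the setup concrete, here is a minimal sketch of such a synthetic environment. The map format, grid size, and sampling parameters are our assumptions for illustration; the paper's exact environment and tokenization may differ.

```python
import random
from collections import deque

def random_grid(n, wall_prob, seed):
    """n x n grid of cells; True = free. (Illustrative map format,
    not the paper's exact environment.)"""
    rng = random.Random(seed)
    return [[rng.random() > wall_prob for _ in range(n)] for _ in range(n)]

def shortest_path(grid, start, goal):
    """BFS shortest path on a 4-connected grid; returns a cell list or None."""
    n = len(grid)
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for step in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = step
            if 0 <= nr < n and 0 <= nc < n and grid[nr][nc] and step not in prev:
                prev[step] = cell
                queue.append(step)
    return None

# The two evaluation axes the study separates:
seen_map   = random_grid(8, 0.2, seed=0)  # maps like those seen in training
unseen_map = random_grid(8, 0.2, seed=1)  # spatial transfer: a new map
# Length scaling: same map distribution, but start/goal pairs chosen so
# that the optimal path (the planning horizon) is much longer.
```

Ground-truth paths from BFS make both axes easy to control: hold out whole maps to test spatial transfer, or filter problems by optimal path length to test horizon scaling.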
Strong Spatial Transfer, Consistent Length Scaling Failures
The experiments revealed a clear asymmetry in how models generalize:
- Models demonstrate strong spatial transfer to novel environments they have never encountered
- Models consistently fail under length scaling due to recursive instability in sequential reasoning
This pattern suggests that current transformer architectures have fundamental limitations for tasks requiring extended sequential reasoning, even when provided with relevant training data.
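A simple back-of-the-envelope model (ours, not the paper's analysis) conveys the intuition behind recursive instability: if each step of a plan conditions on the model's own previous outputs, and each step succeeds independently with probability p, whole-plan success decays exponentially with the horizon T.

```python
def plan_success(p, T):
    """Toy model: a T-step plan succeeds iff every step succeeds
    independently with per-step accuracy p (an assumption made for
    illustration, not the paper's measurement)."""
    return p ** T

for T in (5, 20, 80):
    print(f"T={T}: {plan_success(0.98, T):.3f}")
# Even at 98% per-step accuracy: ~0.90 at T=5, ~0.67 at T=20, ~0.20 at T=80
```

Under this model, spatial transfer to a new map leaves T unchanged and so costs little, while length scaling attacks exactly the exponent, which is consistent with the asymmetry the study reports.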
Pipeline Analysis Reveals Distinct Component Roles
The study analyzed different stages of the training and inference pipeline:
- Data coverage determines fundamental capability limits
- Reinforcement learning enhances training stability without expanding those limits
- Inference-time scaling improves performance but cannot resolve length-scaling failures
These findings indicate that simply adding more compute at inference time or using reinforcement learning cannot overcome the recursive instability problem.
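The same toy model suggests why inference-time scaling helps without resolving the failure. Taking best-of-k sampling as an assumed inference-scaling strategy (the paper's exact method may differ), success rises to 1 - (1 - p^T)^k, but because the per-sample term p^T shrinks exponentially in T, the number of samples needed to hit a fixed target grows rapidly with horizon.

```python
import math

def best_of_k(p, T, k):
    """Success if any of k independent samples yields a correct T-step plan
    (toy model with per-step accuracy p)."""
    return 1 - (1 - p ** T) ** k

def samples_needed(p, T, target=0.9):
    """Smallest k with best_of_k(p, T, k) >= target (toy model)."""
    single = p ** T
    return math.ceil(math.log(1 - target) / math.log(1 - single))

print(samples_needed(0.98, 20))   # short horizon: a handful of samples
print(samples_needed(0.98, 200))  # 10x the horizon: far more samples
```

More compute buys more chances at a correct plan, but each chance is exponentially unlikely at long horizons, so sampling alone cannot flatten the length-scaling curve.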
Implications for Agentic AI Systems
The recursive instability finding has major implications for agentic AI systems that need to plan over long horizons. Current transformer architectures appear to have fundamental limitations for extended sequential reasoning tasks, which affects:
- Multi-step planning and decision-making
- Long-horizon task completion
- Complex problem decomposition requiring many sequential steps
The research clarifies why LLMs struggle with systematic generalization despite strong in-domain performance, and points researchers toward the architectural changes that might address these limitations.
Key Takeaways
- LLMs show strong spatial transfer to novel environments but fail consistently at length scaling due to recursive instability
- Data coverage determines capability limits, while reinforcement learning only enhances stability within those limits
- Inference-time compute scaling improves performance but cannot resolve fundamental length-scaling failures
- Current transformer architectures have fundamental limitations for tasks requiring extended sequential reasoning
- Findings have major implications for agentic AI systems requiring multi-step planning and long-horizon task completion