Researchers from Carnegie Mellon and partner institutions released iOSWorld on June 8, 2026, a new benchmark exposing critical limitations in mobile AI agents. The benchmark includes 133 tasks across 26 custom iOS apps containing interconnected personal data, requiring agents to reason over user identity, history, and preferences rather than complete isolated tasks in sandboxes.
Existing Benchmarks Miss Personalization Requirements
The researchers identify a fundamental gap in mobile agent evaluation: current benchmarks test task completion in isolated environments but ignore personalization. The team argues that "a useful phone agent needs to be personally intelligent" and must reason over data as it exists on actual devices, including transactions, messages, travel records, social relationships, and financial activity spanning a simulated user's life.
iOSWorld's 26 newly built iOS apps contain connected data that spans these domains, creating realistic scenarios where information from one app informs decisions in another.
Three Difficulty Levels Reveal Multi-App Performance Gap
The 133 tasks span three difficulty levels:
- Single-app tasks (27): Test one app in isolation
- Multi-app tasks (60): Span 2-8 apps requiring cross-app reasoning
- Memory and personalization tasks (46): Require agents to infer patterns from personal data
Evaluation results reveal significant limitations in current agent architectures. The best overall success rate was only 52%, and success on multi-app tasks reached just 37%, showing that agents struggle to reason over interconnected personal information. These results align with broader research showing that existing mobile agents struggle on personalized tasks, leaving substantial room for improvement.
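To make the task breakdown concrete, the short Python sketch below checks that the three categories add up to 133 and prints each category's share of the benchmark. The `overall_success` helper is an assumption: the article does not say how the overall score is aggregated, so a plain task-weighted average is used here purely for illustration, and the dictionary keys are made-up names.

```python
# Task counts per difficulty level, as reported in the article.
# The key names are illustrative, not identifiers from the benchmark.
TASK_COUNTS = {
    "single_app": 27,
    "multi_app": 60,
    "memory_personalization": 46,
}


def category_shares(counts):
    """Return each category's fraction of the total task count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def overall_success(counts, per_category_rate):
    """Task-weighted average of per-category success rates in [0, 1].

    Assumption: the benchmark's real aggregation may differ.
    """
    total = sum(counts.values())
    return sum(counts[name] * per_category_rate[name] for name in counts) / total


if __name__ == "__main__":
    assert sum(TASK_COUNTS.values()) == 133
    for name, share in category_shares(TASK_COUNTS).items():
        print(f"{name}: {share:.1%}")
```

Multi-app tasks make up roughly 45% of the benchmark, so under a task-weighted average a weak multi-app score would pull the overall number down more than either of the other two categories.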
Vision Plus Accessibility Data Improves Frontier Models
The evaluation tested combinations of visual and XML accessibility information as agent input. Combining vision with XML accessibility information improved frontier models by up to 26 percentage points, while smaller models showed minimal benefit from the added accessibility input.
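As a rough illustration of what "vision plus XML accessibility information" could look like as agent input, the Python sketch below flattens an accessibility tree into indented text and attaches it to a screenshot reference. Everything here is hypothetical: the XML element and attribute names, the sample screen, and the `build_observation` structure are invented for this example and are not taken from iOSWorld's actual format or code.

```python
import xml.etree.ElementTree as ET

# Hypothetical accessibility snapshot. The tag and attribute names are
# invented for illustration and do not reflect iOSWorld's real schema.
SAMPLE_XML = """
<screen app="Wallet">
  <button label="Send money" x="40" y="620"/>
  <text label="Balance: $1,204.18" x="40" y="120"/>
  <list label="Recent transactions">
    <cell label="Coffee, $4.50" x="40" y="300"/>
  </list>
</screen>
"""


def flatten_accessibility_tree(xml_text):
    """Turn an XML accessibility tree into indented 'tag: label' lines."""
    root = ET.fromstring(xml_text.strip())
    lines = []

    def walk(node, depth):
        label = node.get("label", "")
        lines.append(f"{'  ' * depth}{node.tag}: {label}".rstrip())
        for child in node:
            walk(child, depth + 1)

    walk(root, 0)
    return lines


def build_observation(screenshot_path, xml_text=None):
    """Build a vision-only or vision-plus-XML observation (hypothetical shape)."""
    observation = {"screenshot": screenshot_path}
    if xml_text is not None:
        observation["accessibility"] = "\n".join(flatten_accessibility_tree(xml_text))
    return observation


if __name__ == "__main__":
    print(build_observation("screen_001.png", SAMPLE_XML)["accessibility"])
```

Passing `xml_text=None` yields the vision-only condition, while supplying the XML adds a text channel with element labels that a model can read directly instead of inferring them from pixels.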
The researchers released iOSWorld as open source, including all apps, seeded data, tasks, rubrics, and evaluation code, to enable community progress on personally intelligent agents.
Key Takeaways
- iOSWorld includes 133 tasks across 26 custom iOS apps testing reasoning over personal user data
- Best overall agent success rate reached 52%, while success on multi-app tasks reached only 37%
- Multi-app tasks spanning 2-8 applications expose fundamental limitations in current agent architectures
- Vision combined with XML accessibility information improved frontier models by up to 26 percentage points
- The benchmark is fully open source, with apps, seeded data, tasks, rubrics, and evaluation code available for research