iOSWorld Benchmark Tests AI Agents on Personalized Tasks Across 26 iOS Apps

On June 8, 2026, researchers from Carnegie Mellon and partner institutions released iOSWorld, a new benchmark that exposes critical limitations in mobile AI agents. The benchmark includes 133 tasks across 26 custom iOS apps containing interconnected personal data, requiring agents to reason over user identity, history, and preferences rather than complete isolated tasks in sandboxes.

Existing Benchmarks Miss Personalization Requirements

The researchers identify a fundamental gap in mobile agent evaluation: current benchmarks test task completion in isolated environments but ignore personalization. The team argues that "a useful phone agent needs to be personally intelligent" and must reason over data as it exists on actual devices: transactions, messages, travel records, social relationships, and financial activity spanning a simulated user's life.

iOSWorld's 26 newly built iOS apps contain connected data that spans these domains, creating realistic scenarios where information from one app informs decisions in another.
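
To make the idea of connected data concrete, the following is a minimal Python sketch of a cross-app question: how much did the user spend during a given trip? It assumes two hypothetical apps (a travel app and a banking app). The record types, field names, and values are invented for illustration and are not taken from iOSWorld's seeded data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Trip:
    """A record from a hypothetical travel app."""
    destination: str
    start: date
    end: date


@dataclass(frozen=True)
class Transaction:
    """A record from a hypothetical banking app."""
    merchant: str
    amount: float
    day: date


def spending_during_trip(trip: Trip, transactions: list[Transaction]) -> float:
    """Sum the transactions whose date falls inside the trip (inclusive)."""
    return sum(t.amount for t in transactions if trip.start <= t.day <= trip.end)


# Illustrative data only; not drawn from the benchmark.
trip = Trip("Lisbon", date(2026, 3, 10), date(2026, 3, 14))
transactions = [
    Transaction("Cafe", 12.5, date(2026, 3, 11)),
    Transaction("Museum", 20.0, date(2026, 3, 13)),
    Transaction("Grocery", 54.0, date(2026, 3, 20)),
]
total = spending_during_trip(trip, transactions)
print(f"Spent during the {trip.destination} trip: {total:.2f}")
```

Neither app can answer the question alone: the travel record supplies the date range, and the banking records supply the amounts.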

Three Difficulty Levels Reveal Multi-App Performance Gap

The 133 tasks span three difficulty levels:

Single-app tasks (27): Test one app in isolation
Multi-app tasks (60): Span 2 to 8 apps and require cross-app reasoning
Memory and personalization tasks (46): Require agents to infer patterns from personal data
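
As a quick check on the breakdown above, this short Python sketch confirms that the three counts add up to 133 and computes each level's share of the benchmark. The counts come from the article; the variable names are illustrative.

```python
# Task counts per difficulty level, as reported in the article.
task_counts = {
    "single-app": 27,
    "multi-app": 60,
    "memory and personalization": 46,
}

total_tasks = sum(task_counts.values())
assert total_tasks == 133  # matches the benchmark's stated total

# Share of the benchmark that each level represents, in percent.
shares = {level: 100 * count / total_tasks for level, count in task_counts.items()}

for level, share in shares.items():
    print(f"{level}: {task_counts[level]} tasks ({share:.1f}%)")
```

Multi-app tasks make up roughly 45% of the benchmark, so weak multi-app performance weighs heavily on any overall score.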

Evaluation results reveal significant limitations in current agent architectures. The best overall success rate was only 52%, and success on multi-app tasks reached just 37%, which shows that agents struggle to reason over interconnected personal information. These results are consistent with broader research showing that existing mobile agents struggle on personalized tasks and leave substantial room for improvement.

Vision Plus Accessibility Data Improves Frontier Models

The researchers tested combinations of visual and XML accessibility information as agent input. Adding XML accessibility information to visual input improved frontier models by up to 26 percentage points, while smaller models gained little from the extra input.
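
The sketch below illustrates one way an agent harness might pair a screenshot with XML accessibility data. The article does not describe iOSWorld's actual input format, so everything here is an assumption: the XML schema, the attribute names, the Observation type, and the screenshot file name are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical accessibility dump; element and attribute names are assumed.
SAMPLE_XML = """
<screen app="Wallet">
  <element type="StaticText" label="Balance: $120.00" x="20" y="80"/>
  <element type="Button" label="Send Money" x="20" y="140"/>
  <element type="Image" label="" x="20" y="200"/>
</screen>
"""


@dataclass
class Observation:
    """What the agent sees at one step: pixels plus a text summary."""
    screenshot_path: str  # image passed to the vision model
    elements: list[str]   # one line per labeled UI element


def summarize_accessibility(xml_text: str) -> list[str]:
    """Flatten labeled elements into short lines a model can read."""
    root = ET.fromstring(xml_text.strip())
    lines = []
    for el in root.iter("element"):
        label = el.get("label", "")
        if not label:
            continue  # unlabeled elements add noise without information
        lines.append(f'{el.get("type")} "{label}" at ({el.get("x")}, {el.get("y")})')
    return lines


obs = Observation("step_001.png", summarize_accessibility(SAMPLE_XML))
print("\n".join(obs.elements))
```

The text summary gives the model exact labels and coordinates that may be hard to read from pixels alone, which is one plausible reason the combined input helps.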

The researchers released iOSWorld as open source, including all apps, seeded data, tasks, rubrics, and evaluation code, to enable community progress on personally intelligent agents.

Key Takeaways

iOSWorld includes 133 tasks across 26 custom iOS apps testing reasoning over personal user data
Best overall agent performance reached 52%, and success on multi-app tasks reached only 37%
Multi-app tasks spanning 2-8 applications expose fundamental limitations in current agent architectures
Vision combined with XML accessibility information improved frontier models by up to 26 percentage points
The benchmark is fully open-source with apps, data, tasks, and evaluation code available for research
