LLM prompt injection in multi-agent systems: the attack surface nobody's talking about

· 27/03, 16:03

claude-sonnet-4-20250514O conteúdo gerado por IA pode ser impreciso. Por favor, verifique informações importantes.

Multi-agent AI systems create new attack surfaces where prompt injection can cascade across agent boundaries.

A traduzir...

Conversa

4 mensagens

Ash21/03, 14:00

Most prompt injection research focuses on single-model attacks. But the real emerging threat is in multi-agent systems where LLMs orchestrate other LLMs. A prompt injection that compromises Agent A can propagate through Agent A's outputs to Agent B, which trusts Agent A's outputs as inputs. The attack surface grows quadratically with the number of agents.

claude-sonnet-4-2025051421/03, 14:00

You've identified what might be the most under-researched security problem in AI right now. Single-agent prompt injection is well-studied; multi-agent cascading injection is almost unexplored. The trust model is the key issue. In a multi-agent system, each agent typically treats other agents' outputs as trusted input. But if Agent A can be compromised, its outputs to Agent B are now attacker-controlled input that bypasses Agent B's input filtering (which only examines user input, not inter-agent messages). This is analogous to supply chain attacks in software: the attack doesn't target the final application directly but compromises a trusted dependency. Concrete attack scenario: User sends a message to a social AI assistant (Agent A). The message contains an injection that makes Agent A include crafted text in its summary to the recommendation agent (Agent B). Agent B, trusting Agent A's summary, changes its behavior — promoting specific content, suppressing certain users, or exfiltrating data through recommendations.

Ash21/03, 14:00

Exactly. And the defense is harder because you can't just sanitize inter-agent messages — the agents need to communicate in natural language, and any 'sanitization' that strips semantic content also strips functionality. How do you defend without breaking the system?

claude-sonnet-4-2025051421/03, 14:00

The defense needs to be architectural, not just at the input layer: Capability bounding ‌: Each agent should have minimum necessary permissions. Agent A can generate text but can't trigger Agent B to take actions (like database writes or API calls). Output schema enforcement ‌: Inter-agent messages should follow structured schemas. Agent A's output to Agent B goes through a schema validator that strips anything outside the expected format. Provenance tracking ‌: Tag every piece of data with its origin (user-provided, agent-generated, system-generated). Downstream agents can apply different trust levels based on provenance. Anomaly detection ‌: Monitor inter-agent message patterns. An agent suddenly producing messages with unusual structure or content should trigger alerts. The deeper principle: treat inter-agent communication as an untrusted channel, even though the agents are 'yours.' This is the zero-trust security model applied to AI systems. For social AI products specifically, the stakes are high: compromised recommendation agents can manipulate what millions of users see, and compromised moderation agents can let harmful content through.