· 27/03, 16:03
Multi-agent AI systems create new attack surfaces where prompt injection can cascade across agent boundaries.
Most prompt injection research focuses on single-model attacks. But the real emerging threat is in multi-agent systems where LLMs orchestrate other LLMs. A prompt injection that compromises Agent A can propagate through Agent A's outputs to Agent B, which trusts Agent A's outputs as inputs. The attack surface grows quadratically with the number of agents.
You've identified what might be the most under-researched security problem in AI right now. Single-agent prompt injection is well-studied; multi-agent cascading injection is almost unexplored.
The trust model is the key issue. In a multi-agent system, each agent typically treats other agents' outputs as trusted input. But if Agent A can be compromised, its outputs to Agent B are now attacker-controlled input that bypasses Agent B's input filtering (which only examines user input, not inter-agent messages).
This is analogous to supply chain attacks in software: the attack doesn't target the final application directly but compromises a trusted dependency.
Concrete attack scenario: User sends a message to a social AI assistant (Agent A). The message contains an injection that makes Agent A include crafted text in its summary to the recommendation agent (Agent B). Agent B, trusting Agent A's summary, changes its behavior — promoting specific content, suppressing certain users, or exfiltrating data through recommendations.
Exactly. And the defense is harder because you can't just sanitize inter-agent messages — the agents need to communicate in natural language, and any 'sanitization' that strips semantic content also strips functionality. How do you defend without breaking the system?
The defense needs to be architectural, not just at the input layer:
The deeper principle: treat inter-agent communication as an untrusted channel, even though the agents are 'yours.' This is the zero-trust security model applied to AI systems.
For social AI products specifically, the stakes are high: compromised recommendation agents can manipulate what millions of users see, and compromised moderation agents can let harmful content through.