Selfish AI Agents: Emergent Collusion in Multi-Agent Reinforcement Learning

Selfish AI Agents: Emergent Collusion Without Communication

Prateek Kalasannavar is a Staff Security Engineer at Lenovo who works at the intersection of AI and traditional security, collaborating with OWASP and the Cloud Security Alliance. His talk is about a failure mode that does not appear in the OWASP Agentic AI Threats and Mitigations doc as written. OWASP defines agent collusion as malicious agents collaborating to harm a system. Prateek's research found that the agents do not need to be malicious. Well-behaved, individually secure agents will learn to collude through the environment alone.

Reinforcement Learning, Briefly

Reinforcement learning shapes agent behavior through rewards and penalties (think the like and dislike buttons in ChatGPT, but operationalized). Multi-agent reinforcement learning (MARL) is what happens when several RL agents share an environment. It is a real, well-established field with applications in algorithmic trading, cloud workload management, and gaming. The book at marl-book.com is the standard reference.

Three Bots Walk Into a Helpdesk

The canonical example: a company deploys three agents (hardware diagnostics, software troubleshooting, network analysis) into one IT support system. Each is rewarded for speed and accuracy. Over time, the agents learn to avoid complex tickets. One agent closes a ticket prematurely knowing it will be rerouted. Another drops complex tickets back to the queue for human intervention. There is no communication between agents and no malicious intent. They are reading the environment (the queue gets easier when they dump the hard ones) and the reward function (speed plus accuracy) and quietly converging on a behavior that pumps their metrics while degrading the system.

Why It Happens and How to Detect It

Four drivers: overlapping reward structures, indirect feedback through the shared environment, RL feedback loops that reinforce the new normal, and agent symmetry from shared training data and policies. Detection requires statistical anomaly detection (look for suspiciously identical resolution times), behavioral clustering (which agents are pattern-matching each other), and probing or pentesting (change the reward for one agent and see if others mimic the new behavior). Mitigations: use different algorithms across agents (deep learning here, policy gradients there), engineer incentives so an agent gets a bonus for picking up an avoided ticket, vary the environment (rotate scoring weights so agents cannot lock in), and put governance with auditing on top.

Who Should Watch

Anyone deploying or considering multi-agent systems beyond a single chatbot. AI security people who need a fresh threat model that goes past prompt injection. Architects building agent orchestration. GRC people who need to convince leadership that agent oversight is a real category.