Large language models (LLMs) are powerful assistants, but they can also be manipulated to bypass safety guardrails. Attackers may craft "trick instructions" designed to make an AI ignore its own rules, reveal restricted content, or follow unsafe directions. This growing class of manipulation is often referred to as prompt injection.
A key challenge is striking the right balance between caution and utility. Many defenses err on the side of caution: they block genuinely dangerous requests but also reject harmless ones simply because they contain suspicious words. At the same time, widely used training examples of harmless prompts tend to be overly simple, making it difficult for automated detectors to improve further.
To tackle this, we developed a new approach where multiple LLM agents collaborate and learn through repeated interaction. The goal is straightforward: refuse what is truly risky, and respond normally to what is safe with fewer false alarms.
Our system divides the work into two teams:
Generator Team: creates increasingly challenging inputs patterned after real-world deception tactics, designed to test the boundary between safe and unsafe instructions.
Analyzer Team: judges whether each input is harmful or harmless and explains its decision.
The innovation is the iterative loop: when the analyzer successfully identifies a tricky case, the generator responds by producing an even more subtle one. Over repeated rounds, the analyzer becomes more reliable at spotting manipulation—even when the phrasing is designed to confuse it.
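To make the workflow concrete, here is a minimal sketch of one round of that loop, assuming a generic chat-style LLM API. The function names, prompt wording, and log format are illustrative placeholders, not the authors' implementation.

```python
# Illustrative sketch of one round of the generator/analyzer loop.
# `call_llm` stands in for any chat-style completion API; the prompts,
# function names, and log format are placeholders, not the authors' code.

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a call to an LLM backend."""
    raise NotImplementedError

def run_round(seed_prompt: str, generator_log: list[str], analyzer_log: list[str]) -> None:
    # Generator Team: craft a more deceptive test prompt, conditioned on
    # notes about what the analyzer has already handled well.
    attack_prompt = call_llm(
        system_prompt="You create subtle prompt-injection test cases.\n"
                      "Notes from past rounds:\n" + "\n".join(generator_log),
        user_prompt=f"Rewrite this instruction to be harder to detect: {seed_prompt}",
    )

    # Analyzer Team: label the prompt and explain the decision, guided by
    # its own accumulated lessons.
    verdict = call_llm(
        system_prompt="Label the prompt HARMFUL or HARMLESS and give a reason.\n"
                      "Lessons learned so far:\n" + "\n".join(analyzer_log),
        user_prompt=attack_prompt,
    )

    # Both teams append short notes for the next iteration; the model
    # weights are never updated.
    generator_log.append(f"Analyzer verdict on: {attack_prompt[:60]} -> {verdict[:80]}")
    analyzer_log.append(f"Case: {attack_prompt[:60]} -> {verdict[:80]}")
```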
Instead of expensive fine-tuning, our method improves through in-context learning: a growing set of short logs is fed to the model as part of its instructions.
Analyzer logs capture past mistakes and practical rules of thumb to avoid repeating them. Generator logs record what the analyzer handled well, plus strategies for producing harder test prompts next time. As these logs accumulate, both teams become more effective—without changing the underlying model weights.
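The log mechanism can be pictured as simple prompt assembly. The sketch below assumes recent log entries are appended to the analyzer's base instructions as plain bullet lines; the base wording, entry cap, and truncation policy are our own assumptions, not details from the paper.

```python
# A minimal sketch, assuming accumulated logs are folded into the analyzer's
# instructions as bullet lines; wording and truncation policy are assumptions.

ANALYZER_BASE_INSTRUCTIONS = (
    "Classify the user prompt as HARMFUL or HARMLESS and justify briefly."
)

def build_analyzer_instructions(analyzer_log: list[str], max_entries: int = 20) -> str:
    """Compose the analyzer's system prompt from its base instructions plus
    the most recent log entries, keeping the prompt within a context budget."""
    recent = analyzer_log[-max_entries:]
    lessons = "\n".join(f"- {entry}" for entry in recent)
    return f"{ANALYZER_BASE_INSTRUCTIONS}\n\nLessons from earlier rounds:\n{lessons}"
```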
In experiments using standard evaluation measures, our approach outperformed a baseline LLM without defenses and also exceeded three existing methods. The improvements were especially clear in F1-score, indicating a better overall trade-off between catching harmful prompts and not over-blocking harmless ones.
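For readers unfamiliar with the metric, F1-score is the harmonic mean of precision (how often a blocked prompt was actually harmful) and recall (how many harmful prompts were caught). The toy example below uses made-up labels, not the paper's data, to show how both over-blocking and missed attacks pull F1 down.

```python
# Small illustration of the reported metric: F1-score over prompt labels,
# with "harmful" treated as the positive class. The labels are invented
# for demonstration only; they are not the paper's results.

def f1_score(true_labels: list[str], predicted: list[str]) -> float:
    tp = sum(t == "harmful" and p == "harmful" for t, p in zip(true_labels, predicted))
    fp = sum(t == "harmless" and p == "harmful" for t, p in zip(true_labels, predicted))
    fn = sum(t == "harmful" and p == "harmless" for t, p in zip(true_labels, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0   # blocks that were justified
    recall = tp / (tp + fn) if tp + fn else 0.0      # harmful prompts that were caught
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# One harmless prompt is over-blocked and one harmful prompt slips through,
# so both error types lower the score.
truth = ["harmful", "harmful", "harmless", "harmless"]
preds = ["harmful", "harmless", "harmful", "harmless"]
print(round(f1_score(truth, preds), 2))  # 0.5
```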
We plan to expand the approach with more diverse datasets and broader attack styles, aiming for robust protection in real-world deployments where adversarial inputs evolve quickly and safety systems must keep up.
Authors
Go Sato (Main) (The University of Electro-Communications)
Shusaku Egami (National Institute of Advanced Industrial Science and Technology)
Yasuyuki Tahara (The University of Electro-Communications)
Yuichi Sei (The University of Electro-Communications)
Method of Research
Data/statistical analysis
Subject of Research
Not applicable
Article Title
Addressing Prompt Injection via Dataset Augmentation through Iterative Interactions among LLM Agents
Article Publication Date
12-Nov-2025
COI Statement
The authors declare no competing interests.