📚 [GitHub Repository] | 📝 [Paper]
We have experimented and developed a defensive system for AI agents to protect themselves against adversarial prompt attacks and jailbreak attempts.
This repository implements self-defending mechanisms for AI systems, enabling them to:
- Detect and prevent prompt injection attacks
- Maintain operational boundaries
- Filter malicious instructions
- Preserve their core directives
- Self-validate responses
- Automated prompt threat detection
- Self-validation mechanisms
- Response integrity checks
- Boundary enforcement
- Defensive prompt patterns
The system actively monitors and defends against attempts to:
- Override core instructions
- Bypass safety measures
- Manipulate system behavior
- Inject malicious prompts
For implementation details, check the examples in /experiments
directory.
This project is licensed under the MIT License. See the LICENSE file for details.
If you use this work in your research, please cite:
@article{barua2025guardians,
title={Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System},
author={Barua, Saikat and Rahman, Mostafizur and Sadek, Md Jafor and Islam, Rafiul and Khaled, Shehnaz and Kabir, Ahmedul},
journal={arXiv preprint arXiv:2502.16750},
year={2025}
}
This research builds upon work and AI safety. Special thanks to:
- Contributors and researchers in the AI safety community
- Open-source AI/ML security projects