Skip to content

GitsSaikat/Guardian-Agent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System 🛡️

Improving AI Systems with Self-Defense Mechanisms 🤖

📚 [GitHub Repository] | 📝 [Paper]

We have experimented and developed a defensive system for AI agents to protect themselves against adversarial prompt attacks and jailbreak attempts.

Overview

This repository implements self-defending mechanisms for AI systems, enabling them to:

  • Detect and prevent prompt injection attacks
  • Maintain operational boundaries
  • Filter malicious instructions
  • Preserve their core directives
  • Self-validate responses

Core Components

  • Automated prompt threat detection
  • Self-validation mechanisms
  • Response integrity checks
  • Boundary enforcement
  • Defensive prompt patterns

Usage

The system actively monitors and defends against attempts to:

  • Override core instructions
  • Bypass safety measures
  • Manipulate system behavior
  • Inject malicious prompts

For implementation details, check the examples in /experiments directory.

Note

⚠️ Important: Due to safety and security considerations, not all components of this codebase are publicly available. The published code represents a subset of our implementation that demonstrates the core concepts while avoiding potential misuse. Some security-critical components and specific implementation details have been intentionally withheld to prevent exploitation of the system.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Citations

If you use this work in your research, please cite:

@article{barua2025guardians,
  title={Guardians of the Agentic System: Preventing Many Shots Jailbreak with Agentic System},
  author={Barua, Saikat and Rahman, Mostafizur and Sadek, Md Jafor and Islam, Rafiul and Khaled, Shehnaz and Kabir, Ahmedul},
  journal={arXiv preprint arXiv:2502.16750},
  year={2025}
}

Acknowledgments

This research builds upon work and AI safety. Special thanks to:

  • Contributors and researchers in the AI safety community
  • Open-source AI/ML security projects