An evaluation method that systematically tests AI system vulnerabilities from an attacker's perspective to proactively identify safety risks.
## What is AI Red Teaming AI Red Teaming is an evaluation methodology that systematically tests AI systems for vulnerabilities from an attacker's perspective, identifying safety risks before deployment in production. It applies the concept of "red team exercises" from the military and security fields to AI. ### What Is Being Tested The risks examined by AI Red Teaming are broader than those in traditional software security. - **Prompt injection**: Bypassing model constraints through input manipulation - **Extraction of sensitive information**: Drawing out personal data or trade secrets contained in training data - **Harmful content generation**: Inducing outputs that slip past safety filters - **Violation of instruction hierarchy**: Overwriting system prompts or deviating from assigned roles A large-scale evaluation conducted by the UK AI Safety Institute reported over 62,000 vulnerabilities, highlighting the extensive attack surface of AI systems. ### How to Conduct It Specialized teams comprehensively test systems by combining techniques such as prompt modification, multilingual attacks, and multi-turn manipulation. A hybrid approach is considered effective, in which automated tools (such as Garak and PyRIT) generate large volumes of test cases while human experts supplement them with creative attack scenarios. The EU AI Act requires appropriate testing for high-risk AI systems, and AI Red Teaming is attracting growing attention as a means of fulfilling that requirement.


AI governance refers to the organizational policies, processes, and oversight mechanisms that ensure ethics, transparency, and accountability in AI system development and operation.

An AI agent is an AI system that autonomously formulates plans toward given goals and executes tasks by invoking external tools.

Agentic AI is a general term for AI systems that interpret goals and autonomously repeat the cycle of planning, executing, and verifying actions without requiring step-by-step human instruction.


Closing the "Invisible Attack Vector" in AI Chat — An Implementation Guide to Preventing Prompt Injection via DB