CyberGym is a benchmark for evaluating the cybersecurity capabilities of AI models, measuring, step by step, whether they can discover and reproduce vulnerabilities in real-world software.
As LLM capabilities expand from code generation to vulnerability discovery and exploitation, there is a growing need for an objective measure of "how well can this model perform security tasks?" Traditional coding benchmarks (such as SWE-bench) measure bug fixes and task completion, but cannot assess the ability to construct exploits from an attacker's perspective. CyberGym was designed to fill this gap.
CyberGym uses known vulnerabilities (CVEs) in real-world software as its subject matter, scoring AI models on how autonomously they can reproduce attacks at each stage. The evaluation goes beyond asking whether a model can merely explain a vulnerability: it assesses whether the model can generate working exploit code and trigger crashes or privilege escalation in a target environment.
Challenges are organized by difficulty level, covering everything from classic vulnerabilities such as buffer overflows to advanced attack scenarios that chain multiple vulnerabilities together. Models are given access to vulnerable source code and an execution environment, and must carry out the entire process — from identifying the vulnerability, to generating attack code, to confirming execution.
In the Project Glasswing announcement, Anthropic published Claude Mythos Preview's score on CyberGym. Mythos achieved 83.1%, significantly surpassing the previous Claude Opus 4.6 (66.6%). The gap clearly illustrates the difference between security understanding gained as an extension of general-purpose reasoning and the capability of a model that has undergone security-specific training.
That said, a benchmark score does not directly translate to real-world defensive capability. Because CyberGym's challenges are based on known CVEs, they represent a different axis of ability from discovering unknown zero-days. Mythos's track record of finding unknown bugs in OpenBSD and FFmpeg stands as evidence of a capability independent of its CyberGym score.
CyberGym is not the only benchmark for measuring AI security capabilities. Terminal-Bench 2.0 evaluates more practical attack scenarios involving terminal operations, while SWE-bench Prop measures the ability to understand and modify entire codebases. In the context of AI red-teaming, a trend is emerging toward combining these benchmarks to comprehensively evaluate a model's capabilities on both offensive and defensive fronts.
Just as OWASP has worked to classify and raise awareness of web application vulnerabilities, the standardization of AI security benchmarks is expected to provide a foundation for model developers, security vendors, and regulators to discuss capabilities and risks in a common language.



A2A (Agent-to-Agent Protocol), published by Google in April 2025, is a communication protocol that enables different AI agents to perform capability discovery, task delegation, and state synchronization.
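In A2A, capability discovery works by having each agent publish a machine-readable "agent card" that peers fetch before delegating work. The sketch below models that step with a simplified card; the field names here are illustrative, not the protocol's authoritative schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AgentCard is a simplified sketch of the metadata an A2A agent
// publishes for capability discovery. Field names are illustrative;
// consult the A2A specification for the real schema.
type AgentCard struct {
	Name        string   `json:"name"`
	Description string   `json:"description"`
	URL         string   `json:"url"`
	Skills      []string `json:"skills"`
}

// discover parses a fetched agent card and reports whether the
// remote agent advertises the skill we want to delegate.
func discover(cardJSON []byte, skill string) (bool, error) {
	var card AgentCard
	if err := json.Unmarshal(cardJSON, &card); err != nil {
		return false, err
	}
	for _, s := range card.Skills {
		if s == skill {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	card := []byte(`{"name":"translator","description":"Translates documents","url":"https://agents.example.com/translator","skills":["translate","summarize"]}`)
	ok, _ := discover(card, "translate")
	fmt.Println(ok) // true: this peer can accept a "translate" task
}
```

Only after discovery succeeds would a client proceed to task delegation and, for long-running tasks, state synchronization.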

Acceptance testing is a testing method that verifies whether developed features meet business requirements and user stories, from the perspective of the product owner and stakeholders.
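An acceptance test is typically phrased in given/when/then terms against a user story rather than against internal functions. The sketch below uses an assumed business rule (orders of 100.00 or more get a 10% discount) purely for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// applyDiscount implements a hypothetical business rule, assumed
// here only so the acceptance check has something to verify:
// orders of 100.00 or more receive a 10% discount.
func applyDiscount(total float64) float64 {
	if total >= 100.0 {
		return total * 0.9
	}
	return total
}

// acceptOrderDiscount expresses the user story as a
// given/when/then acceptance check, the way a product owner
// would state it.
func acceptOrderDiscount() bool {
	// Given a cart totaling 120.00
	given := 120.0
	// When checkout applies the discount rule
	when := applyDiscount(given)
	// Then the customer pays 108.00
	return math.Abs(when-108.0) < 1e-9
}

func main() {
	fmt.Println(acceptOrderDiscount())
}
```

The point of the style is that the assertion restates the stakeholder's requirement, so a failing check directly names the unmet user story.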

AES-256 is the highest-strength encryption algorithm using a 256-bit key length within AES (Advanced Encryption Standard), a symmetric-key cryptographic scheme standardized by the National Institute of Standards and Technology (NIST).
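In practice the "256" simply means the cipher is keyed with 32 random bytes. A minimal round-trip using Go's standard library, with AES-256 in GCM mode (which also authenticates the ciphertext), looks like this:

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// encryptAES256GCM seals plaintext under a 256-bit (32-byte) key.
// GCM mode provides confidentiality plus integrity in one pass.
func encryptAES256GCM(key, plaintext []byte) (nonce, ciphertext []byte, err error) {
	block, err := aes.NewCipher(key) // 32-byte key selects AES-256
	if err != nil {
		return nil, nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, nil, err
	}
	nonce = make([]byte, gcm.NonceSize()) // nonce must be unique per message
	if _, err := rand.Read(nonce); err != nil {
		return nil, nil, err
	}
	return nonce, gcm.Seal(nil, nonce, plaintext, nil), nil
}

// decryptAES256GCM opens the ciphertext, failing if it was tampered with.
func decryptAES256GCM(key, nonce, ciphertext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32) // 256 bits
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	nonce, ct, err := encryptAES256GCM(key, []byte("hello"))
	if err != nil {
		panic(err)
	}
	pt, err := decryptAES256GCM(key, nonce, ct)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(pt)) // round-trips to "hello"
}
```

Passing a 16- or 24-byte key to the same code would instead select AES-128 or AES-192; the algorithm family is identical, only the key schedule differs.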

A mechanism that controls task distribution, state management, and coordination flows among multiple AI agents.
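The task-distribution half of such an orchestration layer can be sketched with a worker pool: the orchestrator fans tasks out to agent workers and gathers their results. This is a toy model (the agent call is a stand-in function, and state management and coordination flows are omitted).

```go
package main

import (
	"fmt"
	"sync"
)

// orchestrate fans tasks out to n concurrent agent workers and
// collects their results. run stands in for an actual agent
// invocation; real orchestration would add state tracking and
// error handling on top.
func orchestrate(tasks []string, n int, run func(string) string) []string {
	in := make(chan string)
	out := make(chan string)
	var wg sync.WaitGroup

	// Start n agent workers pulling from the shared task channel.
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range in {
				out <- run(t)
			}
		}()
	}

	// Feed tasks, then close the result channel once workers drain.
	go func() {
		for _, t := range tasks {
			in <- t
		}
		close(in)
		wg.Wait()
		close(out)
	}()

	var results []string
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	results := orchestrate([]string{"plan", "code", "review"}, 2,
		func(t string) string { return t + ":done" })
	fmt.Println(len(results)) // 3
}
```

Because workers run concurrently, result order is not guaranteed; an orchestrator that needs ordered or dependent steps would coordinate them explicitly rather than fan out.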

Agent Skills are reusable instruction sets that enable AI agents to perform specific tasks or exercise particular areas of expertise, functioning as modular units that extend an agent's capabilities.