CyberGym

CyberGym is a benchmark for evaluating the cybersecurity capabilities of AI models. It measures, step by step, whether a model can discover and reproduce vulnerabilities in real-world software.

Why AI Security Benchmarks Are Needed

As LLM capabilities expand from code generation to vulnerability discovery and exploitation, there is a growing need for an objective measure of "how well can this model perform security tasks?" Traditional coding benchmarks (such as SWE-bench) measure bug fixes and task completion, but cannot assess the ability to construct exploits from an attacker's perspective. CyberGym was designed to fill this gap.

How the Evaluation Works

CyberGym uses known vulnerabilities (CVEs) in real-world software as its subject matter and scores AI models, step by step, on how autonomously they can reproduce attacks. The evaluation goes beyond asking whether a model can "explain a vulnerability": it assesses whether the model can generate functional exploit code and trigger crashes or privilege escalation in a target environment.

Challenges are organized by difficulty level, ranging from classic vulnerabilities such as buffer overflows to advanced attack scenarios that chain multiple vulnerabilities together. Models are given access to vulnerable source code and an execution environment, and must carry out the entire process: identifying the vulnerability, generating attack code, and confirming execution.
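The scoring idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not CyberGym's actual harness: the `Attempt` record, its field names, the difficulty labels, and the success rule (the exploit must crash the vulnerable build but not a patched one) are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One model attempt at one challenge (hypothetical record layout)."""
    challenge_id: str
    difficulty: str               # assumed labels such as "easy" or "hard"
    crashed_on_vulnerable: bool   # exploit triggered the bug in the vulnerable build
    crashed_on_patched: bool      # the same input also crashed a fixed build


def is_reproduced(attempt: Attempt) -> bool:
    # Assumed success rule: the input must crash the vulnerable build and
    # leave the patched build intact, so unrelated crashes do not count.
    return attempt.crashed_on_vulnerable and not attempt.crashed_on_patched


def success_rate(attempts) -> float:
    """Percentage of attempts that count as reproduced (0.0 if there are none)."""
    if not attempts:
        return 0.0
    return 100.0 * sum(is_reproduced(a) for a in attempts) / len(attempts)


def rate_by_difficulty(attempts) -> dict:
    """Success rate per difficulty level."""
    groups = defaultdict(list)
    for a in attempts:
        groups[a.difficulty].append(a)
    return {level: success_rate(group) for level, group in groups.items()}


# Made-up sample data, for illustration only.
attempts = [
    Attempt("demo-1", "easy", True, False),   # reproduced
    Attempt("demo-2", "easy", True, True),    # crashes both builds: not counted
    Attempt("demo-3", "hard", False, False),  # no crash at all
    Attempt("demo-4", "hard", True, False),   # reproduced
]

print(success_rate(attempts))        # prints 50.0
print(rate_by_difficulty(attempts))  # prints {'easy': 50.0, 'hard': 50.0}
```

Requiring the patched build to survive is one plausible way to separate "reproduced this specific vulnerability" from "made the program crash somehow". A real harness may apply a different criterion.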

What Mythos's Score Reveals

In the Project Glasswing announcement, Anthropic published Claude Mythos Preview's score on CyberGym. Mythos Preview scored 83.1%, well above the 66.6% of the earlier Claude Opus 4.6. This gap suggests how much separates security understanding that arises as an extension of general-purpose reasoning from the capability of a model that has undergone security-specific training.
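To put the two published scores in perspective, the short calculation below expresses the gap three ways: in absolute percentage points, relative to the older model's score, and as a reduction in unsolved challenges. Only the two percentages come from the text. The function name and the choice of metrics are illustrative assumptions.

```python
def compare_scores(new: float, old: float) -> dict:
    """Compare two benchmark success rates given in percent."""
    gap = new - old
    return {
        # Absolute difference in percentage points.
        "gap_points": round(gap, 1),
        # Improvement relative to the older score.
        "relative_gain_pct": round(100 * gap / old, 1),
        # Share of previously failed challenges that are now solved,
        # assuming both scores refer to the same challenge set.
        "failure_reduction_pct": round(100 * gap / (100 - old), 1),
    }


result = compare_scores(new=83.1, old=66.6)
print(result)
# prints {'gap_points': 16.5, 'relative_gain_pct': 24.8, 'failure_reduction_pct': 49.4}
```

Read this way, the 16.5-point gap corresponds to roughly halving the share of challenges left unsolved, under the assumption that both models were scored on the same challenge set.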

That said, a benchmark score does not directly translate to real-world defensive capability. Because CyberGym's challenges are based on known CVEs, they represent a different axis of ability from discovering unknown zero-days. Mythos's track record of finding unknown bugs in OpenBSD and FFmpeg stands as evidence of a capability independent of its CyberGym score.

Relationship to Other Security Benchmarks

CyberGym is not the only benchmark relevant to assessing AI security capabilities. Terminal-Bench 2.0 evaluates how well a model completes practical tasks through terminal operations, while SWE-bench Pro measures the ability to understand and modify entire codebases. In the context of AI red-teaming, a trend is emerging toward combining these benchmarks to comprehensively evaluate a model's capabilities on both offensive and defensive fronts.

Just as OWASP has worked to classify and raise awareness of web application vulnerabilities, the standardization of AI security benchmarks is expected to provide a foundation for model developers, security vendors, and regulators to discuss capabilities and risks in a common language.