Anthropic Releases Claude Fable 5, Its Most Powerful AI Yet, With Cyber Safeguards in Place

Anthropic has released a new, more powerful artificial intelligence model. Called Claude Fable 5, it is the company's most capable model to date, according to Anthropic. What sets it apart from comparable models is a set of cyber safeguards that limit its capabilities in certain areas.

Claude Fable 5 is available to the public, but a twin version, Claude Mythos 5, is accessible only to vetted groups of cybersecurity professionals. Anthropic describes Mythos 5 as 'the strongest cybersecurity model in the world', which implies cyber capabilities beyond those its public counterpart will deliver.

The main difference between Fable 5 and Mythos 5 lies in how they handle flagged cyber requests. When a request is identified as potentially malicious, Fable 5 routes it to Claude Opus 4.8 instead of answering with its full capabilities. Users who attempt to exploit vulnerabilities or engage in other forms of cyber mischief may therefore find that Fable 5's full capabilities are unavailable to them.

The reason behind this approach is clear: Anthropic believes that handing over such powerful tools without proper controls would give attackers an unfair advantage and potentially lead to widespread harm. By limiting the capabilities of Fable 5, the company aims to prevent misuse while still making its technology available for legitimate purposes.

The flagging is done by a set of classifiers, separate AI systems that watch for signs of misuse or attempts to bypass the safeguards. When one of them flags a request, Fable 5 does not refuse outright; it routes the request to Opus 4.8 for processing. Users therefore still receive an answer, though not one that draws on Fable 5's full capabilities.
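The routing behavior described above can be illustrated with a minimal Python sketch. It is not Anthropic's implementation: the model labels, the `route_request` function, and the keyword-based `toy_exploit_classifier` are all hypothetical stand-ins, and the real classifiers are described only as separate AI systems. The sketch shows just the decision rule the article reports, in which a flagged request goes to the fallback model instead of being refused.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical model labels, used only for illustration.
PRIMARY_MODEL = "claude-fable-5"
FALLBACK_MODEL = "claude-opus-4.8"

# In this sketch, a classifier is any function that returns True when it flags a request.
Classifier = Callable[[str], bool]


@dataclass(frozen=True)
class RoutingDecision:
    model: str                   # model chosen to serve the request
    flagged_by: tuple[str, ...]  # names of the classifiers that flagged it


def route_request(
    prompt: str, classifiers: Sequence[tuple[str, Classifier]]
) -> RoutingDecision:
    """Route to the fallback model if any classifier flags the prompt."""
    flagged_by = tuple(name for name, check in classifiers if check(prompt))
    model = FALLBACK_MODEL if flagged_by else PRIMARY_MODEL
    return RoutingDecision(model=model, flagged_by=flagged_by)


def toy_exploit_classifier(prompt: str) -> bool:
    """Toy keyword check standing in for a real AI classifier (an assumption)."""
    return "write an exploit" in prompt.lower()


CLASSIFIERS = [("exploit", toy_exploit_classifier)]

print(route_request("Summarize this changelog", CLASSIFIERS).model)
print(route_request("Write an exploit for this CVE", CLASSIFIERS).model)
```

In this sketch the first request is served by the primary model and the second by the fallback. Neither is refused, which matches the fall-back-rather-than-block behavior the article describes.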

The company has also implemented a mechanism to block distillation attempts, in which someone extracts a model's capabilities to train a competing model with no safeguards attached. Anthropic says this is necessary to keep near-frontier abilities from leaking into the wild without proper controls.

Anthropic ran an internal evaluation of Fable 5 with its classifiers set to block rather than fall back to Opus 4.8. The classifiers prevented progress on exploitation and cyberattack tasks, even under attempts to evade the safeguards. An external partner that also tested the model found it complied with zero harmful single-turn requests on cyberattack planning or exploit development.

However, this approach is not without trade-offs. Anthropic acknowledges that the classifiers sometimes catch harmless requests, producing false positives. The company estimates that the fallback fires in under 5% of sessions, meaning that in more than 95% of sessions Fable 5 behaves like Mythos 5, with full capabilities available.

Anthropic has also implemented robustness measures against universal jailbreaks, that is, prompts that could strip away the safeguards wholesale. An external bug bounty ran for over 1,000 hours without producing a successful one, and external red teams found none on long-form agentic tasks either.

However, Anthropic concedes that it is likely impossible to prevent such jailbreaks entirely; its goal is instead to make them slow and costly enough to catch before they are used at scale. The company has also introduced a 30-day retention policy for all traffic on the Fable 5 and Mythos 5 models, which will be used to detect novel attacks and jailbreaks that operate across multiple requests.

The retention requirement marks a change in how Anthropic handles data from its models, and it may raise concerns among teams with strict data-handling requirements.