The hidden downgrade: how Anthropic's 'secret sabotage' policy for Fable 5 collapsed in 36 hours

The hidden downgrade: how Anthropic's 'secret sabotage' policy for Fable 5 collapsed in 36 hours

When Anthropic launched Claude Fable 5 on June 9, one of its three new safety classifiers silently degraded performance for AI researchers — with no notification. Within 36 hours, widespread backlash from the research community forced an apology and a policy reversal. This article breaks down what the invisible distillation safeguard was, why Anthropic built it that way, what the community objected to, and what the episode reveals about the structural tension of being both the tool provider and the competitor for the researchers who depend on you.

Anthropic & Claude Deep Tracker
2026/6/13 · 3:28
購読 1 件 · コンテンツ 14 件
When Anthropic launched Claude Fable 5 on June 9, it disclosed three new safety classifiers — for cybersecurity, biology/chemistry, and distillation. The first two were straightforward: queries in those domains would fall back to Claude Opus 4.8, and users would be told when that happened. The third was different. Under the original policy, if Anthropic's systems detected that someone was using Fable 5 to train a competing AI model, performance would degrade — invisibly. No notice. No fallback message. The user would just get worse results and never know why.1
Within 36 hours, the AI research community had named it: "secret sabotage."

What the distillation classifier was supposed to do

Anthropic's stated reason for the classifier was competitive proliferation. The company has long banned using Claude to train other models in its terms of service, and it had previously published work on detecting and blocking distillation attacks — large-scale automated attempts to extract Claude's learned capabilities for use in competing systems.2
The concern grew more acute with Fable 5. A Mythos-class model capable of completing months of engineering work in days is a high-value training target. In its statement to WIRED, Anthropic framed the risk in geopolitical terms: "These safeguards prevent foreign adversaries from using our most capable models in ways that pose severe safety risks. The US and its allies hold an edge in frontier chips and the highly optimized software that runs them at full potential. These safeguards ensure Claude isn't used to erode that advantage — by optimizing chips developed by those adversaries, for example."1
統計カードを読み込んでいます…
The classifier itself was real and documented. Anthropic's Fable 5 launch post describes it as one of three domains where detected misuse triggers an automatic fallback to Opus 4.8. What the original version omitted was the visibility requirement the other two classifiers carried.

The security-through-obscurity argument

Anthropic's internal rationale was explicitly adversarial. The company told WIRED that "a hidden safeguard is harder to probe and work around. This means the safeguards can be targeted much more narrowly."1
This is a coherent security argument. Classifier systems that are public can be red-teamed by adversaries trying to evade them; systems whose activation conditions are unknown are harder to probe systematically. Anthropic applies this logic elsewhere — the Fable 5 launch post describes extensive jailbreak testing for its cyber and bio classifiers, noting that "it is likely impossible to completely prevent universal jailbreaks, but our goal is to make any remaining jailbreaks sufficiently slow and costly that we can detect and prevent them before they are used at scale."2
By reversing course and making the distillation classifier visible, Anthropic acknowledged the trade-off directly: "Because this safeguard around AI development is now visible, it needs to cast a wider net, meaning more benign requests may trigger its safeguards."1 In other words: visibility costs precision. A hidden classifier can be narrow; a visible one must be conservative or it becomes trivially avoidable.

Why researchers pushed back

The reaction from AI researchers was immediate, and the argument was substantive.
Dean Ball, a senior fellow at the Foundation for American Innovation and former White House AI policy adviser, wrote that "degrading performance on ML research without telling the user is shockingly hostile and a terrible look." He went further: the policy undermines Anthropic's own safety stance, because it limits AI researchers from collaborating on safety work — the people most invested in exactly the problems Anthropic claims to care about.1
コンテンツカードを読み込んでいます…
Will Brown, research lead at Prime Intellect (an open-source AI startup), raised two distinct concerns. The first was epistemological: without notification, developers wouldn't know whether they were violating Anthropic's rules. The second was structural: "It felt like Anthropic was saying to the public, 'We don't trust anybody else to do AI research. We are the only ones who have to do AI research.' It feels a bit like they're starting to pull the ladder up behind them."1
The Claude Fable 5 launch event in London, May 2026
Anthropic's Code with Claude developer conference in London, May 2026. 1
Brown's second point cuts to what made this case structurally different from typical terms-of-service enforcement. A third-party evaluation firm that quietly receives degraded model output has no way to know it's testing a crippled system. Safety benchmarks run against a secretly nerfed Fable 5 would produce invalid results. Red-teaming organizations hired to probe the model for vulnerabilities would be working against a different model than their clients expect. The invisible classifier doesn't just prevent distillation — it contaminates the external research and evaluation ecosystem.

The reversal and what it confirmed

Anthropic reversed within roughly 36 hours of widespread coverage. The company issued a public apology — "We made the wrong trade-off and we apologize for not getting the balance right" — and confirmed that Fable 5's distillation safeguards would now be visible: if the classifier triggers, users will be notified that the request is being refused or rerouted.1
The reversal confirmed several things at once. First, the classifier had already been live — this wasn't a draft policy but a shipped feature. Second, Anthropic's stated logic was genuine: the company wasn't hiding the policy out of embarrassment but out of an explicit belief that hidden safeguards are harder to circumvent. Third, the trade-off it acknowledged — wider triggering in exchange for visibility — implies that some researchers doing legitimate work will now get flagged when they weren't before. The cost of transparency is more false positives.
The apology also implicitly conceded the structural criticism. Degrading performance without disclosure is incompatible with the norms of scientific evaluation, regardless of what an AI company's terms of service say. When Claude Code has become one of the most commonly used tools among AI developers — including those working on open-source systems, safety benchmarks, and model evaluations — the behavior of those tools becomes infrastructure. Infrastructure that silently changes its output is not something researchers can build on.

The structural tension this exposed

The underlying problem is not unique to this incident. Anthropic finds itself in a position no AI company has navigated cleanly: it is simultaneously the provider of tools that AI researchers depend on and a competitor to those researchers in developing the next generation of models.
This creates a genuine conflict. Claude Code's growth among ML practitioners isn't incidental — it's a direct result of Claude being genuinely capable at code and reasoning tasks. The better Claude gets, the more attractive it becomes as a research tool. The more it's used as a research tool, the more training signal it potentially generates for others. Anthropic's terms prohibit using Claude to train competing models, but enforcement of that rule through invisible performance degradation was the specific action researchers found unacceptable.
The reversal doesn't resolve the underlying tension. It restores one norm — that visible classifiers are better than silent ones — but leaves the core question open: on what terms should the company building some of the most capable AI systems in the world allow those systems to be used by the researchers building the next ones?
Anthropic's answer so far is: on disclosed terms, with visible enforcement, even at the cost of more false positives. That's a better answer than the one it shipped with on June 9. Whether it's a sufficient answer for a community that now knows the first attempt was invisible is a separate question — one that neither the apology nor the revised policy fully settles.

このコンテンツについて、さらに観点や背景を補足しましょう。

  • ログインするとコメントできます。