Claude Mythos vs Opus 4.6: What the Leaked Benchmarks Show

TL;DR: Claude Mythos is not a better Opus — it is a different tier entirely. Leaked internal documents show Mythos “significantly outperforms Opus 4.6 across all core benchmarks,” with particular dominance in coding, reasoning, and cybersecurity. But the documents use qualitative language, not hard numbers. Here is what we can actually confirm, what remains speculative, and what the gap between the two models means for different users.

Claude Mythos vs Opus: Different Tiers, Not Versions

The most common misconception about Claude Mythos is that it is Opus 4.7. It is not. Mythos belongs to Capybara, a completely new tier in Anthropic’s model hierarchy.

The full lineup now reads: Haiku, Sonnet, Opus, Capybara. Each tier represents a distinct performance class, not a version bump within an existing class. Haiku is the lightweight speed tier. Sonnet is the balanced workhorse. Opus is the flagship reasoning model. And Capybara — with Mythos as its first model — sits above all of them.

Anthropic’s leaked draft blog states it plainly: Capybara is “larger and more intelligent than our Opus models — which were, until now, our most powerful.”

This distinction matters because it sets expectations. Comparing Mythos to Opus is not like comparing GPT-4o to GPT-4. It is more like comparing a fighter from a heavier weight class to the previous champion. The leaked draft describes a larger model, and the staged release plan points to different target use cases; the architecture itself is not described in the leaked material.

Comparing Claude Mythos and Opus Across Dimensions

The leaked documents provide enough detail to sketch a comparison across several dimensions. Some of these are confirmed by the source material. Others require inference.

Coding: The leaked draft claims Mythos achieves “dramatically higher scores” on software coding benchmarks compared to Opus 4.6. No specific numbers are given, but the language is unambiguous: this is not a marginal gain. Opus 4.6 was already considered the strongest coding model available. Mythos apparently moves the ceiling significantly higher. Status: ✅ Confirmed (qualitative claim from leaked documents)

Reasoning: Academic reasoning benchmarks tell the same story: “dramatically higher” scores. Opus 4.6 already excelled at multi-step reasoning, complex analysis, and tasks requiring deep chain-of-thought. The leaked materials suggest Mythos extends this advantage by a meaningful margin. Status: ✅ Confirmed (qualitative claim from leaked documents)

Cybersecurity: This is where the gap appears widest. The leaked internal assessment describes Mythos as “far ahead of any other AI model” in cyber capabilities. That means ahead not just of Opus but of every model, from any company. The 90-minute red team attack chain that included discovering a 20-year-old Linux kernel zero-day is the most concrete evidence of this capability jump. Status: ✅ Confirmed (qualitative claim from leaked documents)

Latency: Opus 4.6 has a time-to-first-token (TTFT) of approximately 1.2 seconds in typical usage. Mythos latency has not been disclosed. Given that Capybara is described as “larger” than Opus, it is reasonable to expect higher latency. How much higher is unknown. A model this powerful likely requires significant compute per token, which could push TTFT well beyond what Opus users are accustomed to. Status: ⚠️ Speculative (no latency data in leaked documents)

Context Window: Opus 4.6 supports a context window of up to 1 million tokens. The leaked documents do not mention Mythos’s context window size. It could match Opus, exceed it, or be smaller if Anthropic made architectural trade-offs favoring depth of reasoning over context length. Status: ⚠️ Speculative (no context window data in leaked documents)

Cost: Opus 4.6 is already the most expensive Claude model to use. The leaked draft describes Mythos as “very expensive to serve,” which suggests pricing will sit well above current Opus rates. For context, Opus 4.6 API pricing is already a barrier for many developers. Mythos will likely be reserved for use cases where the performance gain justifies significantly higher costs. Status: ⚠️ Speculative (no pricing data, but “very expensive to serve” is confirmed language)
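Because the latency dimension above rests on time-to-first-token, it helps to be precise about what that metric measures. The following is a minimal Python sketch, not a benchmark of any real model: `measure_ttft` and `fake_stream` are hypothetical names introduced for illustration, and `fake_stream` simulates a streaming response with sleeps rather than calling any API. The leaked documents contain no Mythos latency data, so the delays used here are arbitrary.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, float, str]:
    """Consume a token stream and return (ttft_seconds, total_seconds, text).

    TTFT is the delay between starting to read the stream and receiving
    the first token; total time covers the whole response.
    """
    start = time.perf_counter()
    first_token_at = None
    pieces: List[str] = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first token arrived
        pieces.append(token)
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    return first_token_at - start, end - start, "".join(pieces)


def fake_stream(first_delay: float, per_token_delay: float,
                tokens: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response (assumption)."""
    time.sleep(first_delay)  # simulated wait before the first token
    for i, token in enumerate(tokens):
        if i > 0:
            time.sleep(per_token_delay)  # simulated gap between tokens
        yield token


if __name__ == "__main__":
    ttft, total, text = measure_ttft(
        fake_stream(0.05, 0.01, ["Hello", ", ", "world"])
    )
    print(f"TTFT: {ttft:.3f}s, total: {total:.3f}s, text: {text!r}")
```

To compare two models in practice, the same function could wrap each model's streaming iterator, with results averaged over many requests, since single measurements vary with network and load.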

What ‘Step Change’ Really Means for Claude Mythos

Anthropic’s internal documents use the phrase “step change” to describe the gap between Mythos and Opus. This word choice is deliberate.

In technical contexts, “step change” has a specific meaning: a discrete jump from one level to another, as opposed to a gradual slope. It is the language of phase transitions, not incremental improvement. Anthropic could have written “improvement” or “advancement” — they chose “step change.”

The leaked documents also state that Mythos “significantly outperforms Opus 4.6 across all core benchmarks.” Not some benchmarks. All core benchmarks. And “significantly,” not “slightly” or “moderately.”

What is notably absent from the leaked materials is precise numbers. There are no percentage improvements, no specific benchmark scores, no head-to-head tables. This is almost certainly intentional. Anthropic uses descriptive language — “dramatically higher,” “far ahead,” “significantly outperforms” — rather than quantitative claims.

There are two plausible explanations. First, the leaked documents may have been drafts intended for internal alignment, not public release, where qualitative framing is common. Second, Anthropic may be deliberately withholding numbers to control the narrative ahead of an official announcement.

Either way, the pattern is consistent: every qualitative descriptor points in the same direction, and none of them are hedged.

What We Can and Cannot Compare

Before drawing conclusions, it is worth being explicit about what is actually established and what remains uncertain.

Confirmed facts:

✅ Claude Mythos exists as a real model in Anthropic’s pipeline.

✅ Capybara is a new tier above Opus, not a version within it.

✅ Anthropic plans early access for cyber defenders before broader release.

✅ Internal documents call Mythos “by far the most powerful AI model we’ve ever developed.”

✅ The leaked assessments describe dramatically higher performance in coding, reasoning, and cybersecurity.

Speculative or unknown:

⚠️ Exact benchmark scores and percentage improvements over Opus 4.6.

⚠️ Time-to-first-token latency and inference speed.

⚠️ Context window size.

⚠️ API pricing and availability timeline.

⚠️ Whether the Capybara tier will have multiple models (as the Sonnet tier has had 3.5 and 3.7) or remain a single-model tier.

The leaked documents are internally consistent and detailed enough to be credible. But they are drafts, not published specifications. Treat the qualitative claims as directionally reliable and the specific technical parameters as unknown until Anthropic confirms them.

What Claude Mythos vs Opus Means for Users

The implications differ sharply depending on who you are.

For developers: Mythos is potentially transformative for hard coding problems — the kind where Opus 4.6 already outperforms competitors but still hits walls. Think large-scale refactoring, complex system architecture, and multi-file reasoning across massive codebases. The catch: it will almost certainly be expensive and initially restricted. Do not expect drop-in replacement for your Opus workflows on day one.

For enterprises: The most immediate use case is defensive cybersecurity. Anthropic’s staged release plan prioritizes giving defenders a head start with the model. If your organization operates critical infrastructure or handles sensitive data, early access to Mythos could meaningfully change your security posture. The ability to discover vulnerabilities at the speed described in the red team assessments — before attackers do — is the core value proposition.

For the broader market: Mythos puts competitive pressure on every other AI lab. OpenAI, Google, and others now face a new benchmark — not just matching Opus, but matching whatever sits above it. If the Capybara tier performs as described, the gap between Anthropic and its closest competitors may widen before it narrows.

The key uncertainty is access. A model that is too expensive or too restricted to use widely is impressive but not disruptive. How Anthropic handles pricing, rate limits, and rollout will determine whether Mythos reshapes the market or remains an elite tool for a small number of high-value applications.
