Claude Opus 4.7 Is Here — Anthropic's Most Capable GA Model
Anthropic released Claude Opus 4.7 on April 16, 2026. It replaces Opus 4.6 as the company's top-tier generally available model and is available across every Claude product: Claude.ai (Pro, Max, Team, Enterprise), the Messages API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
The API model identifier is claude-opus-4-7. Pricing is identical to Opus 4.6 — $5 per million input tokens and $25 per million output tokens — with up to 90% savings through prompt caching and 50% savings with batch processing. The context window remains at 1 million tokens.
Anthropic has been on a two-month cadence: Opus 4.5 shipped in late 2025, Opus 4.6 arrived in February 2026, and now Opus 4.7 lands in mid-April. Each release has represented a meaningful jump in coding and agentic performance — but this one carries extra weight because it arrives in the middle of a user revolt over Opus 4.6 quality degradation and growing questions about the unreleased Mythos model.
Benchmarks: Opus 4.7 vs GPT-5.4 vs Gemini 3.1 Pro
The benchmark story is straightforward: Opus 4.7 leads on most coding and agentic tasks, trades blows with GPT-5.4 on graduate-level reasoning, and extends Anthropic's lead in the metrics developers actually care about.
Agentic Coding (SWE-bench Pro)
| Model | Score |
|---|---|
| Claude Opus 4.7 | 64.3% |
| GPT-5.4 | 57.7% |
| Gemini 3.1 Pro | 54.2% |
| Claude Opus 4.6 | 53.4% |
Agentic Coding (SWE-bench Verified)
| Model | Score |
|---|---|
| Claude Opus 4.7 | 87.6% |
| Claude Opus 4.6 | 80.8% |
| Gemini 3.1 Pro | 80.6% |
Graduate-Level Reasoning (GPQA Diamond)
| Model | Score |
|---|---|
| GPT-5.4 Pro | 94.4% |
| Gemini 3.1 Pro | 94.3% |
| Claude Opus 4.7 | 94.2% |
Other Notable Results
- CursorBench (coding IDE): Opus 4.7 hits 70% vs 58% for Opus 4.6 — a 12-point jump on the benchmark Cursor uses internally.
- BigLaw Bench (Harvey legal reasoning): Opus 4.7 scores 90.9% at high effort with better reasoning about distinguishing legal provisions.
- CodeRabbit's 93-task benchmark: Opus 4.7 lifted resolution by 13% over Opus 4.6 and solved 4 tasks that neither Opus 4.6 nor Sonnet 4.6 could crack.
- Rakuten-SWE-Bench: Opus 4.7 resolves 3x more production tasks than Opus 4.6 with double-digit gains in code quality and test quality.
Anthropic flagged a subset of SWE-bench problems for potential memorization. Excluding those problems, Opus 4.7's margin over Opus 4.6 holds. The one benchmark where Opus 4.7 trails is GPQA Diamond — GPT-5.4 Pro edges it by 0.2 percentage points. On virtually everything else related to coding and agentic work, Opus 4.7 is the current leader.
New Feature: xhigh Effort Level
Opus 4.7 introduces xhigh, a new effort level slotted between high and max. This gives developers finer control over the reasoning-versus-latency tradeoff.
The effort spectrum is now: low → medium → high → xhigh → max.
Anthropic recommends starting with high or xhigh for coding and agentic use cases. Use xhigh when you need stronger reasoning without paying the full latency cost of max — debugging complex issues, architectural design, security analysis, and multi-step refactors are all good candidates.
In Claude Code, activate it with:

```shell
# In-session
/effort xhigh

# CLI flag
claude --effort xhigh

# Environment variable
export CLAUDE_CODE_EFFORT_LEVEL=xhigh
```

Claude Code now defaults to xhigh across all plans, a direct response to the effort-level controversy in which a silent downgrade from high to medium caused the 67% thinking-depth collapse.
New Feature: Task Budgets (Public Beta)
Task budgets solve one of the biggest pain points in agentic AI development: unbounded token consumption during long-running loops.
The feature lets developers set a rough token target for an entire agentic loop — covering thinking tokens, tool calls, tool results, and output. The model sees a running countdown and adapts its behavior accordingly: it prioritizes high-value work, skips low-impact steps, and finishes gracefully as the budget runs out instead of cutting off mid-task.
This is a game-changer for teams running Claude in production pipelines — framework migrations, large-scale refactors, and multi-file code generation all benefit from predictable cost ceilings. Before task budgets, the only option was to kill the process and lose partial progress, or let it run and hope the bill was reasonable.
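The behavior described above can be approximated client-side today. The sketch below is an illustration of the task-budget concept — track a rough token target across a loop and wind down gracefully as it runs out — not the actual beta API feature; the `TaskBudget` class, the 15% wrap-up reserve, and the step accounting are all assumptions of this example.

```python
# Client-side sketch of the task-budget idea: track a rough token target
# across an agentic loop and finish gracefully as the budget runs out.
# Illustrative only; the real feature is enforced server-side in the beta.

from dataclasses import dataclass

@dataclass
class TaskBudget:
    target_tokens: int
    spent_tokens: int = 0

    @property
    def remaining(self) -> int:
        return max(self.target_tokens - self.spent_tokens, 0)

    def record(self, step_tokens: int) -> None:
        self.spent_tokens += step_tokens

    def should_wrap_up(self, reserve: float = 0.15) -> bool:
        # Hold back a reserve (15% here, an arbitrary choice) so the loop
        # can summarize and finish cleanly instead of cutting off mid-task.
        return self.remaining <= self.target_tokens * reserve

def run_loop(budget: TaskBudget, step_costs: list[int]) -> list[str]:
    """Simulate an agentic loop; each entry in step_costs is one step's tokens."""
    log = []
    for cost in step_costs:
        if budget.should_wrap_up():
            log.append("wrap-up")
            break
        budget.record(cost)
        log.append(f"step:{cost}")
    return log
```

With a 10,000-token target and steps costing 4,000 + 4,000 + 1,500 tokens, the loop spends 9,500 tokens and then emits a wrap-up step instead of starting a fourth task.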
New Feature: High-Resolution Vision (3.75 Megapixels)
Opus 4.7 processes images at up to 2,576 pixels on the long edge — roughly 3.75 megapixels, more than 3x the resolution of any prior Claude model. Anthropic reports 98.5% visual acuity.
This is not an incremental improvement — it fundamentally changes what Claude can do with visual inputs:
- Document processing: .docx, .pptx, charts, and dense PDFs with fine print are now readable at near-human fidelity.
- UI generation: Because Opus 4.7 can actually see design details at full resolution, it generates significantly better slides, interfaces, and mockups.
- Computer use: 1:1 pixel coordinate mapping means screen-interaction agents can click precisely where they intend to, reading dense UIs that previous models blurred.
- Diagram extraction: Architecture diagrams, flowcharts, and technical schematics are now parseable without losing fine details.
For teams doing document-heavy work or building computer-use agents, the vision upgrade alone may justify the migration.
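The stated figures can be sanity-checked with simple arithmetic: a 2,576-pixel long edge works out to roughly 3.75 megapixels for a 16:9 image. The scaling helper below is illustrative — actual server-side resizing behavior is not documented here and may differ.

```python
# Quick arithmetic on the stated vision limits: a 2,576-pixel long edge
# yields ~3.73 MP for a 16:9 frame, matching the "roughly 3.75 megapixels"
# figure. fit_to_long_edge is a hypothetical helper, not an Anthropic API.

def fit_to_long_edge(width: int, height: int, max_edge: int = 2576) -> tuple[int, int]:
    """Scale (width, height) down so the longer edge is at most max_edge."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height  # already within the limit, no resize needed
    scale = max_edge / long_edge
    return round(width * scale), round(height * scale)

# A 4K screenshot (3840x2160, 16:9) scales to 2576x1449:
w, h = fit_to_long_edge(3840, 2160)
megapixels = w * h / 1_000_000  # ~3.73 MP
```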
New Feature: /ultrareview Command in Claude Code
Claude Code ships a new /ultrareview slash command alongside Opus 4.7. This is not a simple code check — it produces a dedicated, structured review session that analyzes architecture, security, performance, and maintainability across your changes.
The command flags issues that a careful human reviewer would catch: subtle bugs, design smells, performance regressions, security vulnerabilities, and maintainability concerns. It goes well beyond what the existing /review command covers.
Pro and Max Claude Code users receive 3 free ultrareviews to try it out. Given the depth of analysis it performs, this sits somewhere between an automated linter and a senior engineer's code review.
Better Instruction Following — and Why It Is a Breaking Change
Opus 4.7 follows instructions more literally than Opus 4.6. Anthropic explicitly flags this as a potential breaking change.
In Opus 4.6, vague or loosely-worded prompts often worked because the model would interpret intent generously. Opus 4.7 does what you actually said. If your prompt says "list three items" and you meant "list some items," you will get exactly three — not five or seven.
For most developers, this is an improvement. For anyone running production prompts tuned for Opus 4.6's looser interpretation, it means re-testing every prompt before migrating. Anthropic recommends auditing prompts that rely on implicit understanding rather than explicit instructions.
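One way to start that audit is a simple lint pass over your prompt library for phrasing that relied on loose interpretation. The heuristic below is a made-up starting point, not official Anthropic guidance — the word list and `audit_prompt` function are this article's own illustration.

```python
# Illustrative heuristic for auditing prompts before migration: flag
# phrasing that relied on Opus 4.6's generous interpretation.
# The pattern list is an example starting point, not an official checklist.

import re

VAGUE_PATTERNS = [
    r"\bsome\b", r"\ba few\b", r"\bas needed\b",
    r"\betc\.?", r"\bmaybe\b", r"\bif possible\b",
]

def audit_prompt(prompt: str) -> list[str]:
    """Return the vague phrases found in a prompt, lowercased."""
    hits = []
    for pattern in VAGUE_PATTERNS:
        for match in re.finditer(pattern, prompt, flags=re.IGNORECASE):
            hits.append(match.group(0).lower())
    return hits
```

A prompt like "List a few items, add tests if possible" gets flagged twice; "List exactly three items" passes clean — which is the direction Opus 4.7's literal instruction following rewards.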
Self-Verification and File System Memory
Two capabilities that matter most for long-running agentic work:
Self-verification. Opus 4.7 devises ways to verify its own outputs before reporting back. In autonomous coding sessions, this means it will run tests, check outputs, and confirm results rather than assuming its first attempt succeeded. It also correctly reports when data is missing instead of hallucinating plausible fallbacks — Anthropic says it resists "dissonant-data traps" that even Opus 4.6 falls for.
File system memory. The model is better at using file-based memory across long, multi-session work. It remembers important notes from previous sessions and uses them to move on to new tasks with less up-front context. This matters for anyone using Claude Code with CLAUDE.md files and project-level memory — the model actually reads and applies that context more reliably now.
The Tokenizer Change: Hidden Cost Increase
Opus 4.7 uses an updated tokenizer. The per-token price is identical to Opus 4.6, but the same input text can map to 1.0–1.35x more tokens depending on content type.
This means your effective cost per request can increase even though the price sheet has not changed. Code-heavy inputs may see a larger token count increase than natural language.
Additionally, Opus 4.7 thinks more at higher effort levels, generating more output tokens. Anthropic notes that "low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6" — meaning even baseline usage may consume more tokens than before.
Before migrating production traffic, test with the /v1/messages/count_tokens endpoint to see how your specific payloads are affected. Run cost projections against actual request volumes. The token increase is manageable for most workloads, but teams with high-volume API usage should model the impact before switching.
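The projection itself is straightforward arithmetic once you have token counts. The sketch below uses the prices and the 1.0–1.35x inflation range from this article; in practice you would feed it real counts from the count_tokens endpoint rather than estimates.

```python
# Sketch of a migration cost projection using the figures in this article:
# $5 / $25 per million input / output tokens, and a new tokenizer that maps
# the same text to 1.0-1.35x more tokens. Replace the token counts with
# real numbers from /v1/messages/count_tokens before trusting the result.

INPUT_PRICE_PER_MTOK = 5.00    # USD per million input tokens
OUTPUT_PRICE_PER_MTOK = 25.00  # USD per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a given token volume at Opus 4.7 list prices."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

def projected_range(input_tokens: int, output_tokens: int,
                    low: float = 1.0, high: float = 1.35) -> tuple[float, float]:
    """Cost range after applying the tokenizer inflation factor to 4.6-era counts."""
    return (request_cost(round(input_tokens * low), round(output_tokens * low)),
            request_cost(round(input_tokens * high), round(output_tokens * high)))
```

For a workload that consumed 100M input and 10M output tokens on Opus 4.6 ($750), the projected Opus 4.7 range is $750 to $1,012.50 — the worst case is a 35% increase even though per-token prices are unchanged.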
The Mythos Elephant in the Room
Anthropic publicly conceded something unusual: Opus 4.7 does not match the performance of Claude Mythos Preview, the model we covered in our Mythos / Project Glasswing deep dive.
Mythos Preview is only available to handpicked companies through Project Glasswing, Anthropic's cybersecurity initiative. Opus 4.7 had its cyber capabilities deliberately reduced during training — Anthropic experimented with "differentially reducing" these capabilities and deployed automatic safeguards that detect and block prohibited or high-risk cybersecurity requests.
The strategy is explicit: learn from deploying cyber safeguards on Opus 4.7, then use those lessons to eventually release Mythos-class models to the general public. Anthropic also launched a Cyber Verification Program for security professionals who want to use Opus 4.7 for legitimate penetration testing and red-teaming.
This is a notable shift in transparency. Rather than quietly releasing a model that is "not Mythos," Anthropic is openly discussing the gap and the roadmap to close it. Expect broader Mythos access later in 2026.
Context: The Opus 4.6 'Nerfing' Controversy
This release does not exist in a vacuum. It arrives amid weeks of user complaints that Opus 4.6 had quietly gotten worse.
The highest-profile case: Stella Laurenzo, Senior Director of AI at AMD, analyzed 6,852 Claude Code sessions and published a forensics report showing thinking depth had dropped roughly 67%. Her conclusion — "Claude cannot be trusted to perform complex engineering tasks" — was picked up by The Register, WinBuzzer, and InfoWorld. AMD switched AI coding providers.
Users on Reddit and GitHub speculated that Claude was deliberately "nerfed" to cut costs or redirect compute to Mythos training. Anthropic denied the quality changes were related to resource allocation.
Opus 4.7 is, in many ways, Anthropic's answer to that backlash. The xhigh default effort, the self-verification capabilities, and the improved instruction following all directly address the specific complaints that dominated the Opus 4.6 era.
Early Tester Feedback
Several companies tested Opus 4.7 before launch. Here is what they reported:
- CodeRabbit called it the sharpest model they have tested, with recall improving over 10% on code review workloads. They noted it was faster than GPT-5.4 at xhigh effort.
- Genspark reported the best loop resistance, consistency, and error recovery of any model they have evaluated — the highest quality-per-tool-call ratio they have measured.
- Warp found it measurably more thorough than Opus 4.6, passing Terminal Bench tasks that prior Claude models had failed and working through a tricky concurrency bug.
- Hex described it as the strongest model they have evaluated. They confirmed that low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6, and that it resists dissonant-data traps that caught previous models.
- Harvey validated the 90.9% BigLaw Bench score, describing the responses as correct, thorough, and well-cited with better reasoning about distinguishing legal provisions.
- Cursor measured 70% on CursorBench versus 58% for Opus 4.6 — a meaningful 12-point jump in their internal coding IDE benchmark.
- Box reported 14% improvement over Opus 4.6 at fewer tokens and one-third the tool errors. They noted it was the first model to pass their implicit-need tests.
- Intuit found it catches logical faults during planning and accelerates execution, resolving 3x more production tasks than Opus 4.6.
Claude Code Updates Bundled with Opus 4.7
Several Claude Code changes ship alongside the model:
- Auto mode is now available for Max plan subscribers. Previously limited to Team, Enterprise, and API users, Auto mode lets Claude make autonomous decisions with fewer interruptions during coding sessions.
- `/ultrareview` — the new dedicated code review command with 3 free sessions for Pro/Max users.
- Default reasoning level has been adjusted. Claude Code now defaults to `xhigh` effort across all plans.
- Update with `claude update` or `npm install -g @anthropic-ai/claude-code@latest`.
- Select the model with `/model opus` — the alias automatically points to the latest Opus release.
If you were running the workarounds from the 67% degradation issue (CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING=1 and /effort max), you should re-test with default settings on Opus 4.7 before deciding whether to keep those overrides.
Availability, Migration, and Breaking Changes
Available today on: Claude.ai, the Messages API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
GitHub Copilot: Rolling out to Copilot Pro+, Business, and Enterprise users across VS Code, Visual Studio, JetBrains, Xcode, and github.com. Opus 4.7 launches with a 7.5x premium request multiplier (promotional pricing through April 30, 2026). It will replace Opus 4.5 and 4.6 in the Copilot model picker over the coming weeks.
Breaking changes to watch for:
- Extended thinking budgets removed — the API no longer accepts this parameter.
- Sampling parameters removed — adjust your API calls accordingly.
- New tokenizer — same text maps to 1.0–1.35x more tokens.
- Stricter instruction following — vague prompts that worked on Opus 4.6 may produce unexpected results.
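A pre-flight scrub over stored request payloads is one way to catch the removed parameters before they hit the new API. The field names below (`thinking`, `temperature`, `top_p`, `top_k`) follow the Opus 4.6-era Messages API and are assumptions of this sketch; check your own payloads for whichever parameters you actually set.

```python
# Hedged pre-flight sketch for the breaking changes listed above: strip
# request fields this article says Opus 4.7 no longer accepts. The exact
# field names are assumptions based on the Opus 4.6-era Messages API.

REMOVED_FIELDS = {"thinking", "temperature", "top_p", "top_k"}

def scrub_request(payload: dict) -> tuple[dict, list[str]]:
    """Return a cleaned copy of the payload plus the names of removed fields."""
    removed = sorted(k for k in payload if k in REMOVED_FIELDS)
    cleaned = {k: v for k, v in payload.items() if k not in REMOVED_FIELDS}
    return cleaned, removed
```

Log the `removed` list in staging for a week before migrating; if it is always empty, your payloads were already clean.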
The API identifier is claude-opus-4-7. If you are pinned to claude-opus-4-6, your code will continue to work until Anthropic deprecates that identifier — but you should begin migration testing now.
What This Means for Your Stack
If you use Claude Code: Update immediately with claude update. Switch to /model opus and test xhigh effort on your hardest tasks. The default settings are already better than the max-effort workarounds most developers were running on 4.6.
If you are on the API: Test your prompts with the new tokenizer before migrating production traffic. Use /v1/messages/count_tokens to project cost changes. Review any prompts that relied on Opus 4.6's loose interpretation of instructions.
If you build coding agents: Task budgets are the headline feature for you. Set token ceilings on long-running loops instead of crossing your fingers on the bill. Combine with xhigh effort for the best quality-per-token ratio.
If you process documents: The 3.75-megapixel vision upgrade is a step change for .docx, .pptx, and chart analysis. Test your document pipelines — you may be able to simplify preprocessing steps that were compensating for lower-resolution parsing.
If Opus 4.6 frustrated you: This is the fix. The benchmarks back it up, the early testers confirm it, and the architectural changes (xhigh default, self-verification, better memory) directly address the specific failure modes that made Opus 4.6 unreliable.
The Mythos factor: Anthropic is sitting on something more powerful. The Cyber Verification Program and Project Glasswing are the path to broader access. Expect Mythos-class capabilities to reach general availability later in 2026.
For broader context on the AI landscape, see the Stanford AI Index 2026 report published today, our coverage of GPT-5.4 Computer Use, and the best AI coding tools for 2026.
Competitive Landscape
The top of the leaderboard in April 2026 is a three-way race:
- Anthropic leads Arena rankings as of March 2026. Opus 4.7 extends the gap on coding and agentic tasks. 8 of the Fortune 10 are Claude customers.
- OpenAI counters with GPT-5.4, which edges ahead on GPQA Diamond reasoning and has its own computer-use capabilities. OpenAI is also pursuing the SPUD superintelligence initiative.
- Google holds its own with Gemini 3.1 Pro, particularly on multilingual benchmarks, though it trails on the coding-specific evals.
Anthropic's growth numbers tell the broader story: Claude traffic has grown roughly 5x over the past year, and the company raised $30 billion at a $380 billion valuation in February 2026. The pace of releases — three Opus models in six months — shows no sign of slowing.
Explore all 50+ free developer and AI tools on DevPik.