Claude Sonnet 5 Review: Benchmarks, Cost Per Task & How It Compares to Opus 4.8

Claude Sonnet 5 scores 53 on the Artificial Analysis Intelligence Index and matches Opus 4.8 on agentic tasks — but at standard pricing it costs more per task than the model it's supposed to undercut. Here's what the numbers actually show.

bhargavjoshi1237 | July 1, 2026 | 6 min read

Claude Sonnet 5 Is a Strong Model With a Cost Problem Nobody's Talking About

Anthropic released Claude Sonnet 5 on June 30, and the benchmark numbers from Artificial Analysis are worth going through carefully — because the headline score is genuinely good, but the cost story is more complicated than the launch framing suggests.

The short version: Sonnet 5 scores 53 on the Artificial Analysis Intelligence Index, which puts it at #5 overall. That's a 6-point improvement over Sonnet 4.6 and puts it level with GPT-5.5 at high reasoning. Solid progress. But at max effort and standard pricing, it costs more per task than Opus 4.8, the model that sits above it as Anthropic's premium tier.

That tension is worth unpacking.

What the Score Actually Means

Sonnet 5 at 53 sits 2-3 points behind GPT-5.5 at xhigh reasoning and Claude Opus 4.8 at max. For a mid-tier model in the Anthropic lineup, that's a meaningful result. It's not leading the pack, but it's no longer a clear step down from frontier either.

The gains over Sonnet 4.6 are concentrated in a few places. According to Artificial Analysis's evaluation, it picked up 10 points on Humanity's Last Exam, 9 points on TerminalBench v2.1, and 7 points on SciCode. On CritPt — a frontier physics reasoning benchmark developed by researchers at Argonne and UIUC — it scores 17%, which is 14 points higher than Sonnet 4.6. That sounds good until you see that GLM-5.2, Opus 4.8, and GPT-5.5 are all ahead of it on that benchmark. The hard reasoning ceiling is still there; Sonnet 5 just pushed closer to it.

Everywhere else, scores are mostly flat. The improvements are real, but they're targeted — not a broad capability jump across the board.

The Agentic Story Is Where It Gets Interesting

On agentic knowledge work benchmarks, Sonnet 5 actually trades blows with Opus 4.8. Artificial Analysis tested both models on AA-Briefcase and GDPval-AA — evaluations that test multi-step professional task completion using their open-source Stirrup agent harness. On both, Sonnet 5 sits just ahead of Opus 4.8, trailing only Claude Fable 5 (which isn't publicly available yet).

That's a notable result. The model that's supposed to be the "efficient" option is outperforming the flagship on real-world agentic tasks. If your workload is primarily long-horizon knowledge work — research tasks, multi-step document workflows, extended agent sessions — Sonnet 5 is genuinely worth considering over Opus 4.8.

The way it gets there matters, though.

It Works Harder to Get Those Results

Sonnet 5 at max effort uses roughly 40% more output tokens per Intelligence Index task than Sonnet 4.6 did. On GDPval-AA, it uses about 3x the agentic turns compared to Sonnet 4.6. Artificial Analysis also found that the behavior scales predictably with the effort setting — max effort uses around 6x more turns than low effort on GDPval-AA.

The new xhigh effort setting (matching the five effort levels now available on Opus 4.8) gives you more control over this. At lower effort settings the model is cheaper; at max effort it's expensive. That tradeoff is more explicit than it used to be, which is good for cost management, but it also means you need to think about which effort level you're actually evaluating against when you compare benchmarks.

The Cost Math Is the Real Story

Here's where it gets uncomfortable: Artificial Analysis ran their evaluation at standard $3/$15 per 1M input/output token pricing, and Sonnet 5 comes out at $2.29 per Intelligence Index task. That's roughly double what Sonnet 4.6 cost per task, and about 15% more expensive than Opus 4.8.
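The two comparisons above also pin down rough per-task figures for the other two models. Here is a minimal sketch of that arithmetic, assuming "roughly double" means a factor of 2.0 and "about 15%" means a factor of 1.15. The derived numbers are back-of-envelope estimates, not figures published by Artificial Analysis.

```python
# Back-of-envelope estimates derived from the ratios quoted in the text.
SONNET_5_COST_PER_TASK = 2.29  # USD, per Artificial Analysis at standard pricing

# Assumptions: the article only says "roughly double" and "about 15% more".
SONNET_46_RATIO = 2.0   # Sonnet 5 cost / Sonnet 4.6 cost (assumed exact)
OPUS_48_RATIO = 1.15    # Sonnet 5 cost / Opus 4.8 cost (assumed exact)

implied_sonnet_46 = SONNET_5_COST_PER_TASK / SONNET_46_RATIO
implied_opus_48 = SONNET_5_COST_PER_TASK / OPUS_48_RATIO

print(f"Implied Sonnet 4.6 cost per task: ~${implied_sonnet_46:.2f}")
print(f"Implied Opus 4.8 cost per task:   ~${implied_opus_48:.2f}")
```

Under those assumptions, Sonnet 4.6 lands at roughly $1.15 per task and Opus 4.8 at roughly $1.99, so the gap between Sonnet 5 and Opus 4.8 is about 30 cents per task.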

Anthropic is currently running a promotional discount of $2/$10 per 1M tokens until September 1, which softens this significantly. But the promotional pricing is temporary, and Artificial Analysis's numbers reflect what you'll pay once it expires. Per-token pricing is identical to Sonnet 4.6 at $3/$15, compared with $5/$25 for Opus 4.8. The cost increase over Sonnet 4.6 is driven entirely by the model using more tokens, not by higher per-token rates.
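To see how much the promotion softens the picture, note that the discount is the same fraction on both input and output tokens. The sketch below scales the $2.29 figure accordingly. It assumes the whole per-task bill moves with the input/output rates; the source does not say whether cache rates are discounted during the promotion, so caching is ignored here.

```python
# Standard and promotional rates in USD per 1M tokens: (input, output).
STANDARD = (3.00, 15.00)
PROMO = (2.00, 10.00)

STANDARD_COST_PER_TASK = 2.29  # USD, per Artificial Analysis

# The promotion cuts input and output rates by the same fraction,
# so the per-task cost scales by that fraction whatever the token mix.
input_ratio = PROMO[0] / STANDARD[0]
output_ratio = PROMO[1] / STANDARD[1]
assert abs(input_ratio - output_ratio) < 1e-12

# Assumption: no cache pricing effects, same token usage per task.
promo_cost_per_task = STANDARD_COST_PER_TASK * input_ratio

print(f"Discount factor: {input_ratio:.3f}")
print(f"Estimated promotional cost per task: ~${promo_cost_per_task:.2f}")
```

That works out to about two thirds of the standard figure, or roughly $1.53 per task while the promotion lasts.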

Cache pricing stays the same: 25% premium for cache writes at $3.75/M with a 5-minute TTL, and 90% discount for cache hits at $0.30/M.
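Those cache rates imply a quick break-even: a cached prefix costs more on the first request and far less on every reuse. Here is a minimal sketch, assuming every later request arrives inside the 5-minute TTL and hits the cache. The function name and the example prefix size are hypothetical.

```python
# Rates in USD per 1M tokens, taken from the figures quoted above.
BASE_INPUT = 3.00
CACHE_WRITE = 3.75   # 25% premium over base input
CACHE_HIT = 0.30     # 90% discount from base input


def prefix_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Cost in USD of sending the same prompt prefix `requests` times.

    Assumption: with caching, the first request writes the cache and every
    later request is a cache hit within the TTL.
    """
    millions = prefix_tokens / 1_000_000
    if not cached or requests == 0:
        return millions * BASE_INPUT * requests
    return millions * (CACHE_WRITE + CACHE_HIT * (requests - 1))


# Hypothetical example: a 100k-token prefix reused across several requests.
for n in (1, 2, 10):
    plain = prefix_cost(100_000, n, cached=False)
    with_cache = prefix_cost(100_000, n, cached=True)
    print(f"{n:>2} requests: uncached ${plain:.3f}, cached ${with_cache:.3f}")
```

With a single request the cached path is the more expensive one because of the write premium; from the second request onward it is cheaper.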

Context window is 1M tokens, unchanged from Sonnet 4.6.

How to Think About This

If you're evaluating Sonnet 5 against Sonnet 4.6 at promotional pricing, it looks like a strong upgrade — more capable, same price range. If you're evaluating it at standard pricing against Opus 4.8, it's a harder sell on pure cost grounds, even though it matches or beats Opus on agentic benchmarks.

The clearest case for Sonnet 5 is someone running long-horizon agent workloads who currently uses Opus 4.8. On those specific tasks, Sonnet 5 performs comparably at a lower per-token rate, even if the increased turn count partially offsets that advantage.
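How much extra token usage can the lower rate absorb before the advantage disappears? The sketch below works that out from the standard rates quoted earlier. It ignores caching and the promotional discount, and the "implied multiple" at the end is an inference from the 15% per-task figure, not a number Artificial Analysis reports.

```python
# Standard rates in USD per 1M tokens: (input, output).
SONNET_5_RATES = (3.00, 15.00)
OPUS_48_RATES = (5.00, 25.00)

# Sonnet 5 is priced at the same fraction of Opus 4.8 on both sides.
rate_ratio = SONNET_5_RATES[0] / OPUS_48_RATES[0]
assert abs(rate_ratio - SONNET_5_RATES[1] / OPUS_48_RATES[1]) < 1e-12

# Sonnet 5 stays cheaper per task while its price-weighted token usage is
# below this multiple of what Opus 4.8 uses on the same task.
breakeven_token_multiple = 1 / rate_ratio


def sonnet_is_cheaper(token_multiple: float) -> bool:
    """True if Sonnet 5 costs less per task at the given usage multiple."""
    return token_multiple < breakeven_token_multiple


# Inference (assumption): a 15% higher per-task cost on the Intelligence
# Index would correspond to this price-weighted token multiple.
implied_index_multiple = 1.15 / rate_ratio

print(f"Break-even token multiple: {breakeven_token_multiple:.2f}x")
print(f"Implied multiple on the Intelligence Index: {implied_index_multiple:.2f}x")
```

The break-even sits at about 1.67x. For an agent workload, the question is whether the extra turns keep Sonnet 5's total token usage under that multiple; measuring it on your own tasks is the only way to know.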

For heavy reasoning and knowledge tasks outside the agentic context — think deep research, complex science problems, or anything CritPt-adjacent — Opus 4.8 still has the edge, and the gap is meaningful enough that it's not a coin flip.

Specs at a Glance

Intelligence Index score: 53 (max effort)
Ranking: #5 overall per Artificial Analysis
Pricing: $3/$15 per 1M input/output tokens (promotional $2/$10 until September 1)
Cache writes: $3.75/M (25% premium, 5-minute TTL)
Cache hits: $0.30/M (90% discount)
Context window: 1M tokens
Effort settings: low, medium, high, xhigh, max (new xhigh level vs Sonnet 4.6)
Cost per task at standard pricing: ~$2.29 (per Artificial Analysis evaluation)

Bottom Line

Sonnet 5 is a real improvement over Sonnet 4.6, and its agentic performance is legitimately impressive relative to where Anthropic's mid-tier model used to sit. But the cost-per-task increase is significant and gets papered over somewhat by the launch pricing. Anyone building production workloads on this model should run their own cost projections at standard rates rather than assuming the promotional price reflects the long-term economics.
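A projection like that can be a few lines of code. The sketch below compares a month of usage at promotional and standard rates. The workload numbers (task count and average tokens per task) are hypothetical placeholders; substitute measurements from your own logs. Caching is left out for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Token prices in USD per 1M tokens."""
    input_per_m: float
    output_per_m: float


STANDARD = Rates(input_per_m=3.00, output_per_m=15.00)
PROMO = Rates(input_per_m=2.00, output_per_m=10.00)


def monthly_cost(tasks: int, avg_input_tokens: int,
                 avg_output_tokens: int, rates: Rates) -> float:
    """Projected monthly spend in USD, ignoring cache reads and writes."""
    per_task = (avg_input_tokens * rates.input_per_m
                + avg_output_tokens * rates.output_per_m) / 1_000_000
    return tasks * per_task


# Hypothetical workload: replace with your own measured averages.
TASKS_PER_MONTH = 10_000
AVG_INPUT_TOKENS = 200_000
AVG_OUTPUT_TOKENS = 20_000

promo = monthly_cost(TASKS_PER_MONTH, AVG_INPUT_TOKENS, AVG_OUTPUT_TOKENS, PROMO)
standard = monthly_cost(TASKS_PER_MONTH, AVG_INPUT_TOKENS, AVG_OUTPUT_TOKENS, STANDARD)

print(f"Promotional: ${promo:,.0f}/month")
print(f"Standard:    ${standard:,.0f}/month")
print(f"Increase when the promotion ends: {standard / promo - 1:.0%}")
```

Whatever the workload, the standard-rate bill comes out 50% higher than the promotional one, so budgets set during the launch window need that headroom.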

If agentic knowledge work is your primary use case and you're currently on Opus 4.8, Sonnet 5 is worth a serious evaluation. If you need the best reasoning performance possible and cost is secondary, Opus 4.8 still holds that position.