Jun 28, 2026

Terminal-Bench 2.1 in June 2026: The #1 Model Is One You Can't Use — Here's the Leaderboard That Actually Matters

By AICoderScope Team · 10 min read

TL;DR: Claude Fable 5 leads Terminal-Bench 2.1 at 88.0%, the first model to break 85%, but it has been offline under a US export-control order since June 12 and had still not been restored as of June 27. Among tools you can actually pay for and run today, Codex CLI on GPT-5.5 (83.4%) edges Claude Code on Opus 4.8 (82.7%). That 0.7-point gap is noise. Don’t switch your stack over it.

	Codex CLI + GPT-5.5	Claude Code + Opus 4.8	Claude Code + Fable 5
Best for	Long-context, autonomous runs	Interactive, plan-first work	Nobody — it’s suspended
Terminal-Bench 2.1 (native harness)	83.4%	82.7%	88.0%
Price (per M tokens)	$5 in / $30 out	$5 in / $25 out	$10 in / $50 out
Availability today	Full	Full	Suspended (US export order)

Honest take: The benchmark’s real lesson in June 2026 isn’t “which model wins.” It’s that the model winning by 4.6 points is one you legally cannot access. Pick your tool on harness fit, cost, and availability — not on a leaderboard whose top row is unbuyable.

The number everyone is quoting is unbuyable

The headline result from Terminal-Bench 2.1 this month reads great: Claude Fable 5 hit 88.0%, the first model ever to clear 85% on the benchmark, finishing 4.6 points ahead of GPT-5.5. Anthropic’s Mythos-class model looked like the new ceiling for agentic coding.

Then on June 12, three days after Fable 5 launched, the US government issued an export-control directive ordering Anthropic to suspend all access to Fable 5 and Mythos 5. The models went dark globally, for every customer on every platform. By June 27, the government had allowed Mythos 5 back for a narrow set of US critical-infrastructure organizations. Fable 5 had not been restored. Anthropic says it is working to bring it back “as soon as possible,” with no date.

So the model topping Terminal-Bench 2.1 is one you can’t use to write code today. That single fact reframes the entire leaderboard. If you’re choosing a coding agent in late June 2026, the 88.0% row is trivia. The decision is between the next two rows — both of which you can install and pay for right now.

The leaderboard that actually matters

Strip out the suspended models and Terminal-Bench 2.1’s native-harness leaderboard (tbench.ai, which runs each agent in its own tooling) looks like this for tools you can buy today:

Rank (usable tools)	Agent	Model	Score
1	Codex CLI	GPT-5.5	83.4%
2	Claude Code	Opus 4.8	82.7%
3	Gemini CLI	Gemini 3.1 Pro	~70.7%

GPT-5.5 in Codex CLI leads the usable field by 0.7 points over Opus 4.8 in Claude Code. On a benchmark of 31 models running real multi-step terminal tasks — package installs, git operations, build fixes, server config — a sub-one-point gap is measurement noise. Run the suite again next week and the order could flip. Neither tool is meaningfully “better at coding” than the other based on this.
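
One rough way to put a number on “noise” is to treat each score as a pass rate over N independent tasks and compute a binomial standard error. The sketch below does that. The task count (`N_TASKS = 100`) is a placeholder assumption, since the suite size isn’t stated here, and treating tasks and runs as independent is a simplification; swap in the real count before drawing conclusions.

```python
import math

# ASSUMPTION: placeholder suite size. The article does not state how many
# tasks Terminal-Bench 2.1 contains; replace this with the real count.
N_TASKS = 100


def pass_rate_se(pass_rate: float, n_tasks: int) -> float:
    """Binomial standard error of a pass rate measured over n_tasks tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


def gap_in_standard_errors(rate_a: float, rate_b: float, n_tasks: int) -> float:
    """Gap between two pass rates, in units of the standard error of the difference.

    Treats the two runs as independent, which is a simplification.
    """
    se_diff = math.hypot(pass_rate_se(rate_a, n_tasks), pass_rate_se(rate_b, n_tasks))
    return abs(rate_a - rate_b) / se_diff


# Scores from the leaderboard table above, as fractions.
codex_gpt55 = 0.834    # Codex CLI + GPT-5.5
claude_opus48 = 0.827  # Claude Code + Opus 4.8
gemini_cli = 0.707     # Gemini CLI (approximate in the source)

top_two = gap_in_standard_errors(codex_gpt55, claude_opus48, N_TASKS)
to_third = gap_in_standard_errors(claude_opus48, gemini_cli, N_TASKS)
print(f"top-two gap: {top_two:.2f} standard errors")
print(f"gap to third place: {to_third:.2f} standard errors")
```

Under these assumptions the top-two gap is well under one standard error, while the gap to third place is roughly two. A larger suite would shrink the error bars, but the 0.7-point gap would need a far larger task count than this placeholder before it looked like a real difference.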

The third-place drop to Gemini CLI is the more interesting signal: roughly 12 points back. The top two are in a different class from everything else you can run.

Why the same model scores two different numbers

Here’s the part most leaderboard coverage skips, and it changes how you should read every score above.

Terminal-Bench publishes two views. The tbench.ai leaderboard runs each agent in its native harness — Codex CLI wraps GPT-5.5 the way OpenAI built it, Claude Code wraps its models the way Anthropic built them. That’s an apples-to-apples tool comparison. It answers “which product, as shipped, completes the most tasks.”

The vals.ai leaderboard runs every model through the same harness (Terminus 2). That’s an apples-to-apples model comparison. It strips the tooling out and asks “which raw model is strongest.”

Put them side by side and the gap is impossible to ignore:

Model	tbench.ai (native harness)	vals.ai (Terminus 2, uniform)
Claude Fable 5	88.0%	80.52%
GPT-5.5	83.4% (Codex CLI)	76.40%
Gemini 3.5 Flash	—	74.16%
Claude Opus 4.8	82.7% (Claude Code)	71.91%

GPT-5.5 scores 83.4% inside Codex CLI but only 76.40% inside Terminus 2 — a 7-point swing. The model didn’t change. The agent loop wrapping it did. That gap is the harness: how the tool plans, retries failed commands, manages context, and decides when a task is done.

Seven points is larger than the entire margin between the top two tools. The practical takeaway is blunt: the harness can matter more than the model. A strong model in a mediocre agent loop loses to a slightly weaker model in a well-engineered one. When you pick a coding agent, you’re not picking a model — you’re picking a model plus the software that drives it, and the driving is doing a lot of the work.
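
The arithmetic behind that comparison is small enough to check by hand. The sketch below subtracts the uniform-harness score from the native-harness score for each model that appears in both columns of the table. Attributing the whole difference to the harness follows the reading above; the two leaderboards may also differ in run settings not covered here.

```python
# Scores copied from the table above, in percent: (native harness, Terminus 2).
scores = {
    "Claude Fable 5": (88.0, 80.52),
    "GPT-5.5": (83.4, 76.40),
    "Claude Opus 4.8": (82.7, 71.91),
}


def harness_gap(native: float, uniform: float) -> float:
    """Points gained by running in the native harness instead of Terminus 2."""
    return round(native - uniform, 2)


gaps = {model: harness_gap(*pair) for model, pair in scores.items()}

# Margin between the top two usable tools on the native-harness leaderboard.
top_two_margin = round(83.4 - 82.7, 2)

for model, gap in gaps.items():
    ratio = gap / top_two_margin
    print(f"{model}: {gap:.2f} points ({ratio:.1f}x the top-two margin)")
```

GPT-5.5’s gap works out to 7.00 points, Fable 5’s to 7.48, and Opus 4.8’s to 10.79, so every model in the table moves by at least ten times the 0.7-point margin separating the top two tools.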

This also explains why you should distrust any single score quoted without its harness. “GPT-5.5 gets 83.4% on Terminal-Bench 2.1” is true, and “GPT-5.5 gets 76.40% on Terminal-Bench 2.1” is also true. Both describe the same model on the same benchmark version. Always ask which harness produced the number.

What Terminal-Bench actually tests (and what it doesn’t)

Terminal-Bench measures an agent driving a real terminal to finish a task: edit files, run shell commands, read the failures, fix them, repeat until the task passes. Version 2.1 is harder than 2.0 — the tasks are longer and more sequential — so scores are not comparable across versions. A model’s 2.0 number tells you nothing about its 2.1 standing.

What it captures well: sequential, multi-step work where one wrong command derails the next three. That’s closer to real agentic coding than SWE-bench’s one-shot bug fixes, which is why Terminal-Bench has become the headline benchmark for CLI agents in 2026.

What it doesn’t capture: tool latency, session memory across hours of work, IDE integration, how easy the diffs are to review, or how often the agent confidently does the wrong thing without flagging it. A 0.7-point benchmark lead says nothing about whether you’ll enjoy using the tool for eight hours. Those qualities decide daily satisfaction, and no leaderboard scores them.

Cost: where the real decision lives

With the top two tools tied on capability, price and fit decide it. Here’s the verified API pricing as of June 28, 2026:

Model	Input / M	Output / M	Notes
GPT-5.5	$5	$30	Cached input $0.50/M; batch & flex 50% off → $2.50 / $15
Claude Opus 4.8	$5	$25	Standard tier; “fast” tier is $10 / $50
Claude Fable 5	$10	$50	Suspended — not purchasable
GPT-5.5 Pro	$30	$180	Heavy-reasoning variant

Opus 4.8 is cheaper on output ($25 vs $30/M), which dominates the bill for agentic work that generates a lot of code and tool calls. GPT-5.5 claws some of that back with a low $0.50/M cached-input rate, which helps long sessions that re-send the same context repeatedly. For most real workloads the two land within a few dollars of each other per heavy session.
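
To see how output price and cached input pull in opposite directions, here is a minimal cost sketch using the rates from the table. The token counts and the 75% cache-hit share are invented for illustration, not measurements. The table lists no cached-input rate for Opus 4.8, so the sketch bills all of its input at the standard rate; that is an assumption of the sketch, not a statement about Anthropic’s pricing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    """Per-million-token rates in USD, taken from the pricing table above."""
    input_per_m: float
    output_per_m: float
    # None means the table gives no cached rate; standard input rate is used.
    cached_input_per_m: Optional[float] = None


def session_cost(p: Pricing, input_tokens: int, output_tokens: int,
                 cached_fraction: float = 0.0) -> float:
    """Cost in USD for one session; cached_fraction is the share of input billed as cached."""
    cached_rate = p.cached_input_per_m if p.cached_input_per_m is not None else p.input_per_m
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    total = fresh * p.input_per_m + cached * cached_rate + output_tokens * p.output_per_m
    return total / 1_000_000


GPT_55 = Pricing(input_per_m=5.0, output_per_m=30.0, cached_input_per_m=0.50)
OPUS_48 = Pricing(input_per_m=5.0, output_per_m=25.0)

# HYPOTHETICAL heavy session: 2M input tokens, 300k output tokens.
IN_TOKENS, OUT_TOKENS = 2_000_000, 300_000

gpt_uncached = session_cost(GPT_55, IN_TOKENS, OUT_TOKENS)
gpt_cached = session_cost(GPT_55, IN_TOKENS, OUT_TOKENS, cached_fraction=0.75)
opus = session_cost(OPUS_48, IN_TOKENS, OUT_TOKENS)

print(f"GPT-5.5, no caching:  ${gpt_uncached:.2f}")
print(f"GPT-5.5, 75% cached:  ${gpt_cached:.2f}")
print(f"Opus 4.8, standard:   ${opus:.2f}")
```

With these made-up numbers, GPT-5.5 costs $19.00 without caching and $12.25 with 75% of input cached, against $17.50 for Opus 4.8. Which tool comes out ahead depends on how much context your sessions re-send, so it is worth plugging in token counts from your own usage logs.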

If you’d rather pay a flat rate than metered usage, both tools have subscription paths: Codex CLI runs on ChatGPT Plus ($20/mo), and Claude Code starts with the $20/mo Pro plan. For the break-even math between flat and metered plans, see our Claude Code vs Codex CLI comparison and the 7-way agent comparison.

The lesson that outlasts this month’s scores

The benchmark order will shuffle. Fable 5 may come back. GPT-5.6 is already shipping. Six months from now these exact numbers are history.

What won’t change is the structural lesson Fable 5’s suspension just taught: a cloud model can vanish overnight by government order, in this case just three days after launch. Developers who built a workflow around Fable 5 in its first week lost it before that week was over. If your stack has a single point of failure (one model, one vendor, one API), you’re one directive away from a bad afternoon.

The defensive move is the same one that protects you from price hikes and rate-limit changes: keep your tooling model-agnostic. Codex CLI and Claude Code both let you fail over. And the only backend no directive can switch off is one running on your own hardware — a local model via OpenCode + Ollama costs nothing per token and answers to no export order. It won’t top Terminal-Bench, but it’s still there on the day the leaderboard champion gets pulled. If you’re spec’ing a machine for that fallback, runaihome.com’s local-AI hardware guides cover what VRAM each model class needs.
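
As a rough illustration of what “model-agnostic” can mean in your own scripts, the sketch below tries a list of backends in order and returns the first reply that succeeds. Everything in it is hypothetical: the `Backend` type, `complete_with_failover`, and the two stand-in backends are invented for this example and do not correspond to any Codex CLI, Claude Code, OpenCode, or Ollama API. Real backends would wrap a vendor SDK or a local server behind the same call shape.

```python
from typing import Callable, Sequence

# HYPOTHETICAL: a backend is any callable that takes a prompt and returns text.
Backend = Callable[[str], str]


class AllBackendsFailed(RuntimeError):
    """Raised when every backend in the chain raised an error."""


def complete_with_failover(
    prompt: str, chain: Sequence[tuple[str, Backend]]
) -> tuple[str, str]:
    """Try each (name, backend) in order; return (name, reply) from the first success."""
    errors: list[str] = []
    for name, backend in chain:
        try:
            return name, backend(prompt)
        except Exception as exc:  # broad on purpose: any failure means "try the next one"
            errors.append(f"{name}: {exc}")
    raise AllBackendsFailed("; ".join(errors))


# Stand-in backends for illustration only.
def suspended_cloud_model(prompt: str) -> str:
    raise ConnectionError("model suspended")


def local_model(prompt: str) -> str:
    return f"[local] {prompt}"


used, reply = complete_with_failover(
    "fix the failing build",
    [("cloud-primary", suspended_cloud_model), ("local-fallback", local_model)],
)
print(used, reply)
```

The point of the design is that the calling code never names a vendor. Ordering the chain from most capable to most dependable means a suspension costs you quality for a while, not the whole workflow.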

For the full breakdown of the Fable 5 shutdown and how to build a resilient fallback chain, see our coding-stack resilience guide.

The verdict

Pick Codex CLI on GPT-5.5 if you want the highest usable Terminal-Bench score, a large context window, and an autonomous, sandboxed agent loop for hands-off runs. Pick Claude Code on Opus 4.8 if you want lower output cost, a plan-first interactive loop, and the better tier ladder. The 0.7-point benchmark gap should not be the deciding factor — your workflow shape and your bill should be.

Do not pick a tool because Fable 5 scored 88%. You can’t use it. And the next time a benchmark crowns a clear winner, check that you can actually buy the winner before you rearchitect anything around it.

FAQ

What’s the difference between the tbench.ai and vals.ai Terminal-Bench scores? tbench.ai runs each agent in its own native harness (a tool comparison); vals.ai runs every model through the same Terminus 2 harness (a model comparison). The same model can score 7 to 11 points apart between them: GPT-5.5 drops 7.0 points and Opus 4.8 drops 10.8. Neither is “wrong”; they answer different questions.

Is Claude Fable 5 available right now? No. It was suspended on June 12, 2026 under a US export-control directive. As of June 27, only Mythos 5 had been partially restored, and only for specific US critical-infrastructure organizations. Fable 5 has no confirmed return date.

Should I switch from Claude Code to Codex CLI for the 0.7-point lead? No. A sub-one-point gap on Terminal-Bench 2.1 is within run-to-run noise. Choose on cost, context window, agent-loop style, and platform support instead.

Why is Terminal-Bench 2.1 better than SWE-bench for judging coding agents? It measures sequential, multi-step terminal work — install, run, read failures, fix, repeat — which matches how agentic tools actually operate. SWE-bench leans toward one-shot bug fixes. They’re complementary, but 2.1 is closer to daily agent behavior.

Are 2.0 and 2.1 scores comparable? No. Version 2.1 uses harder, longer tasks. A model’s 2.0 score tells you nothing about its 2.1 ranking; only compare within the same version.

Last updated June 28, 2026. Pricing and features change frequently; verify current state before purchasing.
