Devstral 2 + Mistral Vibe CLI review 2026: open-source coding agent on a single RTX 4090


TL;DR: Devstral Small 2 (24B) is the most capable open-weight coding model you can run on a single consumer GPU — 68% SWE-bench Verified on a $700 RTX 4090, zero API fees. The full 123B model needs four H100s, but at $0.40/$2.00 per million tokens it undercuts Claude Sonnet by 7x. Mistral Vibe CLI wraps both with a free, open-source terminal agent that now defaults to the even stronger Medium 3.5. If you want Claude Code-level capability without the Anthropic lock-in, the Vibe + Devstral stack is the first open alternative worth taking seriously.

| | Devstral Small 2 (local) | Devstral 2 via API | Claude Sonnet 4.5 via API |
| Best for | Privacy, zero ongoing cost, experimentation | Cost-sensitive teams at scale | Highest single-task accuracy |
| Price / Cost | ~$700 GPU, then free | $0.40 in / $2.00 out per 1M tokens | $3.00 in / $15.00 out per 1M tokens |
| SWE-bench Verified | 68.0% | 72.2% | 77.2% |
| The catch | Needs RTX 4090 (24GB VRAM); 256k context requires multi-GPU | Cloud-only; free tier ends | Highest per-token cost in class |

Honest take: If you already own an RTX 4090, run Devstral Small 2 locally: the cost and privacy math wins. If you don’t, pay Mistral API rates instead of Anthropic’s and accept a 5-point accuracy trade-off. Only reach for Claude Sonnet when the task genuinely needs those last 5 points.


Two models, one confusing name

Mistral released Devstral 2 on December 9, 2025 as a two-model family:

  • Devstral 2 (123B parameters): the full flagship. Scores 72.2% on SWE-bench Verified. Ships under a modified MIT license with a commercial revenue cap of $20M/month. API-only for most users.
  • Devstral Small 2 (24B parameters): the local-friendly version. Scores 68.0% on SWE-bench Verified. Apache 2.0: genuinely permissive, no revenue cap. Fits on a single RTX 4090 when quantized.

Both carry a 256k token context window, which is the largest available in any open-weight coding model in this class. That context depth matters for agentic work: you can feed Devstral an entire Python package or a multi-file refactor without truncating anything.

The naming is a minor mess. “Devstral 2” usually refers to the 123B model, but search results, documentation, and blog posts use it interchangeably for both. When you’re reading benchmark comparisons, always check the parameter count.


What 72.2% SWE-bench actually means

SWE-bench Verified tests a model’s ability to fix real GitHub issues from open-source Python repositories: not toy prompts, but actual PR-quality patches. The benchmark has 500 human-validated tasks, so at 72.2% Devstral 2 resolves roughly 361 of them. Claude Sonnet 4.5 resolves about 386 (77.2%), meaning Devstral 2 misses about 25 more issues.

That 5-point gap is small enough to be irrelevant in most production workflows. The majority of real-world tasks — adding a feature, fixing a TypeError, writing a test suite — sit well below SWE-bench difficulty. Where the gap shows up is in the hardest 10%: legacy codebases with implicit invariants, multi-file refactors that require reasoning across 50 files, or bugs caused by subtle state management.

Devstral Small 2’s 68.0% is more interesting than it looks. No other open-weight model of its size comes close. It’s 4.2 points below the full Devstral 2, which works out to roughly 21 fewer solved issues out of 500, at 24 billion parameters instead of 123 billion. If you’re running it for free on local hardware, that trade-off is easy to accept: you’re comparing $0/month versus whatever the next tier costs.
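To make the percentages concrete, here is a small sketch that converts each pass rate into an approximate solved-task count. It assumes the 500-task size of SWE-bench Verified; the scores are the ones quoted in this review, and rounding means the counts are approximate.

```python
# Convert SWE-bench Verified pass rates into approximate solved-task counts.
# Assumption: the benchmark contains 500 tasks.
TOTAL_TASKS = 500

# Pass rates quoted in this review, in percent.
SCORES = {
    "Devstral Small 2": 68.0,
    "Devstral 2": 72.2,
    "Claude Sonnet 4.5": 77.2,
}


def solved(score_pct: float, total: int = TOTAL_TASKS) -> int:
    """Approximate number of tasks resolved at a given pass rate."""
    return round(score_pct / 100 * total)


counts = {name: solved(pct) for name, pct in SCORES.items()}

# Gaps between neighbouring models, in tasks rather than percentage points.
gap_sonnet_vs_devstral = counts["Claude Sonnet 4.5"] - counts["Devstral 2"]
gap_devstral_vs_small = counts["Devstral 2"] - counts["Devstral Small 2"]

for name, n in counts.items():
    print(f"{name}: ~{n} of {TOTAL_TASKS}")
```

Expressed as task counts, the gaps are easier to weigh against what each model costs to run.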

Mistral’s cost-efficiency claim, “7x cheaper than Claude Sonnet,” checks out on paper: $15.00 versus $2.00 per million output tokens is a 7.5x ratio. At 100 million output tokens per day (a realistic high-volume enterprise figure), Devstral 2 costs $200/day versus Claude Sonnet’s $1,500. That $1,300/day difference comes to about $474,500 a year. The savings only become meaningful at scale, but the math is real.
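The arithmetic behind that comparison can be reproduced in a few lines. This is a sketch using only the list prices and daily volume quoted here; it ignores input tokens, caching discounts, and any promotional pricing.

```python
# Daily and annual output-token cost at the list prices quoted in this review.
TOKENS_PER_DAY = 100_000_000  # 100M output tokens/day, the example volume

PRICE_PER_M_OUTPUT = {  # USD per 1M output tokens
    "Devstral 2": 2.00,
    "Claude Sonnet 4.5": 15.00,
}


def daily_cost(price_per_million: float, tokens: int = TOKENS_PER_DAY) -> float:
    """Cost in USD of generating `tokens` output tokens in one day."""
    return tokens / 1_000_000 * price_per_million


devstral_daily = daily_cost(PRICE_PER_M_OUTPUT["Devstral 2"])
sonnet_daily = daily_cost(PRICE_PER_M_OUTPUT["Claude Sonnet 4.5"])

price_ratio = sonnet_daily / devstral_daily
annual_savings = (sonnet_daily - devstral_daily) * 365

print(f"Devstral 2: ${devstral_daily:,.0f}/day")
print(f"Claude Sonnet 4.5: ${sonnet_daily:,.0f}/day")
print(f"Ratio: {price_ratio:.1f}x, annual savings: ${annual_savings:,.0f}")
```

Changing TOKENS_PER_DAY shows how quickly the gap shrinks at lower volumes.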


Running Devstral Small 2 on an RTX 4090

The headline claim is true, with one condition: you need Q4_K_M quantization, not 8-bit or full precision.

At Q4_K_M, the 24B model loads into roughly 15GB of VRAM, comfortably inside an RTX 4090’s 24GB. At Q8 (8-bit quantization, which is still not full precision), the requirement climbs to ~26GB, which exceeds the 4090’s capacity. The practical path is Q4_K_M via Ollama:

ollama pull devstral-small-2

Ollama 0.13.3 or newer is required; the current documented baseline is 0.23.4 (May 13, 2026). The 4-bit model occupies about 15GB on disk, and you’ll want at least 32GB of system RAM for context headroom when approaching the 256k limit.
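Once the model is pulled, it can be queried from a script through Ollama's local HTTP API. The sketch below assumes a server running on Ollama's default port (11434) and reuses the model tag from the pull command. The 8192-token num_ctx default is an arbitrary choice for illustration, not a recommendation from Mistral or Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "devstral-small-2"  # tag taken from the pull command above


def build_request(prompt: str, num_ctx: int = 8192) -> urllib.request.Request:
    """Build a non-streaming generate request; num_ctx sets the context size."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, num_ctx: int = 8192) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, num_ctx), timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("Write a function that reverses a string.") returns the model's reply as a string. Raising num_ctx increases memory use, so larger values are where the RAM and VRAM headroom starts to matter.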

Mac users can run the same Q4_K_M model on an Apple Silicon Mac with at least 32GB of unified memory, such as a MacBook Pro with an M4 Max. Inference speed is roughly equivalent to an RTX 4090 at this quantization level: both fall in the 20–35 tokens/second range depending on context length.

For Q6_K or better at full 256k context, you need 35GB+ VRAM — that means an RTX A6000, 6000 Ada, or a 64GB+ Mac. Worth it for long agentic sessions; overkill for interactive chat.
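The VRAM figures in this section follow from a simple rule of thumb: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below applies it to the 24B model. The bits-per-weight values are approximate effective rates for GGUF quantizations and the 2GB overhead allowance is a placeholder; both are assumptions for illustration, not numbers from Mistral or Ollama.

```python
# Rough weight-memory estimate: params * bits_per_weight / 8.
# Assumed approximate effective bits per weight for GGUF quantization types.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}


def weight_gb(params_billions: float, quant: str) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def fits(params_billions: float, quant: str, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if weights plus a fixed allowance for cache and buffers fit in VRAM.

    overhead_gb is a hypothetical short-context allowance; the KV cache grows
    with context length, so long contexts need far more than this.
    """
    return weight_gb(params_billions, quant) + overhead_gb <= vram_gb


for quant in BITS_PER_WEIGHT:
    size = weight_gb(24, quant)
    verdict = "fits" if fits(24, quant, vram_gb=24) else "does not fit"
    print(f"{quant}: ~{size:.1f} GB of weights, {verdict} in 24 GB")
```

The estimate covers weights only, which is why the full 256k context pushes the requirement well past what the weight sizes alone suggest.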

For a deep comparison of which hardware tiers actually make sense for local AI coding inference, see the Cursor + Local Llama hardware tiers breakdown and the full local LLM hardware guide at runaihome.com.


Running the full 123B via API

The full Devstral 2 requires four H100-class GPUs at FP8 precision — roughly 130GB of VRAM. That’s a $15,000+/month cloud bill if you’re renting dedicated capacity. Most developers will not self-host it.

Via the Mistral API, pricing at the time of writing is:

  • Devstral 2 (123B): $0.40/M input, $2.00/M output
  • Devstral Small 2 (24B): $0.10/M input, $0.30/M output

The 123B model is currently free during its promotional launch window. Once paid pricing kicks in, the value proposition remains strong for teams doing high-volume automated tasks (CI agents, PR review pipelines, batch refactors) where per-task accuracy differences matter less than cost-per-thousand-tasks.

The 24B is almost suspiciously cheap at $0.10/M input. For interactive coding sessions where input tokens dominate (you’re feeding large context), it is a more sensible API choice than the 123B unless you explicitly need the accuracy bump.
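To see how much input-heavy sessions favor the 24B, here is a sketch that prices one session on both models at the rates listed above. The token counts in the example (2M in, 100k out) are hypothetical and chosen only to represent a context-heavy agentic session.

```python
# USD per 1M tokens as (input, output), from the price list above.
PRICES = {
    "Devstral 2 (123B)": (0.40, 2.00),
    "Devstral Small 2 (24B)": (0.10, 0.30),
}


def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one session at the listed per-million-token rates."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical input-heavy session: 2M tokens of context in, 100k tokens out.
INPUT_TOKENS = 2_000_000
OUTPUT_TOKENS = 100_000

costs = {m: session_cost(m, INPUT_TOKENS, OUTPUT_TOKENS) for m in PRICES}

for model, cost in costs.items():
    print(f"{model}: ${cost:.2f} per session")
```

Under these assumed volumes the session costs about $1.00 on the 123B and about $0.23 on the 24B, so the smaller model is roughly four times cheaper before accuracy enters the picture.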


Mistral Vibe CLI: what it is and how to get it running

Vibe CLI is Mistral’s open-source terminal coding agent, distributed under Apache 2.0. The current version is v2.13.0, released May 29, 2026. Think Claude Code or OpenAI Codex CLI, but with Mistral models and full self-hosting support.

Installation (Linux/macOS):

curl -LsSf https://mistral.ai/vibe/install.sh | bash

Or via pip (Python 3.12+ required):

pip install mistral-vibe

The install exposes two commands: vibe for interactive sessions and vibe-acp for automated/CI pipelines. Both require a MISTRAL_API_KEY environment variable pointing to a Mistral platform API key.
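Before wiring Vibe into a script or pipeline, a quick preflight check can confirm that both commands and the API key are in place. This is a minimal sketch: the preflight helper is hypothetical and not part of Vibe, and it only checks that the commands are on PATH and the variable is non-empty, not that the key is valid.

```python
import os
import shutil

REQUIRED_COMMANDS = ("vibe", "vibe-acp")  # commands the install is said to expose
API_KEY_VAR = "MISTRAL_API_KEY"


def preflight(env=None, which=shutil.which) -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    for cmd in REQUIRED_COMMANDS:
        if which(cmd) is None:
            problems.append(f"{cmd} not found on PATH")
    if not env.get(API_KEY_VAR, "").strip():
        problems.append(f"{API_KEY_VAR} is not set")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("Ready to run Vibe" if not issues else "\n".join(issues))
```

The env and which parameters exist so the check can be exercised without touching the real environment.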

What Vibe can do out of the box:

  • File reads, writes, and diffs across your working tree
  • Shell command execution with approval gating
  • Git operations (status, diff, commit, branch management)
  • Code search (ripgrep-based)
  • Session continuation — pick up exactly where you left off
  • MCP (Model Context Protocol) server connections via HTTP, streamable-HTTP, or stdio
  • Slash commands with autocompletion, plus user-defined skills
  • Built-in agent modes: default, plan, accept-edits, auto-approve
  • Subagents for multi-step task delegation
  • Voice mode (experimental)

The four agent modes matter in practice. Beyond default, three are worth knowing. plan makes Vibe output a reasoning trace before touching any file, which is useful when you don’t trust the model’s first instinct on a large refactor. accept-edits auto-applies file changes but pauses for shell commands. auto-approve is full automation, intended for CI/CD contexts where you’ve already validated the task scope.
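The approval behavior described above can be modeled as a small decision function. This is a toy model of the described behavior, not Vibe's implementation: the Mode and Action types are hypothetical, and the handling of default and plan (always ask) is an assumption, since only accept-edits and auto-approve are spelled out here.

```python
from enum import Enum


class Mode(Enum):
    DEFAULT = "default"
    PLAN = "plan"
    ACCEPT_EDITS = "accept-edits"
    AUTO_APPROVE = "auto-approve"


class Action(Enum):
    FILE_EDIT = "file_edit"
    SHELL = "shell"


def needs_approval(mode: Mode, action: Action) -> bool:
    """Whether a human must confirm the action under the given mode."""
    if mode is Mode.AUTO_APPROVE:
        return False  # full automation: nothing pauses
    if mode is Mode.ACCEPT_EDITS:
        return action is Action.SHELL  # edits auto-apply, shell commands pause
    # Assumption: default and plan both ask before any action.
    return True


for mode in Mode:
    gated = [a.value for a in Action if needs_approval(mode, a)]
    print(f"{mode.value}: asks before {gated or 'nothing'}")
```

Laying the modes out this way makes it clear that accept-edits is the middle ground: file changes flow through, but anything that executes still waits for a person.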

MCP support in v2.13.0 is solid. The configuration follows a TOML format: you declare server name, transport type, and connection details in ~/.config/vibe/config.toml. Tools are scoped as {server_name}_{tool_name}, and permission levels can be set per server. For a database-connected coding workflow (MCP Postgres server + Vibe), the setup is about 10 lines of config.
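The {server_name}_{tool_name} scoping is easy to reason about with a short sketch. The helpers below are hypothetical illustrations of the naming scheme and of a per-server permission lookup; the permission level names ("ask", "always", "never") are placeholders and may not match Vibe's actual values.

```python
def scoped_tool_name(server_name: str, tool_name: str) -> str:
    """Build the scoped name for a tool exposed by an MCP server."""
    return f"{server_name}_{tool_name}"


def permission_for(scoped: str, levels: dict[str, str], default: str = "ask") -> str:
    """Look up the permission level of the server that owns a scoped tool name.

    The longest matching server prefix wins, so a server named "pg" does not
    shadow one named "pg_prod".
    """
    matches = [server for server in levels if scoped.startswith(server + "_")]
    if not matches:
        return default
    return levels[max(matches, key=len)]


# Hypothetical per-server permission levels.
LEVELS = {"postgres": "always", "pg": "ask", "pg_prod": "never"}

print(scoped_tool_name("postgres", "query"))
print(permission_for("pg_prod_drop_table", LEVELS))
```

Because the separator is an underscore, server names that are prefixes of one another can collide, which is why the lookup prefers the longest match.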


May 2026 update: Vibe defaults to Mistral Medium 3.5

On May 2, 2026, Mistral replaced Devstral 2 as the default model in Vibe CLI with Mistral Medium 3.5. This matters more than a typical minor update.

Medium 3.5 is a dense 128B model with a 256k context window that scores 77.6% on SWE-bench Verified — 5.4 points above Devstral 2 and 0.4 points above Claude Sonnet 4.5’s 77.2%. It handles instruction-following, reasoning, and coding in a single set of weights rather than relying on a specialized coding model.

The other addition is remote cloud agents. You can now offload a Vibe task to Mistral’s cloud, let it run in the background, and receive a notification when it’s done. Local sessions can be “teleported” to the cloud mid-task — your session history, tool approvals, and task state carry across. This is a direct answer to Claude Code’s async agent runs and OpenAI Codex CLI’s auto mode.

What this means for Devstral 2: it’s still the best open-weight option for self-hosted or privacy-constrained deployments. Medium 3.5 is not open-weight — you can’t download its weights and run it locally. If your constraint is “no tokens leave my machine,” Devstral Small 2 remains the answer. If you’re happy with API inference and want the strongest model Vibe offers, Medium 3.5 is now that default.


Where Vibe beats Claude Code — and where it doesn’t

Vibe wins:

  • Open weights for the local story. Devstral Small 2 is self-hostable on commodity hardware. Claude Code requires Anthropic’s API; there’s no local path.
  • API pricing. Even at the non-promotional Devstral 2 rates, $2/M output tokens versus Claude Sonnet’s $15 is a meaningful difference at the scale where agentic pipelines run.
  • EU sovereignty / compliance. Mistral is a French company subject to EU AI Act governance. For teams with EU data residency requirements, Vibe is easier to justify legally than US-headquartered alternatives.
  • MCP configuration surface. Vibe’s TOML config is more transparent and reproducible than Claude Code’s MCP setup. Easier to version-control in a team environment. (Need a ready-to-paste Claude Code .mcp.json or Cursor mcp.json? The MCP Server Config Generator generates both.)

Claude Code wins:

  • Accuracy on hard tasks. Medium 3.5’s 77.6% now edges out Claude Sonnet, but Claude Code ships with Opus 4 for complex tasks, the highest-capability model available on any commercial coding agent. On genuinely difficult multi-repo work, the Opus 4 option is still the benchmark.
  • IDE integrations. Claude Code has first-class VS Code, JetBrains, and Zed extensions. Vibe is terminal-first; IDE integration is a community plugin story.
  • Ecosystem breadth. Cursor, MCP servers, CLAUDE.md conventions, custom slash commands — the Claude ecosystem has more tooling built around it. For a deeper breakdown, see Claude Code vs OpenAI Codex CLI 2026.

Where neither is decisive: interactive daily-driver use at the $20/month price point. At that spend, Cursor’s full product still has the better IDE integration and the tighter edit-review loop for file-by-file work. Both Vibe and Claude Code are stronger as autonomous task runners than as interactive inline suggestions.


The verdict

Devstral Small 2 on a single RTX 4090 is the clearest path to zero-cost, privacy-preserving, state-of-the-art local coding inference available in 2026. The 68% SWE-bench number is not a consolation prize — it’s the best any open-weight model achieves at 24 billion parameters, period.

Mistral Vibe CLI at v2.13.0 is good enough to stop calling it “promising.” MCP support, session continuation, four agent modes, and remote cloud agents put it on equal footing with Claude Code for terminal-native workflows. The gap is IDE integration and ecosystem depth — both fixable over 12 months.

For teams running high-volume automated coding pipelines and watching per-token costs, the Devstral 2 API is the obvious answer while the free promotional window lasts, and stays defensible at $2/M output after that.

The only reason to reach past this stack to Claude Sonnet (or Opus 4) is if the 5-point SWE-bench accuracy gap translates to actual production errors in your specific codebase — and for most workloads, it won’t.


Frequently Asked Questions

Can Devstral Small 2 really run on a single RTX 4090? Yes. At Q4_K_M quantization the 24B model fits in ~15GB of VRAM, inside the 4090’s 24GB, and at typical 4K–32K context lengths the card handles it without issue. The model supports a 256k context window, but filling it takes more memory than one 4090 offers: Q8 weights alone need ~26GB, and Q6_K or better at full 256k context needs 35GB+ of VRAM or a multi-GPU setup. The model itself scores 68.0% on SWE-bench Verified.

What is the difference between Devstral 2 and Devstral Small 2? Devstral 2 is 123B parameters, scores 72.2% on SWE-bench Verified, requires four H100-class GPUs for self-hosted inference at FP8, and ships under a modified MIT license with a commercial revenue cap. Devstral Small 2 is 24B parameters, scores 68.0%, runs on a single RTX 4090 or 32GB Mac at Q4_K_M, and is Apache 2.0 with no revenue cap.

Is Mistral Vibe CLI free? The CLI itself is free and open-source (Apache 2.0). You pay for inference: API token costs if using Mistral’s cloud, or nothing if running Devstral Small 2 locally. The Devstral 2 API is currently free during the launch promotional period; post-promotion pricing is $0.40/$2.00 per million input/output tokens.

How does Mistral Vibe compare to Claude Code in 2026? Feature parity is close for terminal-native workflows — both support MCP servers, slash commands, agent modes, and session management. Claude Code has stronger IDE integrations and access to Claude Opus 4 for maximum-difficulty tasks. Vibe has an open-weights local path, cheaper API inference, and EU sovereignty advantages. The default model switch to Medium 3.5 (77.6% SWE-bench) now puts Vibe ahead on raw benchmark accuracy for API-based use.

What happened to Devstral 2 as the default Vibe model? Mistral replaced Devstral 2 with Mistral Medium 3.5 as the default model in Vibe CLI on May 2, 2026. Medium 3.5 scores 77.6% on SWE-bench Verified (versus Devstral 2’s 72.2%) and adds remote cloud agent capabilities. Devstral 2 is still available as a selectable model; it remains the right choice when you need API inference at 7x lower cost than Claude Sonnet.


Last updated May 30, 2026. Pricing and features change frequently; verify current state before purchasing.
