Jun 19, 2026

Codestral 2 as your Cursor and Cline backend in 2026: Apache 2.0, $0.30/M tokens, 256K context, and whether it beats Gemini 3.5 Flash for daily coding

By AICoderScope Team · 11 min read


TL;DR: Codestral 2 went Apache 2.0 on April 8, 2026, which makes it the cheapest legally-clean-to-self-host coding model worth wiring into your editor. At $0.30/M input via Mistral’s API it slots into Cursor Chat, Cline, and Continue.dev in about ten minutes. Its real edge is fill-in-the-middle autocomplete, not agentic reasoning — so pick it for tab completion and privacy, not for multi-step Cline runs.

	Codestral 2	DeepSeek V4-Flash	Gemini 3.5 Flash
Best for	FIM autocomplete + self-host	Agentic Cline work, cheapest	Balanced cloud agent
Price (input / output per M)	$0.30 / $0.90	$0.14 / $0.435	$1.50 / ~$6
License	Apache 2.0 (self-host free)	MIT (self-host free)	Proprietary (API only)
Context window	256K	1M	1M
Params	22B dense	MoE (cloud)	proprietary
The catch	Weaker at multi-step agentic tasks	Thinking mode breaks Cline if left on	No self-host, no FIM endpoint

Honest take: If you want the best inline autocomplete you can legally run on your own GPU, Codestral 2 is the pick — wire it into Continue.dev’s FIM slot. If you want a chat/agent backend for Cline, DeepSeek V4-Flash is both cheaper and stronger. Don’t use Codestral 2 for heavy agent loops just because it’s open.

What actually changed in April 2026

Codestral has existed since May 2024, but the version that matters is Codestral 2, released April 8, 2026. The headline isn’t a benchmark bump — it’s the license. The original Codestral shipped under the Mistral Non-Production License, which barred commercial use in your product. Codestral 2 is Apache 2.0. That single change is why it’s worth a fresh look: you can now self-host it inside a commercial product, ship it on a private server, or run it on a workstation GPU without a lawyer in the loop.

The model itself is a 22-billion-parameter dense transformer (not a mixture-of-experts), with a 256K-token context window and support for 80+ languages. Mistral reports 86.6% on HumanEval and 91.2% on MBPP, with native fill-in-the-middle (FIM) training — the thing that makes inline autocomplete feel native rather than bolted on.

The “dense, not MoE” detail matters more than it looks. A 22B dense model has predictable VRAM and throughput. You’re not juggling 384 experts like Kimi K2.7 or a 671B sparse stack like DeepSeek’s flagship. At Q4_K_M (roughly 4.5 to 5 bits per weight) 22B parameters come to about 12–13 GB of weights, in line with the 12 GB download the Ollama pull later in this guide reports, so it fits on a single 16 GB card with room for a modest context window. (For the full 256K context you’ll need far more; that’s a server-class ask, not a laptop one. The runaihome.com local coding LLM guide has the VRAM math by GPU tier.)
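The weight footprint of a dense model can be estimated from its parameter count and the average number of bits stored per weight. The sketch below is back-of-envelope only: the 4.8 bits-per-weight figure for Q4_K_M is an assumed average, the function names are illustrative, and the KV cache and runtime overhead (the part that grows with context length) are left out. The result lands close to the 12 GB download that the Ollama pull in this guide reports.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


def headroom_gb(vram_gb: float, params_billion: float, bits_per_weight: float) -> float:
    """VRAM left over for KV cache and runtime overhead after loading weights."""
    return vram_gb - weight_footprint_gb(params_billion, bits_per_weight)


# Assumption: Q4_K_M averages about 4.8 bits per weight; FP16 is 16 bits.
q4_weights = weight_footprint_gb(22, 4.8)
fp16_weights = weight_footprint_gb(22, 16)

print(f"Q4_K_M weights: {q4_weights:.1f} GB")
print(f"FP16 weights:   {fp16_weights:.1f} GB")
print(f"Headroom on a 16 GB card at Q4_K_M: {headroom_gb(16, 22, 4.8):.1f} GB")
```

The few gigabytes of headroom are what the context window has to live in, which is why a long context pushes the model off a consumer card.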

Two ways to run it

You have two paths, and they map to different goals:

Mistral API (api.mistral.ai) — fastest, zero hardware, $0.30/M in. Use this if you just want a cheap, capable chat/edit backend and don’t care where the tokens go.
Self-hosted via Ollama or vLLM — slower on consumer hardware, but the code never leaves your machine. This is the Apache-2.0 payoff. Use it for client code under NDA or air-gapped work.

Pull the local copy first if you want to test offline:

$ ollama pull codestral
pulling manifest
pulling 0bbfda8e64c1... 100%  ▕████████████████▏  12 GB
pulling f5 db17... 100%  ▕████████████████▏  559 B
success

$ ollama run codestral "write a Python function that returns the nth Fibonacci number iteratively"
def fib(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

Tested with Ollama 0.12.x on June 19, 2026. On a single RTX 4090 the Q4_K_M build runs around 45–55 tokens/sec for short completions, which is fine for chat and edits but noticeably slower than a cloud call for long agent loops.
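To see why that throughput matters for agent loops, it helps to turn tokens per second into wall-clock time. The sketch below is simple arithmetic under stated assumptions: the 12-step loop and the 600 output tokens per step are hypothetical numbers chosen for illustration, and prompt processing time is ignored, so real runs will take longer.

```python
def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Time spent generating output tokens at a steady decode rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return output_tokens / tokens_per_sec


def loop_seconds(steps: int, tokens_per_step: int, tokens_per_sec: float) -> float:
    """Total generation time for an agent loop with equal-sized steps."""
    return generation_seconds(steps * tokens_per_step, tokens_per_sec)


# Hypothetical workload: 12 agent steps, 600 generated tokens each.
for rate in (45, 55):
    single_edit = generation_seconds(600, rate)
    full_loop = loop_seconds(12, 600, rate)
    print(f"{rate} tok/s: one edit ~{single_edit:.0f}s, 12-step loop ~{full_loop:.0f}s")
```

A single edit finishes in seconds, while the same rate stretched over a dozen steps adds up to minutes of pure generation.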

If you’re going cloud, grab a key from console.mistral.ai and smoke-test it:

$ curl -s https://api.mistral.ai/v1/chat/completions \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"codestral-latest","messages":[{"role":"user","content":"say ok"}]}' \
  | python3 -c "import sys,json;print(json.load(sys.stdin)['choices'][0]['message']['content'])"
ok

codestral-latest is the rolling alias; pin the dated version if you want reproducibility.
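The same smoke test can be written in Python with only the standard library, which is handy if you want to reuse it in a script. This is a minimal sketch: the function names are illustrative, the request mirrors the curl call above, and the model argument is where a pinned, dated model ID would go in place of the rolling alias.

```python
import json
import urllib.request

MISTRAL_CHAT_URL = "https://api.mistral.ai/v1/chat/completions"


def build_chat_request(api_key: str, prompt: str,
                       model: str = "codestral-latest") -> urllib.request.Request:
    """Build the same POST request the curl smoke test sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        MISTRAL_CHAT_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body: bytes) -> str:
    """Pull the assistant text out of a chat completion response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def smoke_test(api_key: str) -> str:
    """Send the request over the network and return the model's reply."""
    request = build_chat_request(api_key, "say ok")
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_reply(response.read())
```

Calling smoke_test(os.environ["MISTRAL_API_KEY"]) after importing os sends the request; the first two functions do no network I/O, so they can be checked offline.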

Wiring it into Cline

Cline takes any OpenAI-compatible endpoint, so the Mistral API drops straight in.

1. Open the Cline panel → Settings (gear icon).
2. API Provider: choose OpenAI Compatible.
3. Base URL: https://api.mistral.ai/v1
4. API Key: your Mistral key.
5. Model ID: codestral-latest
6. Save, then start a task.

That’s the whole setup. Where it gets interesting is what to use it for. Codestral 2 is a code-specialist, not a generalist agent. On a single “edit this function” task it’s excellent. On a 12-step Cline plan — read three files, run a test, parse the failure, patch, re-run — it loses the thread sooner than DeepSeek V4-Flash or Gemini 3.5 Flash. If your Cline workflow is mostly “apply this focused change,” Codestral 2 is great and cheap. If it’s “figure out why the integration test flakes and fix it,” reach for DeepSeek V4-Flash instead.

One practical note: unlike DeepSeek V4-Flash, Codestral 2 has no separate “thinking mode” to disable, so you skip the tool-call loop trap that bites Cline users on reasoning models. It just answers.

Wiring it into Cursor (and the Tab caveat)

Cursor lets you override the OpenAI base URL, which routes Chat and Cmd-K through Codestral 2:

1. Settings → Models.
2. Scroll to OpenAI API Key, expand the override.
3. Base URL: https://api.mistral.ai/v1
4. Paste your Mistral key, click Verify.
5. Add a custom model named codestral-latest and enable it.

Here’s the catch every Cursor power user hits: the custom endpoint powers Chat and Cmd-K, but not Tab. Cursor’s Tab autocomplete runs on Cursor’s own proprietary models and cannot be repointed at an external API. So routing Cursor through Codestral 2 gets you a cheaper chat/edit backend, but your inline gray-text completion is still Cursor’s. This is the same limitation that applies to every external backend in Cursor — see the Cursor + Ollama setup guide for the full breakdown.

That limitation is exactly why, if autocomplete is what you care about, Continue.dev is the better host for Codestral 2 — because Continue can use the dedicated FIM endpoint.

Continue.dev: the FIM setup, and the bug that quietly breaks it

This is where Codestral 2 earns its keep. Continue.dev lets you assign a model to the autocomplete role and point it at Mistral’s dedicated FIM endpoint, which is a different host from the chat API:

FIM completions  →  https://codestral.mistral.ai/v1/fim/completions
Chat completions →  https://api.mistral.ai/v1/chat/completions

In your Continue config (~/.continue/config.yaml in the current YAML format), the autocomplete model looks like this:

models:
  - name: Codestral FIM
    provider: mistral
    model: codestral-latest
    apiKey: YOUR_MISTRAL_KEY
    apiBase: https://codestral.mistral.ai/v1
    roles:
      - autocomplete
    autocompleteOptions:
      maxPromptTokens: 1024
      debounceDelay: 250
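
A quick sanity check on that config can catch the two settings that matter before you start typing. The sketch below is not part of Continue: the function name is hypothetical, and the config is written as a Python dict (what a YAML parser would hand back) to avoid depending on a YAML library.

```python
EXPECTED_PROVIDER = "mistral"
EXPECTED_API_BASE = "https://codestral.mistral.ai/v1"


def autocomplete_problems(config: dict) -> list[str]:
    """Return human-readable problems with the autocomplete model entries."""
    models = [m for m in config.get("models", [])
              if "autocomplete" in m.get("roles", [])]
    if not models:
        return ["no model has the autocomplete role"]

    problems = []
    for model in models:
        name = model.get("name", "<unnamed>")
        provider = model.get("provider")
        api_base = (model.get("apiBase") or "").rstrip("/")
        if provider != EXPECTED_PROVIDER:
            problems.append(f"{name}: provider is {provider!r}, expected 'mistral'")
        if api_base != EXPECTED_API_BASE:
            problems.append(f"{name}: apiBase is {api_base!r}, expected the codestral host")
    return problems


# The config from this article, as a parsed dict.
config = {
    "models": [{
        "name": "Codestral FIM",
        "provider": "mistral",
        "model": "codestral-latest",
        "apiKey": "YOUR_MISTRAL_KEY",
        "apiBase": "https://codestral.mistral.ai/v1",
        "roles": ["autocomplete"],
        "autocompleteOptions": {"maxPromptTokens": 1024, "debounceDelay": 250},
    }]
}
print(autocomplete_problems(config) or "autocomplete config looks right")
```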

The problem: completions feel dumb and slow

Here’s the real-world snag. Several Continue users (tracked in continuedev/continue issue #7178) found that autocomplete was hitting …/v1/chat/completions instead of …/v1/fim/completions. The symptoms: completions arrive late, ignore the code after your cursor, and sometimes spit out a markdown code fence into your editor. That’s the chat endpoint pretending to do autocomplete — it only sees the prefix, never the suffix, so it can’t do true fill-in-the-middle.

The fix

Two things, in order:

1. Set provider: mistral and apiBase: https://codestral.mistral.ai/v1 explicitly on the autocomplete model. The mistral provider knows to call the FIM route; a generic openai provider will default to chat. If you’d configured it as an OpenAI-compatible model pointed at api.mistral.ai, that’s your bug.
2. Confirm it. Watch the Continue output channel (View → Output → “Continue”) while you type. You should see requests to /fim/completions. If you still see /chat/completions, your provider is wrong, not your key.
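
If you’d rather not eyeball the output channel, you can paste a chunk of it into a small script that counts which route shows up. This is a sketch under assumptions: the sample log lines are hypothetical (real Continue output is formatted differently), only the two URL paths are matched, and the captured text is assumed to cover typing only, since chat requests made from the sidebar would also show /chat/completions.

```python
def count_routes(log_text: str) -> dict[str, int]:
    """Count log lines that mention the FIM route versus the chat route."""
    counts = {"fim": 0, "chat": 0}
    for line in log_text.splitlines():
        if "/fim/completions" in line:
            counts["fim"] += 1
        elif "/chat/completions" in line:
            counts["chat"] += 1
    return counts


def verdict(counts: dict[str, int]) -> str:
    """Translate route counts into the diagnosis described above."""
    if counts["chat"] > 0:
        return "autocomplete is hitting the chat route: check provider and apiBase"
    if counts["fim"] > 0:
        return "autocomplete is using the FIM route"
    return "no completion requests found in this log"


# Hypothetical log excerpt captured while typing.
sample = """\
POST https://codestral.mistral.ai/v1/fim/completions 200
POST https://codestral.mistral.ai/v1/fim/completions 200
"""
print(verdict(count_routes(sample)))
```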

Once FIM is actually firing, suffix-aware completion is the difference-maker: type the opening of a function with a closing brace already below it, and Codestral 2 fills the body to match what comes after, not just what came before. That’s the capability Gemini 3.5 Flash and most chat models simply don’t expose — there’s no FIM endpoint to call.
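
To make the prefix/suffix idea concrete, here is a sketch of what a FIM request carries compared with a chat request. The endpoint is the one quoted above; the payload field names (prompt, suffix, max_tokens) reflect my understanding of Mistral’s FIM API and should be checked against the official reference, and the helper names are illustrative. Nothing is sent over the network here.

```python
import json
import urllib.request

FIM_URL = "https://codestral.mistral.ai/v1/fim/completions"


def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split an editor buffer into the text before and after the cursor."""
    return source[:cursor], source[cursor:]


def build_fim_request(api_key: str, prefix: str, suffix: str,
                      model: str = "codestral-latest",
                      max_tokens: int = 64) -> urllib.request.Request:
    """Build a fill-in-the-middle request carrying both sides of the cursor."""
    payload = {
        "model": model,
        "prompt": prefix,   # code before the cursor
        "suffix": suffix,   # code after the cursor; a chat request has no slot for it
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        FIM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# A function with an empty body and a call site already written below it.
buffer = "def area(w, h):\n    \n\nprint(area(2, 3))\n"
cursor = buffer.index("    \n") + 4  # cursor sits at the end of the indented line
prefix, suffix = split_at_cursor(buffer, cursor)
request = build_fim_request("YOUR_MISTRAL_KEY", prefix, suffix)
print(json.loads(request.data)["suffix"])
```

Because the call site in the suffix is part of the request, the model can see how area is used when it writes the body. Passing the request to urllib.request.urlopen would send it.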

Price and quality, side by side

The decision usually comes down to three models in this price band. Here’s the verified picture as of June 19, 2026:

Model	Input $/M	Output $/M	Context	License	FIM endpoint	Self-host
Codestral 2	$0.30	$0.90	256K	Apache 2.0	✅ Yes	✅ Yes
DeepSeek V4-Flash	$0.14	$0.435	1M	MIT	❌ No	✅ Yes
Gemini 3.5 Flash	$1.50	~$6	1M	Proprietary	❌ No	❌ No

DeepSeek V4-Flash is cheaper per token and stronger on agentic, multi-step coding — it scores within a point of Claude Sonnet on SWE-bench Verified. Gemini 3.5 Flash brings 76.2% on Terminal-Bench and a 1M context, but you pay 5× Codestral’s input rate and you can’t run it locally. (Both have full setup guides: DeepSeek V4-Flash and Gemini 3.5 Flash.)
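
Per-token prices are easier to compare as a monthly bill. The sketch below uses the prices from the table; the workload of 50M input and 10M output tokens per month is a hypothetical figure, and Gemini’s output price is entered as $6.00 even though the table lists it as approximate.

```python
# (input $/M tokens, output $/M tokens), taken from the table above.
PRICES = {
    "Codestral 2": (0.30, 0.90),
    "DeepSeek V4-Flash": (0.14, 0.435),
    "Gemini 3.5 Flash": (1.50, 6.00),  # output price is approximate
}


def monthly_cost(model: str, input_m_tokens: float, output_m_tokens: float) -> float:
    """Dollar cost for a month, with token volumes given in millions."""
    input_price, output_price = PRICES[model]
    return input_m_tokens * input_price + output_m_tokens * output_price


# Hypothetical workload: 50M input tokens and 10M output tokens per month.
for name in PRICES:
    print(f"{name}: ${monthly_cost(name, 50, 10):.2f}")
```

At that volume the three bills come to roughly $24, $11, and $135, so the gap between Codestral 2 and DeepSeek V4-Flash is small in absolute terms compared with the jump to Gemini 3.5 Flash.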

So why pick Codestral 2 at all? Two reasons the table makes obvious:

It’s the only one with a real FIM endpoint. For inline autocomplete in Continue.dev, that’s not a nice-to-have — it’s the entire feature. A 1M-context chat model can’t fill in the middle through a chat API.
It’s a 22B dense Apache-2.0 model, which means you can self-host it on one consumer GPU and legally ship it. DeepSeek V4-Flash is MIT but its full weights are far heavier to run; Gemini can’t be self-hosted at all.

Who should actually use it

Pick Codestral 2 if you are: a developer who lives in inline autocomplete and wants it pointed at an open model via Continue.dev; a team that needs code completion on-prem for compliance and wants Apache-2.0 freedom; or anyone running a 16 GB GPU who wants a capable local code model that fits on a single card at 4-bit quantization.

Skip it if your main use is agentic Cline runs (DeepSeek V4-Flash wins on price and reasoning), or if you only care about Cursor Tab (which you can’t repoint regardless of backend).

FAQ

Is Codestral 2 free? The weights are free under Apache 2.0 — you can download and self-host at no cost. The hosted API is paid: $0.30/M input, $0.90/M output. There’s also a separate Codestral FIM endpoint billed at the same rate.

Can I use Codestral 2 for Cursor’s Tab autocomplete? No. Cursor’s Tab feature runs only on Cursor’s proprietary models. Custom OpenAI-compatible backends, including Codestral 2, only power Chat and Cmd-K in Cursor. For open-model autocomplete, use Continue.dev’s FIM role instead.

What’s the difference between the chat endpoint and the FIM endpoint? The chat endpoint (api.mistral.ai/v1/chat/completions) only sees the code before your cursor. The FIM endpoint (codestral.mistral.ai/v1/fim/completions) sees both the prefix and suffix, so completions fit the surrounding code. Autocomplete should always use FIM; if Continue is calling chat for completions, set provider: mistral and the codestral.mistral.ai apiBase.

How much VRAM do I need to self-host it? At Q4_K_M the weights are about 12–13 GB (22B parameters at roughly 4.5 to 5 bits each), and with a modest context window it runs on a single 16 GB GPU. The full 256K context needs far more memory; treat that as server-class. See the runaihome.com VRAM guide for tiered numbers.

Codestral 2 or DeepSeek V4-Flash for Cline? DeepSeek V4-Flash. It’s cheaper ($0.14 vs $0.30 input) and handles multi-step agentic tasks better. Use Codestral 2 in Cline only for focused single-edit tasks where its code-specialist strength shows.

Last updated June 19, 2026. Pricing and features change frequently; verify current state on the official Mistral pricing page before purchasing.
