Jun 14, 2026

Gemma 4 QAT + Cline and Continue.dev in 2026: Which Quantized Coding Model Runs in 7GB, 15GB, or 18GB VRAM

By AICoderScope Team · 13 min read

gemmalocal-llmclinecontinue-devollamasetup-guide

TL;DR: Google’s June 5 Gemma 4 QAT release drops 4-bit memory by roughly 72%, which means the 12B fits in about 7GB of VRAM, the 26B-A4B MoE in about 15GB, and the dense 31B in about 18GB — all on hardware you might already own. For inline chat and FIM autocomplete in Continue.dev, the 12B is the sweet spot. For agentic file-editing in Cline, you need 26B-A4B or 31B and Ollama 0.22.1+, or the tool calls silently fail.

	Gemma 4 12B (dense)	Gemma 4 26B-A4B (MoE)	Gemma 4 31B (dense)
4-bit QAT VRAM	~7 GB	~15 GB	~18 GB
Best for	Continue.dev chat + autocomplete	Cline agentic edits on a 16GB card	Cline on 24GB, hardest reasoning
Context window	256K	256K	256K
The catch	Weak on long multi-step agent loops	Needs Ollama 0.22.1+ for tool calls	24GB GPU for comfortable context

Honest take: If you have a single 8GB–16GB GPU and want local AI coding that actually feels useful, run Gemma 4 12B QAT as your Continue.dev chat-and-autocomplete model and stop there. Cline’s agentic loop wants the 26B-A4B on a 16GB+ card — but verify your Ollama version first, because the tool-calling parser was broken until 0.22.1 and that single fact wastes more afternoons than any config typo.

Google DeepMind shipped quantization-aware training (QAT) checkpoints for the whole Gemma 4 family on June 5, 2026 — two days after the Gemma 4 12B base model itself landed. The headline is memory: QAT bakes the quantization into training instead of bolting it on afterward, so the 4-bit checkpoints keep near-original quality while using roughly one-third the VRAM of the bf16 weights. For local AI coding, that is the number that matters. A model that needed a 24GB card last month now runs on a 16GB laptop, and the 12B drops onto an 8GB GPU with headroom to spare.

This guide is about turning that into a working setup: which size to pick for which tool, the exact Ollama tags, the context-window dial you have to set, and the one tool-calling bug that makes Cline look broken when it isn’t.

What “QAT” actually buys you

Standard post-training quantization (PTQ) takes a finished bf16 model and rounds the weights down to 4-bit. It works, but accuracy slips — and on code, where one wrong token breaks a build, that slip is expensive. QAT runs the quantization math during training, so the model learns weights that survive the rounding. Google reports the QAT 4-bit checkpoints land closer to full precision than naive PTQ, at about 72% less memory.

The concrete savings, from Google’s own figures: the dense 31B at 16-bit is roughly 60GB; the 4-bit QAT checkpoint lands in the 17–19GB range. That is the difference between needing two 3090s and needing one. Here is how the family shakes out for a coding box:

Model	Type	Active params	4-bit QAT VRAM	Realistic GPU
Gemma 4 E2B	dense	2B	<1 GB (text-only)	Any laptop / iGPU
Gemma 4 E4B	dense	4B	~3 GB	6GB GPU
Gemma 4 12B	dense	12B	~7 GB	8GB GPU (RTX 4060)
Gemma 4 26B-A4B	MoE	4B of 26B	~15 GB	16GB GPU / 16GB Mac
Gemma 4 31B	dense	31B	~18 GB	24GB GPU (RTX 3090/4090)

The 26B-A4B is the interesting one. It is a Mixture-of-Experts model: 26B total parameters but only ~4B active per token. So it loads like a 15GB model but runs at the speed of a 4B model while reasoning with the breadth of something much larger. On a 16GB card it is the best agentic-coding value in the lineup right now.

All sizes carry a 256K context window except E2B/E4B (128K). Every variant handles text and images; E2B, E4B, and 12B also do video and audio natively — irrelevant for coding, but it explains why the base downloads are larger than a text-only model of the same size.

Grab the right GGUF — and skip the naive Q4_0

There is a quality trap in the file naming. The naive Q4_0 conversion of Gemma 4 degrades accuracy more than it should, even though the file is larger than smarter quants. The community fix is Unsloth’s UD-Q4_K_XL dynamic GGUFs, which apply different bit-widths to different layers and recover most of the lost accuracy. Google also publishes official q4_0-gguf and w4a16-ct checkpoints on Hugging Face, but for local runs the Unsloth dynamic quants are the safer default.

In practice:

12B is published by Ollama directly under a stock -it-qat tag, so you can just pull it.
E4B, 26B-A4B, and 31B are distributed as Unsloth dynamic GGUFs — pull them from Hugging Face or point Ollama at the GGUF.

# 12B QAT — the easy path, straight from Ollama's library
ollama pull gemma4:12b-it-qat

# 26B-A4B and 31B — Unsloth dynamic GGUF (recommended quant)
# (download the UD-Q4_K_XL file from huggingface.co/unsloth, then:)
ollama create gemma4-26b-qat -f Modelfile

A minimal Modelfile for the larger Unsloth GGUFs looks like this:

FROM ./gemma-4-26B-it-qat-UD-Q4_K_XL.gguf
PARAMETER num_ctx 16384
PARAMETER temperature 0.7
PARAMETER top_p 0.95

One non-negotiable: use Ollama 0.22.1 or newer. The 0.22.1 release ships a rewritten Gemma 4 renderer that finally handles the model’s explicit thinking mode and tool calling locally. Earlier builds (through 0.20.1) had a broken tool-call parser — the model would emit a valid function call and Ollama would hand back plain text. You will not see an error; the agent just acts like a chatbot. Check before you debug anything else:

$ ollama --version
ollama version is 0.22.1

Continue.dev: the 12B is your daily driver

Continue.dev gives you Copilot-style chat, inline edits, and tab-autocomplete pointed at a local model — and the 12B QAT is the right size for all three on an 8GB–16GB machine. Continue added native Gemma 4 model support in June, so the config is straightforward.

Two roles matter here, and they want different treatment. The chat/edit role is where the 12B shines: it is smart enough to explain a function, refactor a block, or write a test, and at ~21 tokens/second on an RTX 4060 (community-measured, llama.cpp) it is fast enough to feel interactive. The autocomplete role uses fill-in-the-middle (FIM): Continue sends the prefix and suffix of your file and asks the model to predict the middle. Gemma 4 can do this, but a 12B is heavier than you want firing on every keystroke — many people pair it with a small dedicated FIM model and keep the 12B for chat.

A working config.yaml that splits the roles:

models:
  - name: Gemma 4 12B (chat)
    provider: ollama
    model: gemma4:12b-it-qat
    roles:
      - chat
      - edit
    defaultCompletionOptions:
      contextLength: 16384

  - name: Gemma 4 E4B (autocomplete)
    provider: ollama
    model: gemma4:e4b-it-qat
    roles:
      - autocomplete

The contextLength line is the dial people forget. Ollama defaults to a small context (historically 2K–4K depending on build), and if you do not raise num_ctx on the model and tell Continue to use it, your “256K context” model will silently truncate your file at a few thousand tokens. Set both. For a coding workload, 16K is a sane floor; push to 32K if your VRAM allows, because every extra token of KV cache eats memory on top of the ~7GB the weights already use.

If you want the full walkthrough of Continue + Ollama wiring, the Continue.dev + Ollama local setup guide covers the VS Code and JetBrains paths, and the Continue.dev + LM Studio guide covers the GGUF-via-LM-Studio route if you prefer that runner.

Cline: agentic editing needs the bigger model and the right Ollama

Cline is a different animal. It does not just suggest — it runs an agentic loop: read files, plan, write edits, run a terminal command, check the result, iterate. That loop lives or dies on tool calling. The model has to emit structured function calls reliably, dozens of times per task, without drifting into “here’s what I would do” prose.

This is exactly where Gemma 4 gets tricky. The model family does have native function calling, and it is genuinely capable — but the local-runtime support has been catching up. As one developer put it after testing Gemma 4 in an agentic shell pre-fix: it “worked as chat only — couldn’t retrieve files, couldn’t call tools.” That was the Ollama parser problem, not the model. With Ollama 0.22.1, the tool calls flow through correctly.

So the Cline checklist is short but strict:

Ollama 0.22.1+. Non-negotiable, for the reason above.
26B-A4B minimum, 31B if you have 24GB. The 12B can technically tool-call, but it loses the thread on multi-step tasks — it forgets it already edited a file, or re-reads the same one in a loop. The 26B-A4B’s broader reasoning holds the plan together far better, and because it is MoE it stays fast.
Raise num_ctx hard. Agentic loops accumulate context fast — file contents, tool outputs, history. A 4K window guarantees the agent forgets the task. Give it at least 32K; the 256K ceiling exists for exactly this.

Point Cline at your local Ollama by selecting the “Ollama” provider and the model tag you created. If you hit the classic symptom — the agent describes actions instead of taking them, or loops re-reading the same file — that is the tool-call path failing, and our Cline + Ollama tool-use loop fix walks through the same class of bug with a different model. The Cline + LM Studio setup is the alternative if you would rather run the GGUF through LM Studio’s server than Ollama.

The problem I keep seeing, and the fix

The single most common failure report for “Gemma 4 + Cline doesn’t work” is not a config error — it is a version error. People install Cline, pull gemma4, point them at each other, and watch the agent answer in chat-only mode. They blame the model. The actual cause, nine times out of ten, is an Ollama build older than 0.22.1 with the broken Gemma 4 tool parser. Upgrade Ollama, restart the server, recreate the model, and the same setup that “didn’t work” starts editing files. Check ollama --version before you touch anything in Cline’s config.

How good is Gemma 4 at code, honestly?

Good, not class-leading. On the public comparison numbers, Gemma 4 31B posts roughly 92.1% on HumanEval, 90.3% on MBPP, 64.7% on LiveCodeBench, 61.4% on SWE-Bench Verified, and 66.9% on Aider polyglot edit accuracy. Those are solid open-weight figures — but Qwen 3.6 72B beats it on every one of them (94.8% HumanEval, 71.4% LiveCodeBench, 68.2% SWE-Bench Verified, 74.6% Aider polyglot). Treat cross-source benchmark numbers as directional: different harnesses and benchmark versions inflate or deflate them, and Gemma 4 31B has separately been reported at ~80% on LiveCodeBench v6, which only underscores that the version matters.

What this means in practice: Gemma 4 is not the model you pick to top a leaderboard. You pick it because it runs on your machine, with no API bill, no data leaving the box, and a 256K window — and at that job, in the 7–18GB VRAM band, it is one of the better options available right now. If raw coding accuracy is all you care about and you can spare the VRAM, a Qwen3-Coder variant will edit more reliably; see our Cursor + Ollama local model setup for routing local models into Cursor’s Chat and Cmd+K too.

Which size, which tool — the decision

8GB GPU (RTX 4060, etc.): Gemma 4 12B QAT in Continue.dev for chat and edits; pair with E4B for autocomplete. Skip agentic Cline at this tier — it will frustrate you.
16GB GPU or 16GB Mac: Gemma 4 26B-A4B QAT. This is the value pick: Continue.dev for inline work and Cline for light agentic tasks, all on one model. Verify Ollama 0.22.1+.
24GB GPU (RTX 3090/4090): Gemma 4 31B QAT with a generous 32K–64K context for serious Cline sessions, or run 26B-A4B and spend the spare VRAM on a longer window.

For the hardware side of this decision — which GPU actually clears the 15GB and 18GB bars without thermal throttling, and whether a used 3090 still beats a new mid-range card for local LLM coding — see runaihome.com’s GPU buying guide for local AI. Gemma, Ollama, Continue.dev, and Cline are all open source or open weight; if you are building an all-FOSS coding stack, aifoss.dev tracks the wider ecosystem.

FAQ

Do I need the QAT version, or will the regular Gemma 4 work? The non-QAT model works, but you will use ~3× the VRAM for the same quality at 4-bit. QAT is the whole point of running locally — it is what puts the 31B on a single 24GB card. Always grab the QAT checkpoint for local coding.

Why does Cline answer in chat instead of editing files? Almost always an Ollama version older than 0.22.1 with the broken Gemma 4 tool-call parser. Run ollama --version, upgrade if needed, restart the server, and recreate your model. The model is fine; the runtime was dropping the function calls.

Q4_0 or UD-Q4_K_XL — does it really matter? Yes. Naive Q4_0 degrades Gemma 4 accuracy noticeably despite a larger file. Unsloth’s UD-Q4_K_XL dynamic quant recovers most of that loss. For the 12B, Ollama’s -it-qat tag already handles this; for 26B-A4B and 31B, use the Unsloth dynamic GGUF.

Can I run this for autocomplete like Copilot? Yes, via Continue.dev’s autocomplete role, which uses fill-in-the-middle. The 12B works but is heavier than ideal for every-keystroke completion; pairing E4B (autocomplete) with 12B (chat) gives you snappier suggestions and smarter chat.

Will the 256K context actually fit in 7GB? No. The 7GB is weights only. KV cache for a long context adds gigabytes on top — a full 256K window needs far more memory than the model itself. For coding, set num_ctx to 16K–32K and you stay comfortable; reserve the big windows for machines with VRAM to burn.

Sources

Last updated June 14, 2026. Pricing and features change frequently; verify current state before purchasing or deploying.

Was this article helpful?