Jun 3, 2026

Continue.dev + Ollama 2026: local AI coding setup for VS Code and JetBrains with no API key

By AICoderScope Team · 13 min read

continue-devollamalocal-llmsetup-guidejetbrainsvscodeprivacyqwen

TL;DR: Continue.dev + Ollama gives you free, fully local AI coding in both VS Code and JetBrains — the only open-source combo that covers both IDEs. Setup takes 20 minutes. The one trap that breaks most people: Ollama defaults to a 2,048-token context window and silently discards anything beyond it. Fix that before writing a single line of config.

What you’ll be able to do after this guide: run tab autocomplete, chat, and multi-file edits against a local model — no API key, no internet required, no code leaving your machine.
What you’ll need: a GPU with ≥8 GB VRAM (or Apple Silicon with ≥32 GB unified memory), VS Code or any JetBrains IDE, and about 20 minutes.
Where this setup hits its ceiling: agent tasks that span more than 5 files — at that complexity, cloud models (Claude Sonnet 4.6, GPT-4o) still outperform any local 14B model.

	Continue.dev + Ollama	Cursor Pro	Cline + Ollama
Best for	VS Code and JetBrains, local-only	Best VS Code agent	VS Code agentic local
Price / Cost	$0, no API bill	$20/mo, usage-capped	$0, no API bill
The catch	Agent lags Cursor on complex tasks	No local model option at all	VS Code only, no JetBrains

Honest take: If you’re on IntelliJ, PyCharm, or GoLand and need zero cloud, Continue.dev + Ollama is the only serious option. On VS Code, it ties with Cline for local agent work — choose based on whether you want a guided autonomous agent (Cline) or per-role model control with autocomplete (Continue.dev).

Why Local-Only Matters

Most “privacy-first AI coding” guides send your prompts through a relay. Continue.dev + Ollama is different: the VS Code and JetBrains extension is Apache 2.0 open-source, inference runs on your machine, and the BYOK model means there’s no Continue-operated server in the request path. If you’re working on code under NDA, on a pre-launch product, or under a company policy that prohibits sending source to cloud vendors, this is the setup that actually satisfies those requirements.

The practical check: after setup, pull up your system’s network monitor and start a chat. The only connections you’ll see are local (localhost:11434). Nothing to Anthropic. Nothing to OpenAI. Nothing to Continue servers. That’s verifiable in a way that “we don’t train on your data” policy language is not.

Hardware Floor

The model you can run is bounded by VRAM. Approximate fits for the recommended coding models:

VRAM / Memory	Recommended model	Realistic use case
8 GB VRAM (RTX 4060)	qwen2.5-coder:7b (Q4)	Tab autocomplete only; chat is marginal
12 GB VRAM (RTX 3060 12GB)	qwen2.5-coder:14b (Q4)	Real daily-driver for autocomplete + chat
16 GB VRAM (RTX 4060 Ti 16GB)	qwen2.5-coder:14b (Q5)	Solid local setup
24 GB VRAM (RTX 3090 / RTX 4090)	qwen2.5-coder:32b (Q4)	Best local tier; approaches cloud on single-file tasks
32 GB Apple unified memory (Mac Studio M3 Ultra)	qwen2.5-coder:14b comfortably	macOS sweet spot
64 GB+ Apple unified memory	qwen2.5-coder:32b	Best macOS local setup

The 7B model is tempting because it’s fast, but it fails on anything more complex than single-function completions. For chat and edit tasks where Continue.dev shines, 14B is the practical minimum. For a deeper breakdown of which model fits which hardware, our sister site’s Best Local AI Models by VRAM tier guide covers the full landscape.

Step 1: Install Ollama

Ollama v0.30.2 released June 3, 2026 is the current version. Install:

Linux:

curl -fsSL https://ollama.com/install.sh | sh

macOS / Windows: Download the installer from ollama.com/download and run it.

Verify the install and check the service is running:

ollama --version
# Expected output: ollama version 0.30.2

curl http://localhost:11434/api/tags
# Expected: {"models":[...]} — empty array if no models pulled yet

Ollama runs as a background service on port 11434. On Linux it installs as a systemd service. On macOS it runs as a menu bar app.

Step 2: Pull a Coding Model

Pick based on your VRAM tier from the table above. For the 14B tier:

ollama pull qwen2.5-coder:14b

This downloads approximately 9 GB. Grab a coffee. Verify it arrived:

ollama list
# NAME                    ID              SIZE    MODIFIED
# qwen2.5-coder:14b       abc123def456    9.0 GB  2 minutes ago

If you want a dedicated autocomplete model (faster, lighter), also pull the 1.5B:

ollama pull qwen2.5-coder:1.5b
# ~1.1 GB — runs on any GPU with 2+ GB VRAM, response time under 300ms

Running a quick test before involving Continue.dev is worth the 30 seconds:

ollama run qwen2.5-coder:14b "Write a Python function to flatten a nested list."

If you get a sensible code response, the model and Ollama are working. Now the trap.

Step 3: Fix the Context Window — Do This First

This is the step that causes most Continue.dev + Ollama setups to produce bad output silently. Ollama’s default context window is 2,048 tokens. For a coding assistant that loads your files into context, this is catastrophic: Continue.dev might be sending 8,000 tokens of repo context, and Ollama silently discards everything past token 2,048. The model has no idea it’s missing 75% of the information. The responses look plausible — they’re just wrong.

Set the context window before starting your session. The simplest approach is the environment variable:

# Linux/macOS — set before starting Ollama, or export in ~/.bashrc / ~/.zshrc
export OLLAMA_NUM_CTX=16384

# Windows (PowerShell — add to your profile for persistence)
$env:OLLAMA_NUM_CTX = "16384"

For a permanent per-model fix that doesn’t require an env var, create a Modelfile:

# qwen-coder-ctx.Modelfile
FROM qwen2.5-coder:14b
PARAMETER num_ctx 16384

Then build it as a named local model:

ollama create qwen-coder-ctx -f qwen-coder-ctx.Modelfile

Now reference qwen-coder-ctx in your Continue config instead of the base model. You can verify the context is set:

ollama show qwen-coder-ctx --parameters
# context_length        16384   ← this is what you want to see

16,384 tokens is a safe floor for most coding tasks. For larger codebases or long agent sessions, push to 32,768 if your VRAM allows it (roughly 1–2 GB additional usage).

Step 4: Install Continue.dev in VS Code

Open VS Code, go to the Extensions panel, search for Continue, and install the extension by Continue Dev, Inc. (2.5 million installs as of May 2026, 33,000 GitHub stars). It will appear as a sidebar panel.

On first launch, Continue prompts you to configure a model. Skip the guided setup — you’ll write the config manually in the next step.

Step 5: Install Continue.dev in JetBrains

This is the step no other Continue.dev guide covers specifically, and it’s where the setup differs. In any JetBrains IDE (IntelliJ, PyCharm, GoLand, WebStorm, Rider):

Open Settings → Plugins → Marketplace
Search for Continue
Install and restart the IDE

After restart, a Continue panel appears in the right sidebar (look for the Continue icon — a small AI-assist indicator). You can also open it via View → Tool Windows → Continue.

The critical point for JetBrains users: the config.yaml file is shared with VS Code. Both IDEs read from ~/.continue/config.yaml (macOS/Linux) or %USERPROFILE%\.continue\config.yaml (Windows). Configure it once, and the same setup applies in both editors. This matters if you split your time between VS Code and an IntelliJ-family IDE.

Step 6: Write the config.yaml

The config file lives at:

macOS/Linux: ~/.continue/config.yaml
Windows: %USERPROFILE%\.continue\config.yaml
Project-level override: .continue/config.yaml at repo root (layered on top of global config)

Here is a complete local-only config. No API keys required — it talks exclusively to your Ollama instance:

name: Local-Only Config
version: 1.0.0
schema: v1

models:
  # Fast local model for tab autocomplete — 1.5B runs sub-300ms
  - name: Qwen Coder 1.5B (Autocomplete)
    provider: ollama
    model: qwen2.5-coder:1.5b
    apiBase: http://localhost:11434
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 300
      maxPromptTokens: 1024
      onlyMyCode: true

  # 14B for chat and edit — adjust to qwen2.5-coder:32b if your VRAM allows
  - name: Qwen Coder 14B (Chat)
    provider: ollama
    model: qwen-coder-ctx        # the Modelfile variant with num_ctx 16384
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply

context:
  providers:
    - name: code
    - name: docs
    - name: diff
    - name: terminal
    - name: problems
    - name: folder
    - name: codebase

If you skipped the Modelfile approach and used the env var instead, change model: qwen-coder-ctx to model: qwen2.5-coder:14b.

The roles assignment is what makes this setup efficient. The 1.5B autocomplete model fires on every keystroke — it needs to be fast. The 14B model fires only when you invoke the chat panel or trigger an inline edit. You get both speed and depth from a single Ollama installation.

Save the file. In VS Code or JetBrains, the Continue panel reloads automatically when config.yaml changes. You don’t need to restart the IDE.

Test the Setup

Open any code file and start typing. Within 300ms, you should see ghost-text completions from the 1.5B model. Press Tab to accept.

Open the Continue chat panel (Cmd+L on macOS, Ctrl+L on Windows/Linux) and type:

Explain what this function does and suggest one improvement.

Select a function in your file first — Continue will include it as context. You should get a response from the 14B model within 3–15 seconds depending on your GPU tier.

If you get no autocomplete suggestions, see Troubleshooting below. If chat returns an error about the model not being found, verify the exact model name in ollama list matches the name in your config.yaml.

Verify No Data Leaves Your Machine

After setup, start a coding session and check your network connections in a second terminal:

# Linux/macOS
ss -tnp | grep -E '(11434|anthropic|openai|cursor)'

# Expected: only 127.0.0.1:11434 connections — nothing to external hosts

On Windows, use netstat -ano | findstr 11434. You should see connections to 127.0.0.1:11434 only. If you see outbound connections to any external host, check that your Continue config does not reference an external provider.

Troubleshooting

Autocomplete suggestions don’t appear: Check that the 1.5B model is running (ollama list). Verify debounceDelay isn’t set too high — 300ms is safe. Also check VS Code settings → Continue → verify autocomplete is enabled. In JetBrains, confirm the plugin is enabled and the Continue toolbar is visible.

Chat returns “model not found”: The model name in config.yaml must exactly match the output of ollama list. Including the tag matters — qwen2.5-coder:14b is different from qwen2.5-coder. If you created a Modelfile variant, use its exact name (qwen-coder-ctx in the example above).

Responses are too short or cut off mid-sentence: The context window is still at 2,048. Verify OLLAMA_NUM_CTX is set in the process that launched Ollama. On Linux with systemd, set it in the service’s environment file: /etc/systemd/system/ollama.service.d/override.conf.

JetBrains plugin shows “Loading…” indefinitely: This is typically an IPC issue. Restart the IDE. If it persists, check that Ollama is running (curl http://localhost:11434/api/tags) and that no firewall rule blocks localhost traffic.

Slow chat responses on Apple Silicon: Ensure Ollama is using the Metal backend. Run ollama run qwen2.5-coder:14b "" and look for llm_load_tensors: offloading X layers to GPU in the terminal output. If you see 0 layers offloaded, Ollama isn’t using the GPU — reinstall using the macOS .dmg installer from ollama.com/download.

Where This Setup Hits Its Limits

Be direct about this: local models at 14B lag behind Claude Sonnet 4.6 or GPT-4o on tasks that require holding large context coherently. Refactoring 800 lines across 5 files, understanding cross-package dependencies, or writing a new API endpoint that correctly mirrors the patterns across your whole codebase — these tasks expose the gap.

The setup shines on:

Single-function generation and refactoring
Test generation for isolated units
Explanation and documentation tasks
Quick inline edits where the code is already in the active file

It struggles with:

Multi-file agent tasks that exceed the context window’s effective capacity
Any task that requires understanding patterns spread across an entire monorepo
Real-time streaming during GPU-heavy loads (autocomplete latency spikes)

If you find yourself hitting those limits regularly and the privacy requirement is flexible, the Continue.dev review walks through the hybrid setup: local 1.5B for autocomplete, Claude Sonnet for chat — you get the speed of local with the capability of cloud for the tasks that need it.

FAQ

Does Continue.dev work in JetBrains without an Anthropic API key?
Yes. The config.yaml in the example above uses no external API keys. Every request goes to localhost:11434.

Can I run both the 1.5B and 14B models simultaneously?
Yes, if you have enough VRAM. Ollama keeps models in memory until another model is loaded. With 16+ GB VRAM, both fit. With 8 GB VRAM, Ollama will swap them — autocomplete will stall briefly when the 14B is loaded for chat.

What’s the difference between this guide and the Continue.dev configuration guide?
The configuration guide covers a hybrid setup (local autocomplete + cloud chat) for multi-language projects. This guide is for the fully local case — no cloud API at all, plus JetBrains-specific installation steps.

Does Continue.dev support DeepSeek models in Ollama?
Yes. Replace the model name in config.yaml with any model in your ollama list. deepseek-r1:32b works well for reasoning-heavy tasks if you have 24+ GB VRAM.

Will this work on a CPU-only machine?
Technically yes, but practically no. The 14B model on CPU takes 30–120 seconds per response. Autocomplete at that latency is unusable. You need a GPU.

Sources

Last updated Jun 3, 2026. Ollama version, Continue.dev version, and model availability change frequently — verify current versions before setup.

Recommended Gear

RTX 4060 (8 GB VRAM) — entry-level GPU for local autocomplete
RTX 3060 12 GB — practical floor for local chat (runs 14B Q4)
RTX 4060 Ti 16 GB — solid local coding tier
RTX 3090 (24 GB) — runs 32B Q4, best value for local LLM
RTX 4090 (24 GB) — fastest single-card option for 32B
Mac Studio M3 Ultra (64 GB+) — best macOS local LLM machine

Was this article helpful?