Continue.dev + LM Studio 2026: setup guide, the context-window dial you must set before loading, and which GGUF models pass the FIM test
TL;DR: Continue.dev v1.3.38 + LM Studio 0.4.15 gives you local AI coding in both VS Code and JetBrains, with a GUI model browser, automatic CUDA detection, and optional remote GPU access via LM Link. One trap stops most setups before they produce good output: LM Studio’s context window defaults to 4,096 tokens, and you must increase it in the model settings before loading, not after. Miss that step and the model silently sees only part of what Continue.dev sent: roughly a quarter of a 16,000-token request.
What you’ll be able to do after this guide:
- Serve any GGUF coding model from LM Studio at http://localhost:1234/v1
- Configure Continue.dev with separate model roles (a lightweight 1.5B for tab autocomplete, a 14B or 32B for chat and edits) using a single config.yaml
- Get fill-in-the-middle (FIM) tab completions working in VS Code and JetBrains
| | Continue.dev + LM Studio | Continue.dev + Ollama | Cursor Pro |
|---|---|---|---|
| Best for | Windows + GUI model browser + LM Link | macOS / Linux, CLI-first | Best-in-class VS Code agent |
| Price / Cost | $0, no API bill | $0, no API bill | $20/mo, usage-capped |
| The catch | LM Studio is a multi-hundred-MB GUI app; no headless install | No GUI, needs CUDA path setup on Windows | No local model option at all |
Honest take: On Windows, LM Studio is the lower-friction path to local Continue.dev — CUDA auto-detection and a visual model browser beat Ollama’s CLI for developers who don’t want to wrangle environment variables. On macOS or Linux, the Continue.dev + Ollama guide is simpler. Choose LM Studio if you’re Windows-primary or want the LM Link remote-GPU feature.
What Continue.dev does differently with LM Studio vs Ollama
Continue.dev’s Ollama provider talks to Ollama’s native REST API (/api/generate, /api/chat, /api/tags) and uses Ollama’s FIM detection via the Modelfile template. The LM Studio provider takes a different path: it extends Continue.dev’s OpenAI class and points at LM Studio’s OpenAI-compatible server (http://localhost:1234/v1).
This means:
FIM works differently. For tab autocomplete, Continue.dev calls LM Studio’s /v1/completions endpoint with a suffix parameter — the standard OpenAI-compatible FIM path. This works reliably with Qwen2.5-Coder models (which include FIM training) and DeepSeek-Coder models. It fails silently with models that weren’t trained for FIM, producing generic “complete from where I left off” suggestions rather than true fill-in-the-middle completions.
The model name is mostly decorative. Unlike Ollama, where Continue.dev queries /api/tags to verify the model exists, LM Studio with a single model loaded answers with that model regardless of the name in your request. The model field in config.yaml is still sent in the API call, but it does not pick what runs; whatever you loaded in the GUI does. This simplifies configuration but means you must manually pre-load the right model before starting your coding session. The name only starts to matter when two models are loaded at once, where it is what tells them apart, so keep it matched to the id the server reports.
Context length is a GUI setting, not an environment variable. Ollama has the OLLAMA_CONTEXT_LENGTH environment variable and the num_ctx parameter in per-model Modelfiles. In LM Studio, context length is configured at model-load time in the settings panel, and the default (4,096 tokens) is not enough for Continue.dev’s typical request size.
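To make the FIM path described above concrete, here is a minimal sketch of the kind of request body involved: an OpenAI-style completions call where `prompt` carries the code before the cursor and `suffix` carries the code after it. The helper names (`build_fim_request`, `send_completion`), the temperature, and the token limit are illustrative assumptions, not Continue.dev’s actual implementation.

```python
import json
import urllib.request

API_BASE = "http://localhost:1234/v1"  # LM Studio's default server address


def build_fim_request(prefix: str, suffix: str, model: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style completions body; `suffix` is what makes it FIM."""
    return {
        "model": model,
        "prompt": prefix,      # code before the cursor
        "suffix": suffix,      # code after the cursor
        "max_tokens": max_tokens,
        "temperature": 0.2,    # assumed value, chosen only for illustration
    }


def send_completion(body: dict, api_base: str = API_BASE) -> str:
    """POST the body to /v1/completions and return the generated text."""
    req = urllib.request.Request(
        f"{api_base}/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["choices"][0]["text"]


body = build_fim_request(
    prefix="def add(a, b):\n    ",
    suffix="\n\nprint(add(1, 2))\n",
    model="qwen2.5-coder-1.5b-instruct",
)
print(json.dumps(body, indent=2))
# With the LM Studio server running: print(send_completion(body))
```

The network call is kept in its own function so the request shape can be inspected without a server running.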
Hardware floor
| GPU / VRAM | Recommended model | Notes |
|---|---|---|
| RTX 4060 8 GB | Qwen2.5-Coder 7B Q4_K_M | Autocomplete only; chat produces marginal results |
| RTX 3060 12 GB | Qwen2.5-Coder 14B Q4_K_M | Practical floor for chat + edit; autocomplete on 1.5B separately |
| RTX 4060 Ti 16 GB | Qwen2.5-Coder 14B Q6_K | Solid daily-driver |
| RTX 3090 / RTX 4090 24 GB | Qwen2.5-Coder 32B Q4_K_M | Best local tier; Devstral Small 2 Q4_K_M also fits here |
| Mac M3/M4 unified memory | Use Ollama + MLX instead | LM Studio on Apple Silicon runs but Ollama + MLX is measurably faster |
LM Studio runs noticeably slower than Ollama on Apple Silicon because the macOS build still uses llama.cpp’s Metal path while Ollama has better integrated MLX support. If you’re on a Mac, the Continue.dev + Ollama guide will get you better performance. For hardware selection context, runaihome.com’s local AI model by VRAM tier guide covers the landscape in detail.
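As a quick way to read the hardware table, the sketch below turns its NVIDIA rows into a lookup. The tier values are copied from the table; the function name `recommend` and the "largest tier that fits" rule are assumptions made for illustration.

```python
from typing import Optional

# (minimum VRAM in GB, recommended model) rows copied from the table above,
# ordered from largest tier to smallest. The Apple Silicon row is left out.
TIERS = [
    (24, "Qwen2.5-Coder 32B Q4_K_M"),
    (16, "Qwen2.5-Coder 14B Q6_K"),
    (12, "Qwen2.5-Coder 14B Q4_K_M"),
    (8, "Qwen2.5-Coder 7B Q4_K_M"),
]


def recommend(vram_gb: float) -> Optional[str]:
    """Return the recommendation for the largest tier that fits, else None."""
    for min_vram, model in TIERS:
        if vram_gb >= min_vram:
            return model
    return None


for card_vram in (6, 8, 12, 16, 24):
    print(card_vram, "GB ->", recommend(card_vram))
```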
Step 1: Install LM Studio 0.4.15
Download from lmstudio.ai. The current stable release is 0.4.15 (build 2, released May 29, 2026). It ships as a single executable installer — .exe on Windows, .dmg on macOS, and AppImage/deb on Linux.
On Windows: run the installer. It detects your CUDA version automatically and installs the matching runtime. No manual CUDA path configuration needed.
On Linux:
chmod +x LM-Studio-0.4.15-x86_64.AppImage
./LM-Studio-0.4.15-x86_64.AppImage --no-sandbox
After launch, go to Settings → Developer Mode and toggle it on. This unlocks the local server controls and the parallel inference settings from LM Studio 0.4.0 onward.
Step 2: Download a coding model
Open the Discover tab and search for qwen2.5-coder. LM Studio shows available GGUF quantizations alongside estimated VRAM usage for each. For a 24 GB card, select Q4_K_M of the 32B variant (approximately 20 GB, leaving headroom for a 32k context window). For 12–16 GB cards, use the 14B at Q4_K_M (approximately 9 GB).
For the separate autocomplete model (recommended — it fires on every keystroke and needs to be fast), also download qwen2.5-coder-1.5b:
# Using the lms CLI that ships with LM Studio 0.4.x
lms get qwen2.5-coder-1.5b-instruct
# Verify download
lms ls
# Expected output: a list of model paths in your LM Studio models directory
The lms CLI is added to your PATH the first time LM Studio launches. If the command isn’t found, launch LM Studio once and then open a fresh terminal.
Step 3: Set the context window — before loading, not after
This is where most Continue.dev + LM Studio setups silently break.
LM Studio defaults to a 4,096-token context window for most models. Continue.dev sends significantly more — file context, conversation history, and retrieved snippets combined can easily hit 8,000–16,000 tokens depending on your project size. When Continue.dev sends more than the loaded context window allows, LM Studio truncates the oldest tokens silently. The model never sees the earlier context. Responses look plausible but are based on an incomplete picture.
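The effect of that truncation can be sketched in a few lines. This is a simplified model that assumes the newest tokens are kept and the oldest dropped, as described above, and ignores the room a real runtime reserves for the generated reply; the function names are hypothetical.

```python
def visible_fraction(prompt_tokens: int, n_ctx: int) -> float:
    """Fraction of the prompt that survives if only the newest n_ctx tokens are kept."""
    if prompt_tokens < 1 or n_ctx < 1:
        raise ValueError("prompt_tokens and n_ctx must be positive")
    return min(prompt_tokens, n_ctx) / prompt_tokens


def truncate_oldest(tokens: list[int], n_ctx: int) -> list[int]:
    """Keep the newest n_ctx tokens, dropping the oldest ones first."""
    if n_ctx < 1:
        raise ValueError("n_ctx must be positive")
    return tokens[-n_ctx:]


for sent in (8_000, 16_000):
    share = visible_fraction(sent, 4096)
    print(f"{sent} tokens sent, 4096 loaded: {share:.0%} visible")
```

At the default window, about half of an 8,000-token request survives and about a quarter of a 16,000-token one.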
To fix this, set the context length in the model configuration before you click Load:
- In the left sidebar, click on the model you want to load
- In the right-side configuration panel, find Context Length (labeled n_ctx in some versions)
- Set it to at least 16384, which covers most coding tasks
- For large codebases or long agent conversations, set it to 32768 (requires approximately 2–4 GB extra VRAM depending on the model)
- Click Load Model
The context length is baked in at load time. If you change it, you must unload and reload the model.
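One way to see where the extra VRAM for a larger window goes is the key/value cache, which grows linearly with context length. The sketch below uses the standard size formula for a transformer with grouped-query attention. The architecture numbers (48 layers, 8 KV heads, head dimension 128) are assumed values for a 14B-class model, and the fp16 cache is also an assumption, so check the model’s own config before relying on the result.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for a full context window.

    Two tensors (K and V) per layer, each holding n_kv_heads * head_dim
    values per token. bytes_per_value=2 assumes an fp16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


GIB = 1024 ** 3

# Assumed architecture for a 14B-class model; not read from any real file.
layers, kv_heads, head_dim = 48, 8, 128

for n_ctx in (4096, 16384, 32768):
    size = kv_cache_bytes(n_ctx, layers, kv_heads, head_dim)
    print(f"n_ctx={n_ctx}: {size / GIB:.2f} GiB")
```

Under these assumptions, going from 16384 to 32768 tokens adds 3 GiB, which sits inside the 2–4 GB range quoted above. A runtime that stores the cache at lower precision would need less.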
You can verify the context window is set correctly from the lms CLI after loading:
lms status
# Expected output includes: Context Length: 16384 (or whatever you set)
If you see Context Length: 4096 after loading, you changed the setting while the model was already loaded — it won’t apply until you reload.
Step 4: Start the local server
In LM Studio’s Developer panel, click Start Server. The server starts on port 1234 by default. Verify it’s responding:
curl http://localhost:1234/v1/models
# Expected: {"object":"list","data":[{"id":"<your model name>","object":"model",...}]}
This response tells you two things: the server is running, and the model’s API ID. Note the id value from this response — you’ll use it in the config.yaml.
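If you would rather pull the id out programmatically, a small sketch along these lines works against the response shape shown in the curl output. The function names are hypothetical, and the sample payload is hand-written to match that shape rather than captured from a real server.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract the `id` of every entry in an OpenAI-style model list."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(api_base: str = "http://localhost:1234/v1") -> list[str]:
    """Query the running server; raises URLError if nothing is listening."""
    with urllib.request.urlopen(f"{api_base}/models", timeout=5) as resp:
        return model_ids(json.load(resp))


# Offline example using a payload shaped like the curl output above.
sample = {"object": "list", "data": [{"id": "qwen2.5-coder-14b-instruct", "object": "model"}]}
print(model_ids(sample))
# With the server running: print(fetch_model_ids())
```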
If you need a different port (e.g., conflict with another service), change it in the Developer panel before starting. Update the apiBase in your config.yaml to match.
Step 5: Install Continue.dev
VS Code: Open Extensions (Ctrl+Shift+X / Cmd+Shift+X), search Continue, install the extension by Continue Dev, Inc. (v1.3.38 as of this writing). It appears as a sidebar panel.
JetBrains: Open Settings → Plugins → Marketplace, search Continue, install, and restart the IDE. The Continue panel appears in the right sidebar. Both VS Code and JetBrains read from the same ~/.continue/config.yaml, so configuring once covers both.
Step 6: Write the config.yaml
The config file lives at:
- macOS/Linux: ~/.continue/config.yaml
- Windows: %USERPROFILE%\.continue\config.yaml
- Project override: .continue/config.yaml at repo root (layered on top of global config)
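To check which of these files exist on a given machine, a short sketch like the following can help. It only locates the files; how Continue.dev merges the project override with the global config is not modeled here, and the function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def global_config_path() -> Path:
    """~/.continue/config.yaml; Path.home() uses %USERPROFILE% on Windows."""
    return Path.home() / ".continue" / "config.yaml"


def project_config_path(repo_root: Path) -> Path:
    """The per-repository override location."""
    return repo_root / ".continue" / "config.yaml"


def existing_configs(repo_root: Optional[Path] = None) -> list[Path]:
    """List the config files that actually exist, global first."""
    candidates = [global_config_path()]
    if repo_root is not None:
        candidates.append(project_config_path(repo_root))
    return [p for p in candidates if p.is_file()]


print(global_config_path())
print(existing_configs(Path.cwd()))
```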
Here is a complete LM Studio config. With one model loaded at a time, load the 14B in LM Studio when you want chat and edit, or the 1.5B when you want autocomplete. If your VRAM allows running two models simultaneously (LM Studio 0.4.x supports this in Developer Mode), load both and Continue.dev will route correctly by role.
name: LM Studio Local
version: 1.0.0
schema: v1
models:
  # Tab autocomplete: small, fast FIM model
  # Load qwen2.5-coder-1.5b-instruct in LM Studio for this role
  - name: Qwen 1.5B (Autocomplete)
    provider: lmstudio
    model: qwen2.5-coder-1.5b-instruct
    apiBase: http://localhost:1234/v1
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 300
      maxPromptTokens: 1024
      onlyMyCode: true
  # Chat and edit: 14B or 32B depending on your VRAM
  # Load qwen2.5-coder-14b-instruct or 32b in LM Studio for these roles
  - name: Qwen 14B (Chat + Edit)
    provider: lmstudio
    model: qwen2.5-coder-14b-instruct
    apiBase: http://localhost:1234/v1
    roles:
      - chat
      - edit
      - apply
context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
If you can only run one model at a time (8–12 GB VRAM), simplify the config to a single model covering all roles:
models:
  - name: Qwen 14B (All Roles)
    provider: lmstudio
    model: qwen2.5-coder-14b-instruct
    apiBase: http://localhost:1234/v1
    roles:
      - chat
      - edit
      - apply
      - autocomplete
With a single model on autocomplete duty, tab completions will be slower (2–5 seconds on a 14B vs under 500ms on a 1.5B) but they’ll be higher quality.
Save the file. Continue.dev reloads config.yaml automatically — no IDE restart needed.
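To picture how roles map to models in the two-model config, here is a minimal sketch that mirrors its `models` section as plain Python data. The "first entry that lists the role wins" rule is an assumption made for illustration; it is not taken from Continue.dev’s source.

```python
from typing import Optional

# Mirrors the `models` section of the two-model config as plain Python data.
MODELS = [
    {"name": "Qwen 1.5B (Autocomplete)", "model": "qwen2.5-coder-1.5b-instruct",
     "roles": ["autocomplete"]},
    {"name": "Qwen 14B (Chat + Edit)", "model": "qwen2.5-coder-14b-instruct",
     "roles": ["chat", "edit", "apply"]},
]


def model_for_role(role: str, models: list[dict] = MODELS) -> Optional[str]:
    """Return the `model` id of the first entry that lists the role, else None."""
    for entry in models:
        if role in entry["roles"]:
            return entry["model"]
    return None


for role in ("autocomplete", "chat", "edit", "apply", "embed"):
    print(f"{role:>12} -> {model_for_role(role)}")
```

A role that no entry lists, such as `embed` here, resolves to nothing, which is a handy way to spot a role you forgot to assign.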
The FIM autocomplete test
Fill-in-the-middle completions are what makes tab autocomplete actually useful. Without FIM, the model can only continue from the cursor position forward — it doesn’t know what comes after. With FIM, it sees both the prefix (code before cursor) and suffix (code after cursor) and fills in only the missing part.
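The prefix/suffix idea can be sketched directly. The snippet below splits a buffer at a cursor offset and lays the two halves out with the FIM special tokens that Qwen2.5-Coder was trained on. Which component assembles this string in the Continue.dev + LM Studio path is not covered here, and the helper names are hypothetical; the point is only to show what "seeing both sides" looks like.

```python
def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split a buffer into the text before and after a character offset."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor is outside the buffer")
    return source[:cursor], source[cursor:]


def qwen_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out prefix and suffix with Qwen2.5-Coder's FIM special tokens."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"


source = "def area(w, h):\n    \n    return result\n"
cursor = source.index("    \n") + 4   # end of the empty indented line
prefix, suffix = split_at_cursor(source, cursor)

print(repr(prefix))
print(repr(suffix))
print(qwen_fim_prompt(prefix, suffix))
```

The model is expected to generate only the middle, so a good answer here defines `result` and stops before the existing `return` line.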
To verify FIM is working, open a code file and add a gap in the middle of a function:
def calculate_discount(price: float, pct: float) -> float:
    # cursor goes on the blank line below; type nothing, just wait

    return discounted
Position your cursor on the blank line between the comment and the return statement. Within 300–500ms you should see a ghost-text suggestion like discounted = price * (1 - pct / 100). If you see a suggestion that tries to continue after the return statement (suggesting new functions or code past the end of the file), FIM is not working — the model is doing standard completion, not fill-in-the-middle.
If FIM completions aren’t appearing at all, check:
- Autocomplete is enabled in VS Code: Settings → Continue → enable tab autocomplete
- The model is FIM-capable: Qwen2.5-Coder and DeepSeek-Coder models support it; generic instruction-tuned models (Llama 3, Mistral Instruct) typically do not
- LM Studio’s /v1/completions endpoint is responding: curl -s http://localhost:1234/v1/completions -H 'Content-Type: application/json' -d '{"model":"test","prompt":"def hello","max_tokens":10}' should return a text completion
Troubleshooting
“Loading…” in JetBrains that never resolves: LM Studio’s server is either not started or the port isn’t accessible. Verify: open the Developer panel in LM Studio and confirm the server shows “Running on port 1234.” Then: curl http://localhost:1234/v1/models from your terminal — if this times out, check whether a firewall rule is blocking localhost traffic.
Chat returns nothing or cuts off mid-sentence: Context window too small. Stop the current chat, go to LM Studio → unload the model → raise Context Length in model settings → reload. Confirm with lms status that the context length is what you set.
Tab completions work for forward-continuation but not FIM: The loaded model doesn’t support fill-in-the-middle. Switch to Qwen2.5-Coder or DeepSeek-Coder-V2-Lite. Both are available in LM Studio’s Discover tab and both have FIM training.
Model not responding after switching models in LM Studio GUI: Continue.dev’s connection to the server survives model switches because LM Studio keeps the server running. But there’s a brief window (3–10 seconds) after loading a new model where the API may return 503. If you see errors in Continue’s output panel, wait a moment and retry.
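If you script against the server yourself, the same "wait a moment and retry" advice can be written as a small helper. This is a sketch under assumptions: the name `retry_on_503` is hypothetical, and the defaults (five attempts, two seconds apart) are picked only to cover the 3–10 second window mentioned above.

```python
import time
import urllib.error
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_503(call: Callable[[], T], attempts: int = 5, delay_s: float = 2.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying only when it raises an HTTP 503."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 503, and the final 503 as well.
            if err.code != 503 or attempt == attempts:
                raise
            sleep(delay_s)
    raise AssertionError("unreachable: the loop returns or raises")


# Offline demonstration: fail twice with 503, then succeed.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.HTTPError("http://localhost:1234/v1/models", 503,
                                     "Service Unavailable", None, None)
    return "ok"


print(retry_on_503(flaky, sleep=lambda _s: None))
```

Passing `sleep` as a parameter keeps the demonstration instant; in real use the default `time.sleep` applies.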
Remote access via LM Link: LM Studio 0.4.15 added end-to-end encrypted remote connections via Tailscale. If you’re serving from a desktop GPU to a laptop, install Tailscale on both machines, enable LM Link in LM Studio’s Developer panel, and replace http://localhost:1234/v1 with your LM Link endpoint URL in apiBase. The rest of the config.yaml is identical.
Where this setup hits its ceiling
Continue.dev + LM Studio at 14B gives you solid single-file work: function generation, test writing, inline documentation, quick refactors. The ceiling shows on tasks that require coherent reasoning across multiple files or iterative tool calls — multi-step edit sessions where the model must remember what it changed three steps ago, or codebase-level questions that require reasoning about how a dozen files relate.
At those limits, the honest options are: upgrade to a 32B model on a 24 GB card (Devstral Small 2 Q4_K_M or Qwen2.5-Coder 32B Q4_K_M), or accept the hybrid setup: local 1.5B for autocomplete, cloud API (Claude Sonnet 4.6 or GPT-4o) for chat and edit in Continue.dev. The Continue.dev review covers that hybrid configuration in detail.
FAQ
Can I use Continue.dev + LM Studio with JetBrains and VS Code at the same time?
Yes. Both IDEs read the same ~/.continue/config.yaml, and LM Studio’s server accepts requests from both. The only bottleneck is the GPU: requests from the two IDEs queue against the same loaded model.
Do I need to change the model name in config.yaml when I switch models in LM Studio?
Technically no — LM Studio routes API calls to the currently loaded model regardless of the name passed. But update the name for clarity. Leaving a stale model name in config.yaml causes confusion when reviewing logs or troubleshooting.
Does Continue.dev + LM Studio work offline, with no internet?
Yes, once the extension is installed and the model is downloaded. No internet connection is required for inference; the only connections it needs are to localhost:1234.
Is the lmstudio provider better or worse than the ollama provider for Continue.dev?
For Windows developers: LM Studio is more reliable due to better CUDA detection. For macOS/Linux: Ollama is faster because it has native API endpoints and better MLX integration on Apple Silicon. Feature-wise they’re equivalent — both support FIM, chat, and edit roles.
What’s the minimum VRAM to run both autocomplete and chat models simultaneously?
For a 1.5B autocomplete model (Q4, ~1 GB) plus a 14B chat model (Q4_K_M, ~9 GB), you need approximately 10–11 GB available. An RTX 3060 12 GB works; an RTX 4060 8 GB does not.
Sources
- Continue.dev v1.3.38-vscode release — GitHub
- Continue.dev LM Studio provider source — GitHub
- LM Studio bug tracker — lmstudio-ai/lmstudio-bug-tracker
- Continue.dev + Ollama 2026: local AI coding setup — AICoderScope
- Cline + LM Studio 2026: complete setup guide — AICoderScope
- Continue.dev Review 2026 — AICoderScope
- Best Local AI Models by VRAM tier — runaihome.com
Last updated Jun 5, 2026. LM Studio version and Continue.dev version change frequently — verify current releases before setup.