Jun 5, 2026

Continue.dev + LM Studio 2026: setup guide, the context-window dial you must set before loading, and which GGUF models pass the FIM test

By AICoderScope Team · 14 min read

Tags: continue-dev, lm-studio, local-llm, setup-guide, vscode, jetbrains, qwen, fim, autocomplete, privacy

TL;DR: Continue.dev v1.3.38 + LM Studio 0.4.15 gives you local AI coding in both VS Code and JetBrains, with a GUI model browser, automatic CUDA detection, and optional remote GPU access via LM Link. One trap stops most setups before they produce good output: LM Studio’s context window defaults to 4,096 tokens, and you must increase it in the model settings before loading, not after. Miss that step and the model silently sees only the most recent 4,096 tokens of each request, which is a quarter or less of a typical 16,000-token Continue.dev prompt.

What you’ll be able to do after this guide:

Serve any GGUF coding model from LM Studio at http://localhost:1234/v1
Configure Continue.dev with separate model roles — a lightweight 1.5B for tab autocomplete, a 14B or 32B for chat and edits — using a single config.yaml
Get fill-in-the-middle (FIM) tab completions working in VS Code and JetBrains

	Continue.dev + LM Studio	Continue.dev + Ollama	Cursor Pro
Best for	Windows + GUI model browser + LM Link	macOS / Linux, CLI-first	Best-in-class VS Code agent
Price / Cost	$0, no API bill	$0, no API bill	$20/mo, usage-capped
The catch	LM Studio is a multi-hundred-MB GUI app; no headless install	No GUI, needs CUDA path setup on Windows	No local model option at all

Honest take: On Windows, LM Studio is the lower-friction path to local Continue.dev — CUDA auto-detection and a visual model browser beat Ollama’s CLI for developers who don’t want to wrangle environment variables. On macOS or Linux, the Continue.dev + Ollama guide is simpler. Choose LM Studio if you’re Windows-primary or want the LM Link remote-GPU feature.

What Continue.dev does differently with LM Studio vs Ollama

Continue.dev’s Ollama provider talks to Ollama’s native REST API (/api/generate, /api/chat, /api/tags) and uses Ollama’s FIM detection via the Modelfile template. The LM Studio provider takes a different path: it extends Continue.dev’s OpenAI class and points at LM Studio’s OpenAI-compatible server (http://localhost:1234/v1).

This means:

FIM works differently. For tab autocomplete, Continue.dev calls LM Studio’s /v1/completions endpoint with a suffix parameter — the standard OpenAI-compatible FIM path. This works reliably with Qwen2.5-Coder models (which include FIM training) and DeepSeek-Coder models. It fails silently with models that weren’t trained for FIM, producing generic “complete from where I left off” suggestions rather than true fill-in-the-middle completions.

The model name matters less than you would expect. Unlike Ollama, where Continue.dev queries /api/tags to verify the model exists, LM Studio with a single model loaded answers every request with that model, whatever name the request carries. The model field in config.yaml is still passed in the API call, but it only becomes significant when two models are loaded at once, because the name is then what distinguishes them. This simplifies single-model configuration but means you must manually pre-load the right model before starting your coding session.

Context length is a GUI setting, not an environment variable. Ollama has the OLLAMA_CONTEXT_LENGTH environment variable and the per-model num_ctx Modelfile parameter. In LM Studio, context length is configured at model-load time in the settings panel, and the default (4,096 tokens) is not enough for Continue.dev’s typical request size.

Hardware floor

GPU / VRAM	Recommended model	Notes
RTX 4060 8 GB	Qwen2.5-Coder 7B Q4_K_M	Autocomplete only; chat produces marginal results
RTX 3060 12 GB	Qwen2.5-Coder 14B Q4_K_M	Practical floor for chat + edit; autocomplete on 1.5B separately
RTX 4060 Ti 16 GB	Qwen2.5-Coder 14B Q6_K	Solid daily-driver
RTX 3090 / RTX 4090 24 GB	Qwen2.5-Coder 32B Q4_K_M	Best local tier; Devstral Small 2 Q4_K_M also fits here
Mac M3/M4 unified memory	Use Ollama + MLX instead	LM Studio on Apple Silicon runs but Ollama + MLX is measurably faster

LM Studio runs noticeably slower than Ollama on Apple Silicon because the macOS build still uses llama.cpp’s Metal path while Ollama has better integrated MLX support. If you’re on a Mac, the Continue.dev + Ollama guide will get you better performance. For hardware selection context, runaihome.com’s local AI model by VRAM tier guide covers the landscape in detail.

Step 1: Install LM Studio 0.4.15

Download from lmstudio.ai. The current stable release is 0.4.15 (build 2, released May 29, 2026). It ships as a single executable installer — .exe on Windows, .dmg on macOS, and AppImage/deb on Linux.

On Windows: run the installer. It detects your CUDA version automatically and installs the matching runtime. No manual CUDA path configuration needed.

On Linux:

chmod +x LM-Studio-0.4.15-x86_64.AppImage
./LM-Studio-0.4.15-x86_64.AppImage --no-sandbox

After launch, go to Settings → Developer Mode and toggle it on. In LM Studio 0.4.0 and later, this unlocks the local server controls and the parallel inference settings.

Step 2: Download a coding model

Open the Discover tab and search for qwen2.5-coder. LM Studio shows available GGUF quantizations alongside estimated VRAM usage for each. For a 24 GB card, select Q4_K_M of the 32B variant (approximately 20 GB, leaving headroom for a 32k context window). For 12–16 GB cards, use the 14B at Q4_K_M (approximately 8 GB).

For the separate autocomplete model (recommended — it fires on every keystroke and needs to be fast), also download qwen2.5-coder-1.5b:

# Using the lms CLI that ships with LM Studio 0.4.x
lms get qwen2.5-coder-1.5b-instruct

# Verify download
lms ls
# Expected output: a list of model paths in your LM Studio models directory

The lms CLI is in your PATH after LM Studio installs. If the command isn’t found, open a fresh terminal — the installer adds it during the first launch.

Step 3: Set the context window — before loading, not after

This is where most Continue.dev + LM Studio setups silently break.

LM Studio defaults to a 4,096-token context window for most models. Continue.dev sends significantly more — file context, conversation history, and retrieved snippets combined can easily hit 8,000–16,000 tokens depending on your project size. When Continue.dev sends more than the loaded context window allows, LM Studio truncates the oldest tokens silently. The model never sees the earlier context. Responses look plausible but are based on an incomplete picture.
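To make the truncation concrete, here is a minimal Python sketch of a window that keeps only the most recent tokens. It assumes the oldest tokens are dropped first, as described above; the function names and the integer stand-in tokens are hypothetical, and LM Studio's actual overflow handling may differ in detail.

```python
def visible_tokens(tokens: list[int], n_ctx: int) -> list[int]:
    """Return the tokens that survive when the oldest are dropped to fit n_ctx."""
    if n_ctx <= 0:
        raise ValueError("n_ctx must be positive")
    # Keep the tail of the sequence; everything before it is never seen.
    return tokens[-n_ctx:]


def fraction_seen(request_tokens: int, n_ctx: int) -> float:
    """Fraction of a request that fits in the loaded window, capped at 1.0."""
    if request_tokens <= 0:
        return 1.0
    return min(1.0, n_ctx / request_tokens)


if __name__ == "__main__":
    request = list(range(16_000))          # stand-in for a 16,000-token request
    kept = visible_tokens(request, 4_096)  # the 4,096-token default
    print(f"{len(kept)} tokens kept, starting at original position {kept[0]}")
    print(f"{fraction_seen(len(request), 4_096):.0%} of the request is visible")
```

The point of the sketch is that nothing fails loudly: the call still returns a full window of tokens, just not the ones at the start of the prompt.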

To fix this, set the context length in the model configuration before you click Load:

In the left sidebar, click on the model you want to load
In the right-side configuration panel, find Context Length (labeled n_ctx in some versions)
Set it to at least 16384 — this covers most coding tasks
For large codebases or long agent conversations, set it to 32768 (requires approximately 2–4 GB extra VRAM depending on the model)
Click Load Model

The context length is baked in at load time. If you change it, you must unload and reload the model.
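One way to see why a larger window costs VRAM is to estimate the key/value cache that a transformer keeps for every token in the window. The sketch below uses the standard formula (one key and one value vector per layer and per KV head); the layer count, KV head count, and head dimension are assumed figures for a 14B-class model with grouped-query attention, so check them against the model card before trusting the numbers. It also assumes 16-bit cache values, which a quantized cache would shrink.

```python
GIB = 2**30

# Assumed architecture figures for a 14B-class model; verify against the model card.
ASSUMED_14B = {"n_layers": 48, "n_kv_heads": 8, "head_dim": 128}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for a full context window."""
    # Factor 2: each token stores one key vector and one value vector.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_ctx


if __name__ == "__main__":
    for n_ctx in (4096, 16384, 32768):
        size = kv_cache_bytes(n_ctx=n_ctx, **ASSUMED_14B)
        print(f"n_ctx={n_ctx:>6}: {size / GIB:.2f} GiB")
```

Under these assumptions the cache is 3 GiB at 16384 tokens and 6 GiB at 32768, so doubling the window adds about 3 GiB, in line with the 2–4 GB range quoted above.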

You can verify the context window is set correctly from the lms CLI after loading:

lms status
# Expected output includes: Context Length: 16384 (or whatever you set)

If you see Context Length: 4096 after loading, you changed the setting while the model was already loaded — it won’t apply until you reload.

Step 4: Start the local server

In LM Studio’s Developer panel, click Start Server. The server starts on port 1234 by default. Verify it’s responding:

curl http://localhost:1234/v1/models
# Expected: {"object":"list","data":[{"id":"<your model name>","object":"model",...}]}

This response tells you two things: the server is running, and the model’s API ID. Note the id value from this response — you’ll use it in the config.yaml.
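If you would rather script this check, the following Python sketch pulls the id values out of the same response using only the standard library. The helper names are hypothetical, and the default URL assumes the server is on the default port from this step.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract the id of every model listed in a /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def fetch_model_ids(api_base: str = "http://localhost:1234/v1",
                    timeout: float = 5.0) -> list[str]:
    """Query the running server and return the ids it reports."""
    with urllib.request.urlopen(f"{api_base}/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Calling fetch_model_ids() with the server running should return the same ids the curl command shows; a connection error means the server is not listening on that address.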

If you need a different port (e.g., conflict with another service), change it in the Developer panel before starting. Update the apiBase in your config.yaml to match.

Step 5: Install Continue.dev

VS Code: Open Extensions (Ctrl+Shift+X / Cmd+Shift+X), search Continue, install the extension by Continue Dev, Inc. (v1.3.38 as of this writing). It appears as a sidebar panel.

JetBrains: Open Settings → Plugins → Marketplace, search Continue, install, and restart the IDE. The Continue panel appears in the right sidebar. Both VS Code and JetBrains read from the same ~/.continue/config.yaml, so configuring once covers both.

Step 6: Write the config.yaml

The config file lives at:

macOS/Linux: ~/.continue/config.yaml
Windows: %USERPROFILE%\.continue\config.yaml
Project override: .continue/config.yaml at repo root (layered on top of global config)
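The locations above can be resolved the same way on every platform, since Python's Path.home() maps to %USERPROFILE% on Windows. This is a small sketch with a hypothetical helper name; it only builds the paths and does not read or merge the files.

```python
from pathlib import Path
from typing import Optional


def continue_config_paths(repo_root: Optional[Path] = None) -> list[Path]:
    """Candidate config.yaml locations: global first, then the project override."""
    paths = [Path.home() / ".continue" / "config.yaml"]
    if repo_root is not None:
        paths.append(Path(repo_root) / ".continue" / "config.yaml")
    return paths


if __name__ == "__main__":
    for path in continue_config_paths(Path.cwd()):
        print(f"{path}  exists={path.exists()}")
```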

Here is a complete LM Studio config. With one model loaded at a time, load the 14B in LM Studio for chat and edit, or the 1.5B for autocomplete. If your VRAM allows running two models simultaneously (LM Studio 0.4.x supports this in Developer Mode), load both and Continue.dev will send each role’s requests under the matching model name; in that case the model values below should match the ids that /v1/models reports.

name: LM Studio Local
version: 1.0.0
schema: v1

models:
  # Tab autocomplete — small, fast FIM model
  # Load qwen2.5-coder-1.5b-instruct in LM Studio for this role
  - name: Qwen 1.5B (Autocomplete)
    provider: lmstudio
    model: qwen2.5-coder-1.5b-instruct
    apiBase: http://localhost:1234/v1
    roles:
      - autocomplete
    autocompleteOptions:
      debounceDelay: 300
      maxPromptTokens: 1024
      onlyMyCode: true

  # Chat and edit — 14B or 32B depending on your VRAM
  # Load qwen2.5-coder-14b-instruct or 32b in LM Studio for these roles
  - name: Qwen 14B (Chat + Edit)
    provider: lmstudio
    model: qwen2.5-coder-14b-instruct
    apiBase: http://localhost:1234/v1
    roles:
      - chat
      - edit
      - apply

# config.yaml lists context providers directly under `context`
context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
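Before relying on a config like the one above, it can help to check that every role has a model and that each configured name is actually being served. The sketch below works on a plain dict, assuming you have already parsed config.yaml with a YAML library of your choice; the helper names are hypothetical, and REQUIRED_ROLES covers only the four roles used in this guide, not every role Continue.dev defines.

```python
REQUIRED_ROLES = {"chat", "edit", "apply", "autocomplete"}


def models_by_role(config: dict) -> dict[str, list[str]]:
    """Map each role to the `model` values that claim it."""
    mapping: dict[str, list[str]] = {}
    for entry in config.get("models", []):
        for role in entry.get("roles", []):
            mapping.setdefault(role, []).append(entry["model"])
    return mapping


def check_config(config: dict, loaded_ids: set[str]) -> list[str]:
    """Return readable problems; an empty list means nothing was flagged."""
    problems = []
    by_role = models_by_role(config)
    for role in sorted(REQUIRED_ROLES - set(by_role)):
        problems.append(f"no model has the '{role}' role")
    for role, names in sorted(by_role.items()):
        for name in names:
            if name not in loaded_ids:
                problems.append(f"'{name}' ({role}) is not loaded in LM Studio")
    return problems


if __name__ == "__main__":
    config = {
        "models": [
            {"model": "qwen2.5-coder-1.5b-instruct", "roles": ["autocomplete"]},
            {"model": "qwen2.5-coder-14b-instruct", "roles": ["chat", "edit", "apply"]},
        ]
    }
    # Pretend only the 14B is loaded; the 1.5B should be reported as missing.
    for problem in check_config(config, {"qwen2.5-coder-14b-instruct"}):
        print(problem)
```

The loaded_ids set is meant to come from the id values returned by /v1/models in Step 4.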

If you can only run one model at a time (8–12 GB VRAM), simplify the config to a single model covering all roles. On an 8 GB card, substitute the 7B from the hardware table for the 14B shown here:

models:
  - name: Qwen 14B (All Roles)
    provider: lmstudio
    model: qwen2.5-coder-14b-instruct
    apiBase: http://localhost:1234/v1
    roles:
      - chat
      - edit
      - apply
      - autocomplete

With a single model on autocomplete duty, tab completions will be slower (2–5 seconds on a 14B vs under 500ms on a 1.5B) but they’ll be higher quality.

Save the file. Continue.dev reloads config.yaml automatically — no IDE restart needed.

The FIM autocomplete test

Fill-in-the-middle completions are what makes tab autocomplete actually useful. Without FIM, the model can only continue from the cursor position forward — it doesn’t know what comes after. With FIM, it sees both the prefix (code before cursor) and suffix (code after cursor) and fills in only the missing part.
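As an illustration of what "prefix and suffix" means in practice, here is a small Python sketch that splits a buffer at the cursor and wraps the halves in the FIM markers documented for Qwen2.5-Coder. Other model families use different markers, and this guide does not say which component assembles such a prompt in a Continue.dev + LM Studio setup, so treat it as a picture of the idea, not of the actual request path. The helper names are hypothetical.

```python
# FIM special tokens as published for Qwen2.5-Coder; other models differ.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split a buffer into the text before and after the cursor offset."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor outside the buffer")
    return source[:cursor], source[cursor:]


def fim_prompt(prefix: str, suffix: str) -> str:
    """Ask the model to generate only the text that belongs between the halves."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


if __name__ == "__main__":
    buffer = "def area(r):\n    \n    return result\n"
    cursor = buffer.index("\n    return")  # end of the empty body line
    prefix, suffix = split_at_cursor(buffer, cursor)
    print(fim_prompt(prefix, suffix))
```

A model without FIM training has never seen these markers, which is why it falls back to plain forward continuation.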

To verify FIM is working, open a code file and add a gap in the middle of a function:

def calculate_discount(price: float, pct: float) -> float:
    # cursor here — type nothing, just wait
    return discounted

Position your cursor on a new, empty line between the comment and the return statement. Within 300–500ms you should see a ghost-text suggestion like discounted = price * (1 - pct / 100). If you see a suggestion that tries to continue after the return statement (suggesting new functions or code past the end of the file), FIM is not working: the model is doing standard completion, not fill-in-the-middle.

If FIM completions aren’t appearing at all, check:

Autocomplete is enabled in VS Code: Settings → Continue → enable tab autocomplete
The model is FIM-capable: Qwen2.5-Coder and DeepSeek-Coder models support it; generic instruction-tuned models (Llama 3, Mistral Instruct) typically do not
LM Studio’s /v1/completions endpoint is responding: curl -s http://localhost:1234/v1/completions -H 'Content-Type: application/json' -d '{"model":"test","prompt":"def hello","max_tokens":10}' should return a text completion
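The curl check above only confirms that the endpoint answers. To exercise the suffix path this guide describes, a request body along the following lines can be sent from Python. This is a sketch under assumptions: the field names follow the OpenAI completions format, whether a given LM Studio build honors suffix is not verified here, and the helper names, model name default, and token limit are placeholders.

```python
import json
import urllib.request


def fim_request_body(prefix: str, suffix: str,
                     model: str = "qwen2.5-coder-1.5b-instruct",
                     max_tokens: int = 64) -> dict:
    """Build an OpenAI-style completions body carrying both halves of the buffer."""
    return {
        "model": model,
        "prompt": prefix,    # code before the cursor
        "suffix": suffix,    # code after the cursor
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def post_completion(body: dict, api_base: str = "http://localhost:1234/v1",
                    timeout: float = 30.0) -> str:
    """POST the body to /completions and return the first choice's text."""
    request = urllib.request.Request(
        f"{api_base}/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["text"]
```

With the server running, passing the calculate_discount prefix and suffix to post_completion(fim_request_body(...)) lets you compare the raw completion against what the editor shows as ghost text.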

Troubleshooting

“Loading…” in JetBrains that never resolves: LM Studio’s server is either not started or the port isn’t accessible. Verify: open the Developer panel in LM Studio and confirm the server shows “Running on port 1234.” Then: curl http://localhost:1234/v1/models from your terminal — if this times out, check whether a firewall rule is blocking localhost traffic.

Chat returns nothing or cuts off mid-sentence: Context window too small. Stop the current chat, go to LM Studio → unload the model → raise Context Length in model settings → reload. Confirm with lms status that the context length is what you set.

Tab completions work for forward-continuation but not FIM: The loaded model doesn’t support fill-in-the-middle. Switch to Qwen2.5-Coder or DeepSeek-Coder-V2-Lite. Both are available in LM Studio’s Discover tab and both have FIM training.

Model not responding after switching models in LM Studio GUI: Continue.dev’s connection to the server survives model switches because LM Studio keeps the server running. But there’s a brief window (3–10 seconds) after loading a new model where the API may return 503. If you see errors in Continue’s output panel, wait a moment and retry.

Remote access via LM Link: LM Studio 0.4.15 added end-to-end encrypted remote connections via Tailscale. If you’re serving from a desktop GPU to a laptop, install Tailscale on both machines, enable LM Link in LM Studio’s Developer panel, and replace http://localhost:1234/v1 with your LM Link endpoint URL in apiBase. The rest of the config.yaml is identical.
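For scripts that talk to the server directly, the brief 503 window after a model switch can be absorbed with a simple retry loop. The sketch below is a generic pattern, not something Continue.dev or LM Studio provides; the helper name, attempt count, and delay are assumptions to tune for your machine.

```python
import time
import urllib.error


def call_with_retry(call, attempts: int = 5, delay: float = 2.0, sleep=time.sleep):
    """Run call(), retrying only on HTTP 503 while a model finishes loading."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except urllib.error.HTTPError as err:
            if err.code != 503:
                raise  # any other status is a real error, not a loading window
            last_error = err
        if attempt < attempts - 1:
            sleep(delay)
    raise last_error
```

Five attempts two seconds apart cover the 3–10 second window mentioned above; wrap whatever function performs your request and pass it as call.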

Where this setup hits its ceiling

Continue.dev + LM Studio at 14B gives you solid single-file work: function generation, test writing, inline documentation, quick refactors. The ceiling shows on tasks that require coherent reasoning across multiple files or iterative tool calls — multi-step edit sessions where the model must remember what it changed three steps ago, or codebase-level questions that require reasoning about how a dozen files relate.

At those limits, the honest options are: move up to a larger model on a 24 GB card (Devstral Small 2 Q4_K_M, a 24B model, or Qwen2.5-Coder 32B Q4_K_M), or accept the hybrid setup: local 1.5B for autocomplete, cloud API (Claude Sonnet 4.6 or GPT-4o) for chat and edit in Continue.dev. The Continue.dev review covers that hybrid configuration in detail.

FAQ

Can I use Continue.dev + LM Studio with JetBrains and VS Code at the same time?
Yes. Both IDEs read the same ~/.continue/config.yaml, and LM Studio’s server accepts requests from both. The bottleneck is the GPU: requests from the two IDEs queue against the same loaded model.

Do I need to change the model name in config.yaml when I switch models in LM Studio?
With a single model loaded, technically no: LM Studio answers with the loaded model regardless of the name passed. Update the name anyway. A stale model name in config.yaml causes confusion when reviewing logs or troubleshooting, and once two models are loaded the name is what selects between them.

Does Continue.dev + LM Studio work offline, with no internet?
Yes, once the extension is installed and the model is downloaded. No internet connection is required for inference; requests go only to localhost:1234.

Is the lmstudio provider better or worse than the ollama provider for Continue.dev?
For Windows developers: LM Studio is more reliable due to better CUDA detection. For macOS/Linux: Ollama is faster because it has native API endpoints and better MLX integration on Apple Silicon. Feature-wise they’re equivalent — both support FIM, chat, and edit roles.

What’s the minimum VRAM to run both autocomplete and chat models simultaneously?
For a 1.5B autocomplete model (Q4, ~1 GB) plus a 14B chat model (Q4_K_M, ~8 GB), you need approximately 10–11 GB available. An RTX 3060 12 GB works; an RTX 4060 8 GB does not.
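The arithmetic behind that answer can be written as a quick budget check. The overhead figure is an assumption chosen to match the 10–11 GB total quoted above (context cache plus runtime buffers); real usage depends on the context length you load with.

```python
def fits_in_vram(vram_gb: float, model_sizes_gb: list[float],
                 overhead_gb: float = 1.5) -> bool:
    """True if the summed model weights plus assumed overhead fit in VRAM."""
    return sum(model_sizes_gb) + overhead_gb <= vram_gb


if __name__ == "__main__":
    pair = [1.0, 8.0]  # 1.5B at Q4 and 14B at Q4_K_M, sizes as quoted above
    for card_gb in (8, 12, 16):
        print(f"{card_gb} GB card: fits={fits_in_vram(card_gb, pair)}")
```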

Last updated Jun 5, 2026. LM Studio version and Continue.dev version change frequently — verify current releases before setup.

Recommended Gear

RTX 4060 (8 GB VRAM) — entry tier for autocomplete-only local setup
RTX 3060 12 GB — practical floor for chat + edit on 14B Q4
RTX 4060 Ti 16 GB — solid daily-driver tier for local coding
RTX 3090 (24 GB) — runs 32B Q4, dual-model autocomplete + chat
RTX 4090 (24 GB) — fastest single-card option for 32B local inference
