Aider + LM Studio 2026: setup guide, the output-token ceiling that truncates diffs, and which models actually hold up

TL;DR: Aider v0.86.0 connects to LM Studio 0.4.15’s local server with two environment variables and a model prefix, but there is one silent failure mode that will truncate your code diffs mid-function with no error message. This guide sets everything up correctly, fixes that problem before it bites you, and tells you which 2026 coding models are worth loading.

What you’ll be able to do after this guide:

  • Serve a quantized coding model from LM Studio 0.4.15 on http://localhost:1234/v1
  • Run Aider against it using the correct model prefix and environment variables
  • Cap Aider’s requested output in .aider.model.settings.yml so responses stay under LM Studio’s output-token ceiling and diffs don’t get cut in half
  • Pick a model matched to your actual VRAM

Honest take: If you already use Ollama and it’s working, stay on Ollama — the Aider + Ollama guide covers that path and Ollama’s simpler on Apple Silicon and Linux. LM Studio earns the swap on two specific setups: Windows machines where the polished CUDA auto-detection saves you twenty minutes of troubleshooting, and developers with a home desktop GPU they want to serve to a lightweight laptop via LM Link.


Why LM Studio over Ollama for Aider

The Aider + Ollama guide documents the Ollama path. LM Studio earns its own article for specific cases.

Windows-first hardware. Ollama on Windows has improved in 2026 but still needs occasional CUDA path fiddling. LM Studio’s Windows installer auto-detects your CUDA version and runtime; most developers are up and serving in under five minutes.

GUI model browser with VRAM estimates. Searching for a model in LM Studio’s Discover tab shows GGUF quantization options, the approximate VRAM footprint for each, and whether your GPU can load it. For someone new to quantization levels, seeing “Q4_K_M — 20.1 GB” next to a model removes a lot of guesswork.

Parallel inference since 0.4.0. LM Studio 0.4.0 (February 2026) introduced the llmster daemon for concurrent request processing. Aider can issue requests in quick succession (sending file contents, asking for edits, retrying after a failed apply), and the queuing behavior in older LM Studio versions created visible stalls between steps. With parallel inference on, the gaps shrink.

LM Link for remote GPUs. LM Studio 0.4.15 (May 29, 2026) added end-to-end encrypted remote connections via Tailscale. If you code on a MacBook Pro but have a desktop with a 24 GB GPU at home, you can serve models from the desktop and point Aider at the LM Link address without changing any other configuration.

The tradeoff is real: LM Studio is a several-hundred-MB GUI application. Ollama is a single binary. On a headless server or Apple Silicon Mac, Ollama runs leaner and faster.


Hardware floor

Aider generates complete code edits as diffs, then applies them to your files and auto-commits. That workflow requires the model to hold context across multiple files, understand the existing code structure, and produce syntactically correct diffs. 7B models fail this regularly.

Hardware | Recommended model | Notes
RTX 4060 8 GB | Qwen3-8B Q4_K_M | ~5 GB VRAM; single-file edits only, multi-file agentic tasks unreliable
RTX 3060 12 GB | Qwen2.5-Coder-14B Q4_K_M | Minimum practical floor; handles most everyday Aider work
RTX 4060 Ti 16 GB | Qwen3.6-27B Q3_K_M | Good daily driver; slower than 14B but noticeably more coherent on complex refactors
RTX 3090 / RTX 4090 24 GB | Devstral Small 2 Q4_K_M | 24B params, 68% SWE-bench Verified; best local option for agentic coding in 2026
Mac M3/M4 unified | Use Ollama + MLX | LM Studio on Apple Silicon exists but Ollama + MLX runs faster for most coding models

Devstral Small 2 (released May 2026 by Mistral AI) is the current ceiling for what runs on consumer hardware — 24B parameters, 68% on SWE-bench Verified, fits on a 24 GB card at Q4_K_M with room for a 32k context window. The RTX 4090 is the practical target GPU for that model. For hardware context, runaihome.com’s local AI model by VRAM tier guide covers the tradeoffs in detail.
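
The VRAM figures in the table follow from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below is a rough estimator, not LM Studio's own calculation. The bits-per-weight values are approximate assumptions for common GGUF quantizations, the function names are hypothetical, and the headroom allowance stands in for the KV cache and runtime buffers, which grow with context length.

```python
# Rough GGUF weight-size estimate: params * bits_per_weight / 8.
# The bits-per-weight figures are approximations (assumed, not taken from LM Studio).
APPROX_BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.85,
    "Q8_0": 8.5,
}


def estimate_weights_gb(params_billions: float, quant: str) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8


def fits(params_billions: float, quant: str, vram_gb: float,
         headroom_gb: float = 2.0) -> bool:
    """True if weights plus a headroom allowance (KV cache, buffers) fit in VRAM."""
    return estimate_weights_gb(params_billions, quant) + headroom_gb <= vram_gb


# Rows mirror the hardware table; parameter counts are rounded.
for name, params, quant, vram in [
    ("Qwen3-8B", 8, "Q4_K_M", 8),
    ("Qwen2.5-Coder-14B", 14, "Q4_K_M", 12),
    ("Qwen3.6-27B", 27, "Q3_K_M", 16),
    ("Devstral Small 2", 24, "Q4_K_M", 24),
]:
    size = estimate_weights_gb(params, quant)
    print(f"{name} {quant}: ~{size:.1f} GB weights, fits {vram} GB card: {fits(params, quant, vram)}")
```

Treat the result as a first filter only; the estimate shown in LM Studio's Discover tab is the number to trust for a specific GGUF file.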


Step 1: Install LM Studio 0.4.15

Download from lmstudio.ai. The current stable release is 0.4.15 (build 2, May 29, 2026). The installer comes as a .exe (Windows), .dmg (macOS), or AppImage/deb (Linux).

On Windows, run the installer — it handles CUDA detection automatically. On Linux:

chmod +x LM-Studio-0.4.15-x86_64.AppImage
./LM-Studio-0.4.15-x86_64.AppImage --no-sandbox

The --no-sandbox flag is needed on some distributions; try without it first.

LM Studio is free for both personal and commercial use as of 2025 (the policy change was announced on the LM Studio blog — no license form, no paid tier required).

Once installed, go to Settings → Developer Mode and enable it. LM Studio 0.4.0 merged the old Developer and Power User panels; Developer Mode unlocks the server tab and parallel inference settings you’ll need next.


Step 2: Download a coding model

Open the Discover tab and search for your model. For a 24 GB card, search devstral-small-2 and select the Q4_K_M GGUF. LM Studio shows the estimated VRAM next to each quantization option.

Alternatively, from the lms CLI that ships with LM Studio 0.4:

# List available models matching a name
lms search devstral

# Download the Q4_K_M quantization interactively
lms get "bartowski/Devstral-Small-2-2506-GGUF"

# Verify it downloaded
lms ls

After download, LM Studio shows the model in your local models list.


Step 3: Start the local server

In LM Studio, go to the Developer tab (visible once Developer Mode is on). Select your loaded model from the dropdown and click Start Server. The default port is 1234.

For parallel inference, open Advanced Server Settings and set Concurrent Request Limit to 4 or more. With a single Aider session, this isn’t critical, but it prevents request queuing if you run multiple terminals or additional tooling simultaneously.

Verify the server is running:

curl http://localhost:1234/v1/models

The response lists every loaded model with its exact ID string — you’ll need that string in the next step.


Step 4: Connect Aider

Install Aider if you haven’t:

pip install aider-chat

The current version is v0.86.0. Set two environment variables before running Aider:

export OPENAI_API_BASE=http://localhost:1234/v1
export OPENAI_API_KEY=lm-studio   # any non-empty string; LM Studio ignores the value on localhost

On Windows (PowerShell):

$env:OPENAI_API_BASE = "http://localhost:1234/v1"
$env:OPENAI_API_KEY  = "lm-studio"
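
A missing /v1 suffix or an empty key is easy to overlook, so a quick sanity check before launching Aider can save a confusing error. The following is a minimal sketch under stated assumptions: check_aider_env is a hypothetical helper, not part of Aider or LM Studio, and it only inspects the two variables described above.

```python
import os
from urllib.parse import urlparse


def check_aider_env(env: dict) -> list:
    """Return human-readable problems; an empty list means the variables look sane."""
    problems = []
    base = env.get("OPENAI_API_BASE", "")
    key = env.get("OPENAI_API_KEY", "")
    if not base:
        problems.append("OPENAI_API_BASE is not set")
    else:
        parsed = urlparse(base)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("OPENAI_API_BASE is not a valid http(s) URL")
        if not parsed.path.rstrip("/").endswith("/v1"):
            problems.append("OPENAI_API_BASE should end with /v1")
    if not key:
        problems.append("OPENAI_API_KEY is empty; any placeholder string works")
    return problems


# Check the current shell environment and print one warning per problem found.
for problem in check_aider_env(dict(os.environ)):
    print("warning:", problem)
```

Run it in the same shell session where you exported the variables; no output means both are set in the expected shape.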

Then run Aider with the openai/ prefix followed by the exact model ID reported by /v1/models, for example:

aider --model "openai/bartowski/Devstral-Small-2-2506-GGUF/devstral-small-2-2506-Q4_K_M.gguf"

You should see Aider’s startup banner and a prompt. If you get “model not found,” skip to the model-ID trap section below.


The model-ID trap

This is the first place most setups break. LM Studio generates model IDs that include the full GGUF path — organization, repository name, and quantization suffix:

bartowski/Devstral-Small-2-2506-GGUF/devstral-small-2-2506-Q4_K_M.gguf

The exact string varies by how the model was downloaded and which community packaged it. Get the right string from the live server before constructing your Aider command:

curl -s http://localhost:1234/v1/models | python3 -m json.tool | grep '"id"'

Output looks like:

"id": "bartowski/Devstral-Small-2-2506-GGUF/devstral-small-2-2506-Q4_K_M.gguf",

Copy that exact string and use it as your model name after openai/:

aider --model "openai/bartowski/Devstral-Small-2-2506-GGUF/devstral-small-2-2506-Q4_K_M.gguf"
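
The curl-and-grep step can also be scripted so the command is built from whatever the server reports. This is a sketch that assumes the server returns the standard OpenAI-style model list (a "data" array of objects with an "id" field); the helper names are hypothetical.

```python
import json
import shlex
import urllib.request


def extract_model_ids(payload: dict) -> list:
    """Pull the "id" field out of each entry in an OpenAI-style model list."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def aider_command(model_id: str) -> str:
    """Build a shell-safe aider invocation using the openai/ prefix."""
    return "aider --model " + shlex.quote("openai/" + model_id)


def fetch_model_ids(base_url: str = "http://localhost:1234/v1") -> list:
    """Query the running server; raises URLError if nothing is listening."""
    with urllib.request.urlopen(base_url + "/models", timeout=5) as response:
        return extract_model_ids(json.load(response))
```

With the server running, `for model_id in fetch_model_ids(): print(aider_command(model_id))` prints one ready-to-paste command per loaded model.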

To avoid typing this every session, create .aider.conf.yml in your home directory or project root:

model: openai/bartowski/Devstral-Small-2-2506-GGUF/devstral-small-2-2506-Q4_K_M.gguf


The output-token ceiling — and why it breaks Aider specifically

This is the non-obvious failure mode that distinguishes Aider+LM Studio from a simple chat app.

LM Studio’s local server silently caps output token generation at approximately 9,500–16,000 tokens, regardless of what max_tokens value the client requests. The cap is undocumented and produces no error — the response simply stops mid-stream when it hits the ceiling.

For a chat interface, this rarely matters. Most answers are well under 10k tokens.

For Aider, it’s a different situation. Aider generates code changes as structured diffs and sometimes needs to output complete files or long replacement blocks. When a diff gets cut at 10k tokens, the result is a partial function body that Aider tries to apply to your file — which either causes a parse error or silently inserts broken code. You won’t see a “max_tokens exceeded” message; you’ll see a diff that ends mid-function.
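
Because the cutoff produces no error, it helps to have a mechanical way to spot it. The sketch below shows two heuristics. The first assumes Aider's search/replace edit format, whose blocks open with a `<<<<<<< SEARCH` line and close with a `>>>>>>> REPLACE` line; a reply with unbalanced markers or an unclosed code fence probably stopped early. The second checks the finish_reason field of an OpenAI-style response; whether LM Studio reports "length" when its own ceiling is hit is an assumption to verify on your setup. Both function names are hypothetical.

```python
def looks_truncated(reply: str) -> bool:
    """Heuristic: True if the reply ends inside an edit block or a code fence."""
    lines = reply.splitlines()
    opened = sum(1 for line in lines if line.startswith("<<<<<<< SEARCH"))
    closed = sum(1 for line in lines if line.startswith(">>>>>>> REPLACE"))
    fence = "`" * 3  # built at runtime so this example nests cleanly in a fenced block
    fences = sum(1 for line in lines if line.lstrip().startswith(fence))
    return opened != closed or fences % 2 == 1


def hit_length_limit(response: dict) -> bool:
    """True if an OpenAI-style chat completion reports finish_reason == "length"."""
    choices = response.get("choices") or [{}]
    return choices[0].get("finish_reason") == "length"
```

Neither check is proof on its own; a reply that passes both can still be wrong, but a reply that fails either one should not be applied to your files.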

The fix is to tell Aider to expect shorter outputs per request and stay within the ceiling. Create .aider.model.settings.yml in your home directory or project root:

- name: openai/bartowski/Devstral-Small-2-2506-GGUF/devstral-small-2-2506-Q4_K_M.gguf
  max_tokens: 8192
  context_window: 32768

max_tokens: 8192 tells Aider to request outputs that stay below LM Studio’s ceiling. context_window: 32768 tells Aider how much input context is available — Aider uses this to decide how much of your codebase to include in each prompt. Without this setting, Aider may assume a much smaller or larger window and either over-stuff or under-utilize the context.
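
The two numbers interact: with a 32,768-token window and 8,192 tokens reserved for output, roughly 24,576 tokens remain for the prompt, repo map, and file contents. The sketch below works through that arithmetic, assuming the context window is shared between input and output (as it is for llama.cpp-style runtimes). The four-characters-per-token ratio is a crude assumption, not a real tokenizer, and the function names are hypothetical.

```python
def input_budget(context_window: int, max_output: int) -> int:
    """Tokens left for the prompt once the output reservation is subtracted."""
    return context_window - max_output


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; the 4-characters-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def files_fit(file_texts: list, context_window: int = 32768,
              max_output: int = 8192) -> bool:
    """True if the estimated size of all files stays within the input budget."""
    total = sum(estimate_tokens(text) for text in file_texts)
    return total <= input_budget(context_window, max_output)


print("input budget:", input_budget(32768, 8192), "tokens")
```

If files_fit returns False for the files you plan to add, trim the set before starting the session rather than relying on the model to cope with an overfull prompt.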

To confirm the settings loaded:

aider --model "openai/bartowski/Devstral-Small-2-2506-GGUF/devstral-small-2-2506-Q4_K_M.gguf" --verbose

With --verbose, Aider prints the model metadata and settings it resolved at startup. Look for your max_tokens and context-window values in that output to confirm the YAML was picked up.
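
If the flat keys shown above do not appear in the resolved settings, your Aider version may expect a different layout. Aider's advanced model settings documentation describes request parameters under an extra_params key in .aider.model.settings.yml, and token limits in a separate .aider.model.metadata.json file. The sketch below renders both files in that layout; the key names follow that documentation as understood here and should be verified against your installed version. The helper names are hypothetical.

```python
import json

MODEL = "openai/bartowski/Devstral-Small-2-2506-GGUF/devstral-small-2-2506-Q4_K_M.gguf"


def render_settings_yaml(model: str, max_output: int) -> str:
    """YAML for .aider.model.settings.yml with the cap passed as a request parameter."""
    return (
        f"- name: {model}\n"
        f"  extra_params:\n"
        f"    max_tokens: {max_output}\n"
    )


def render_metadata_json(model: str, context_window: int, max_output: int) -> str:
    """JSON for .aider.model.metadata.json describing the model's token limits."""
    entry = {
        "max_tokens": max_output,
        "max_input_tokens": context_window,
        "max_output_tokens": max_output,
        "litellm_provider": "openai",
        "mode": "chat",
    }
    return json.dumps({model: entry}, indent=2) + "\n"


# Print both files so they can be reviewed before saving them by hand.
print(render_settings_yaml(MODEL, 8192))
print(render_metadata_json(MODEL, 32768, 8192))
```

Save the first output as .aider.model.settings.yml and the second as .aider.model.metadata.json in your home directory or project root, then re-run the confirmation step.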


Headless use with the lms CLI

If you’re running LM Studio on a remote machine without a desktop, the lms CLI (shipped with LM Studio 0.4) handles server management:

# Start the server on the default port
lms server start

# Load a specific model
lms load "bartowski/Devstral-Small-2-2506-GGUF" --gpu max

# Check what's running
lms ps

# Stop the server
lms server stop

--gpu max offloads as many layers as possible to the GPU. On a machine with no GPU, drop the flag and LM Studio runs CPU inference — slower but functional for lighter models.

Once the server is running via CLI, your Aider setup is identical. Point OPENAI_API_BASE at the remote machine’s IP instead of localhost:

export OPENAI_API_BASE=http://192.168.1.50:1234/v1
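
Before pointing Aider at a remote address, it is worth confirming that the port is reachable from the laptop at all. The sketch below is a plain TCP check using only the standard library. It assumes the remote LM Studio server has been configured to accept connections from the network rather than from localhost only; the helper names are hypothetical and the IP address is the example one from above.

```python
import socket


def port_open(host: str, port: int = 1234, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def api_base(host: str, port: int = 1234) -> str:
    """Build the value for OPENAI_API_BASE from a host and port."""
    return f"http://{host}:{port}/v1"
```

For example, `port_open("192.168.1.50")` returning False points at a firewall or bind-address problem on the desktop, not at Aider.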

LM Studio 0.4.15 added LM Link, an end-to-end encrypted tunnel built on Tailscale that lets you connect to a remote LM Studio instance. The workflow: run LM Studio on your GPU desktop, enable LM Link in Settings, and from your laptop point Aider at the LM Link address instead of localhost. The model serving, VRAM, and inference all stay on the desktop.

This is particularly useful for developers with a high-end home desktop but a lighter travel machine. The added latency is whatever the Tailscale connection contributes, typically under 10 ms on a home network, which is small next to token generation time, so it feels effectively local for most coding tasks.

Setup is in LM Studio Settings → LM Link → Enable, then copy the address shown and substitute it for localhost:1234 in OPENAI_API_BASE.


When to switch back to Ollama

LM Studio isn’t the right choice for every setup:

  • Apple Silicon Mac: Ollama with MLX backend runs faster for most coding models. LM Studio supports Apple Silicon but the performance gap is noticeable on M3/M4 chips.
  • Headless Linux server without display: The lms CLI works but the installation still requires an AppImage and some workarounds. Ollama is a single binary that installs and runs cleanly.
  • Simpler model ID management: Ollama model names are clean (qwen2.5-coder:14b). LM Studio model IDs include full GGUF paths and need the curl-to-grep step above.

The Aider + Ollama guide covers the Ollama path. The two setups share the same Aider concepts — only the env vars and model prefix differ.


FAQ

Does LM Studio cost anything? No. LM Studio is free for personal and commercial use as of 2025. There is no paid tier for the local server functionality.

What API key should I pass? Any non-empty string. LM Studio’s local server ignores the API key value on localhost; the field just needs to be present to satisfy the OpenAI SDK constructor. "lm-studio" is the conventional placeholder.

Why does Aider print “Unknown model” warnings? Because Aider doesn’t have built-in pricing or context window metadata for custom local models accessed via the openai/ prefix. The .aider.model.settings.yml file suppresses this by providing the metadata explicitly. The --no-show-model-warnings flag suppresses the warning without fixing the underlying gap — use the YAML instead.

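Can I use Devstral Small 2 for the architect/editor split mode in Aider? Yes. Aider supports --architect mode, where one model reasons about the changes and a second, usually smaller, model applies them. Because both models use the openai/ prefix, they share the single OPENAI_API_BASE, so the simplest arrangement is to load both models in the same LM Studio server (VRAM permitting) rather than running two instances on different ports, and pass each model’s ID: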

aider --architect --model openai/<big-model> --editor-model openai/<small-model>

My diffs are getting cut mid-function — what’s wrong? This is the LM Studio output ceiling. Set max_tokens: 8192 in .aider.model.settings.yml for your model. If edits still truncate at that setting, reduce the file sizes you’re editing in a single Aider prompt by using /add to include only the relevant files.


Last updated Jun 04, 2026. LM Studio releases frequently; verify the current version at lmstudio.ai before following installation steps.
