Jun 6, 2026

Cursor + Ollama and LM Studio in 2026: use local models for Chat and Cmd+K — and keep tab completion honest

By AICoderScope Team · 11 min read

cursorollamalm-studiolocal-llmsetup-guideqwenprivacycost

TL;DR: You can route Cursor’s Chat panel and Cmd+K through a local model running on your own machine — zero API spend for those features. The CORS header and the correct base URL (http://localhost:11434/v1 for Ollama, http://localhost:1234/v1 for LM Studio) are all you need. The hard limit: Cursor’s Tab autocomplete stays cloud-only regardless of what you configure. If Tab is your primary use case, this setup won’t help.

	Ollama path	LM Studio path	Neither (stay cloud)
Best for	macOS, Linux, Apple Silicon	Windows with CUDA GPU	Heavy Tab users
Cost to run	Free (hardware only)	Free (hardware only)	$20–$200/mo (Pro tier)
Tab completion	❌ Still cloud-only	❌ Still cloud-only	✅ Unlimited on Pro+
The catch	OLLAMA_ORIGINS env var required	GUI-only model loading	Credit pool burns fast with Claude Sonnet

Honest take: If you spend more than an hour a day in Cursor’s Chat panel asking architectural questions, explaining code, or running long Cmd+K rewrites, switching those requests to a local Qwen2.5-Coder-32B drops that API spend to zero. If 80% of your Cursor value comes from Tab autocomplete, local models add nothing.

What actually happens when Cursor talks to a local model

Cursor’s AI features split into two architecturally different systems:

Tab autocomplete runs through Cursor’s proprietary server-side model — a small, fast transformer trained for fill-in-the-middle (FIM) completions. This is not OpenAI, not Claude. Cursor controls it, it runs on Cursor’s infrastructure, and you cannot swap it out. The Override Base URL setting in Cursor’s model panel has no effect on Tab.

Chat, Cmd+K, and Agent mode use the OpenAI API format and are called from the Cursor client running on your local machine. When you override the base URL, Cursor sends chat requests directly from your VS Code process to whatever endpoint you’ve configured — Ollama on localhost:11434, LM Studio on localhost:1234, or a remote server. The model credit pool from your Cursor Pro subscription is not consumed for these calls.

This architecture is why local model substitution is meaningful but partial.

Hardware floor and model selection

Chat and Cmd+K are less latency-sensitive than Tab autocomplete — you typically wait for a full response. A 14B model on a mid-range GPU is usable; a 7B model can handle single-function questions but starts to drift on larger refactoring prompts.

GPU / VRAM	Recommended model	Ollama pull command
8 GB (RTX 4060 / 8 GB Apple M)	`qwen2.5-coder:7b`	`ollama pull qwen2.5-coder:7b`
12 GB (RTX 3060 12 GB)	`qwen2.5-coder:14b`	`ollama pull qwen2.5-coder:14b`
16 GB (RTX 4060 Ti 16 GB)	`qwen2.5-coder:14b` or `devstral:24b-small` (Q4)	`ollama pull qwen2.5-coder:14b`
24 GB (RTX 3090 / RTX 4090)	`qwen2.5-coder:32b`	`ollama pull qwen2.5-coder:32b`
Apple M3/M4 Max (36–128 GB unified)	`qwen2.5-coder:32b`	`ollama pull qwen2.5-coder:32b`

Qwen2.5-Coder-32B scores 92.7% on HumanEval and 73.7 on Aider’s pass-rate benchmark — within a few points of GPT-4o. On a 24 GB GPU it fits comfortably at Q4_K_M quantization (~19 GB loaded). For hardware advice on building a local AI rig, runaihome.com’s local AI model by VRAM tier covers the full GPU comparison.

The 7B tier is workable for explaining snippets and one-shot Cmd+K edits. For anything that requires tracking a refactoring plan across multiple files, 14B is the practical minimum.

Path 1: Cursor + Ollama

Step 1 — Install Ollama and set the CORS header

Download and install Ollama from ollama.com. On macOS and Linux, Ollama runs as a background service after installation. On Windows, it installs as a system tray application.

Before pulling any model, set the OLLAMA_ORIGINS environment variable. This is the step most guides skip and the reason Cursor throws a CORS error on the first request.

macOS / Linux (add to ~/.zshrc or ~/.bashrc):

export OLLAMA_ORIGINS="*"

Windows (run in a terminal, then restart Ollama):

setx OLLAMA_ORIGINS "*"

After setting the variable, restart the Ollama service so it picks up the change:

# macOS / Linux
pkill ollama
ollama serve

# Or restart via the macOS menu bar icon

Step 2 — Pull a model

ollama pull qwen2.5-coder:7b

Expected output:

pulling manifest
pulling 966de95ca8a6... 100% ▕████████████████████████████████▏ 4.7 GB
pulling 66b9ea09bd5b... 100% ▕████████████████████████████████▏  68 B
pulling e7fed4a1ded7... 100% ▕████████████████████████████████▏  4.8 KB
verifying sha256 digest
writing manifest
success

Verify the model is loaded and the API is live:

curl http://localhost:11434/v1/models

You should see a JSON response listing qwen2.5-coder:7b. If you get a connection refused error, Ollama isn’t running — launch it with ollama serve.

Step 3 — Configure Cursor

Open Cursor and press Cmd+, (macOS) or Ctrl+, (Windows/Linux) to open settings.
Click Cursor Settings (not VS Code Settings) in the top-right or via the gear icon.
Navigate to the Models tab.
Scroll to the OpenAI API Key section.
Toggle on Override OpenAI Base URL.
Enter: http://localhost:11434/v1
In the API Key field, enter any non-empty string — Ollama doesn’t validate keys, but Cursor requires the field to be non-empty. ollama works fine.
Click Add Model and type the exact model name as it appears in Ollama: qwen2.5-coder:7b
In the model list, deselect all other models — leave only your local model checked. This prevents the “does not work with your current plan” error that appears when Cursor tries to route a request to a premium model.

Now open a file, press Cmd+L to open the chat panel, and type a question. The response comes from your local Ollama instance.

If localhost doesn’t connect

A small percentage of users — mostly those behind corporate firewalls or VPNs — find that localhost:11434 doesn’t resolve correctly from Cursor. The symptom is a timeout or “network error” in the chat panel despite Ollama running fine. Fix: use the loopback IP explicitly instead of the hostname:

Change the base URL to: http://127.0.0.1:11434/v1

If that also fails, the request is being intercepted by a network proxy. The workaround is to expose Ollama through ngrok:

ngrok http 11434 --host-header="localhost:11434"

ngrok prints a public HTTPS URL like https://abc123.ngrok-free.app. Use that as your Cursor base URL: https://abc123.ngrok-free.app/v1. Note that the free ngrok tier generates a new URL on every restart, so you’d need to update Cursor’s settings each time.

Path 2: Cursor + LM Studio

LM Studio (stable release: 0.4.15, May 29, 2026) is the better choice on Windows — it has a GUI model browser, automatic CUDA detection, and a one-click server start. LM Studio’s GGUF library includes all major coding models.

Step 1 — Install LM Studio and load a model

Download from lmstudio.ai. Run the installer; it auto-detects your CUDA version and driver.

Inside LM Studio:

Click Discover (the search icon) in the left sidebar.
Search for qwen2.5-coder.
Choose the variant matching your VRAM — Q4_K_M for 8 GB and 12 GB cards, Q6_K for 16 GB and up.
Click Download.

Step 2 — Start the local server

Click Developer in the left sidebar (the </> icon).
Select your downloaded model from the dropdown.
Click Start Server.

LM Studio’s server starts on localhost:1234 by default. You can confirm it’s running by checking the status indicator — it turns green and shows “Server running on port 1234.”

Verify via terminal:

curl http://localhost:1234/v1/models

Expected output includes the model identifier you loaded, which looks like:

{"data":[{"id":"qwen2.5-coder-7b-instruct","object":"model",...}]}

Copy the id value exactly — you’ll need it in Cursor.

Step 3 — Configure Cursor

Same flow as the Ollama path:

Cursor Settings → Models → OpenAI section.
Override Base URL: http://localhost:1234/v1
API Key: lm-studio (any non-empty value)
Add Model: paste the id you copied from LM Studio’s API response (e.g., qwen2.5-coder-7b-instruct)
Deselect all other models.

LM Studio does not require the OLLAMA_ORIGINS workaround — its server handles cross-origin requests by default.

The “agent and edit rely on custom models that cannot be billed” error

This is the most common error developers hit. It appears when you switch to a local model in Cursor and try to use Agent mode (the full agentic loop, not just Chat). Cursor’s Agent mode routes through a different code path than Chat and has additional checks against the model list.

Fix: After adding your local model, go back to Cursor Settings → Models. Make sure the only model checked in the entire list is your local one. If any premium model (Claude Sonnet, GPT-4o, etc.) is still selected alongside your local model, Cursor’s agent planner picks whichever it prefers for subtasks — and that selection may hit a billing gate.

Also check the Cursor Tab section in settings — there is no override for Tab. Leave Tab settings untouched; they always use Cursor’s cloud model regardless.

Does this actually save money?

Cursor Pro at $20/month includes a credit pool that covers roughly 500 Claude Sonnet requests, 2,000 GPT-4o Mini requests, or some mix. In practice, heavy Chat users who ask multi-paragraph questions burn through that pool in 2–3 weeks, then pay overage.

If you route Chat and Cmd+K through a local Qwen2.5-Coder-32B:

Those requests consume zero credits
Your Pro subscription credit pool goes entirely to Tab autocomplete (which actually uses Cursor’s proprietary small model, not Claude — so your credit pool is now almost entirely preserved for the agent tasks you can’t run locally)
Total AI spend: $20/month flat, regardless of usage volume

The cost argument works best for developers who use Chat heavily for explanation, refactoring planning, and documentation — where a capable local model is fully competitive with cloud models. It works less well for agentic tasks involving dozens of sequential tool calls, where Claude Sonnet’s reasoning reliability is still meaningfully ahead of local 32B models.

For a deeper comparison of cloud vs. local coding costs, see the AI coding speed: cloud API vs local LLM latency breakdown.

Quick comparison: Ollama vs LM Studio for this use case

	Ollama 0.6.x	LM Studio 0.4.15
Platform	macOS, Linux, Windows	macOS, Windows, Linux (AppImage)
Apple Silicon support	Native Metal acceleration	Native MLX since 0.3.x
CUDA detection	Manual	Automatic
Model browser	CLI + ollama.com	Built-in GUI
Server start	`ollama serve`	One-click in GUI
CORS fix required	Yes (`OLLAMA_ORIGINS="*"`)	No
LM Link (remote GPU)	No	Yes (0.4.15+)
Cursor context window	Set by model	Set in LM Studio’s server settings

For Linux and Apple Silicon, Ollama’s single binary and native Metal/ROCm support make it the cleaner option. On Windows with a CUDA GPU, LM Studio’s auto-detection and GUI model browser remove significant friction.

FAQ

Does Cursor know my prompts when I route them locally? When Chat uses the local base URL, the request goes directly from the Cursor client process on your machine to your local server. Cursor’s servers are not in the path for those requests. However, Cursor’s Tab autocomplete and any telemetry Cursor collects (feature usage, not prompt content per their privacy policy) still go to Cursor’s infrastructure.

Can I use a remote Ollama server (not localhost)? Yes. If Ollama runs on a separate machine in your LAN, point the base URL to that machine’s IP: http://192.168.1.100:11434/v1. Make sure Ollama is configured to bind on all interfaces (OLLAMA_HOST=0.0.0.0) on the server machine.

Will my context window be the same as with Claude? No. Qwen2.5-Coder supports up to 128K tokens, but in practice Ollama limits context via the num_ctx parameter (default is often 4096 unless you change it). Set a higher context in your Ollama model configuration: ollama run qwen2.5-coder:7b --ctx 32768. In LM Studio, the context window slider is in the model settings panel before you start the server.

Does this work with Cursor’s free Hobby tier? Yes. The Hobby tier has limited premium model requests, but the local model override bypasses that limit entirely for Chat and Cmd+K. You still don’t get Tab autocomplete on Hobby.

Can I switch between local and cloud models mid-session? Yes — go to the model selector in the Chat panel (the model name shown at the top of the chat window) and switch per-conversation. You don’t need to change the settings permanently.

Sources

Last updated June 6, 2026. Pricing and features change frequently; verify current state before purchasing.

Was this article helpful?