RAM Required for Cursor + Local Inference: The Actual Numbers


The most common failure mode when combining Cursor with a local model is not a configuration error — it is simply running out of RAM. The symptom is unmistakable: Cursor’s autocomplete starts lagging seconds behind your keystrokes, Ollama logs fill up with context-length warnings, and your system monitor shows disk thrashing even though you have a decent machine.

This guide gives concrete RAM requirements for running Cursor alongside Ollama at each model tier, as specific numbers rather than ranges so wide they are useless. If you are still choosing hardware, start with the table under “RAM by Model Tier”; if you already have a machine that is struggling, skip to “The Context Window Multiplier” and “What Happens When You Run Out of RAM.”

For GPU and VRAM selection, the companion hardware tiers article covers component picks at $500, $1,500, and $3,000 budgets.


How RAM Gets Split Between Cursor and Ollama

Before looking at model-specific numbers, you need to understand what is actually competing for your system RAM.

Cursor’s memory footprint

Cursor is an Electron application built on the same Chromium base as VS Code. On a small project (under 200 files), Cursor and its extension host processes typically consume 500MB–900MB at idle. A medium project with TypeScript, ESLint, Prettier, and a few language servers running simultaneously raises that to 1.2–1.8GB. On large monorepos — tens of thousands of files, multiple language servers, open Git diffs — reported real-world usage runs 2–4GB, and the Cursor forums document pathological cases where multiple extension-host processes push past 8GB on projects with heavy AI chat history.

The practical floor for a productive Cursor session is 1GB. Budget 2GB if your project has more than 10,000 files or if you run Docker containers alongside it.

Ollama’s RAM usage (it is not just VRAM)

Ollama is often described as a “VRAM problem,” and VRAM is indeed the hard wall for inference speed. But Ollama also consumes system RAM for several reasons:

  1. KV cache spill. The key-value cache that stores past token activations lives in VRAM by default. When VRAM is tight — particularly with large context windows — Ollama spills KV cache entries into system RAM and eventually to disk.
  2. Model loading overhead. Ollama holds a small CPU-side buffer regardless of GPU offload percentage.
  3. Context window expansion. Each additional 4,000 context tokens adds roughly 0.3–0.5GB of KV cache for 7B models, and the cost grows with model size. At a 32K context, the KV cache adds several gigabytes on top of the model weights.

The takeaway: even on a machine where the model fits entirely in VRAM, a large context window will start chewing into system RAM for cache.
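You can sanity-check the KV cache figures with a back-of-the-envelope calculation. The sketch below assumes an unquantized 16-bit cache and uses the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) as its default; the function name and defaults are illustrative. Real usage shifts with KV cache quantization, flash attention, and runtime overhead, so treat the output as an estimate rather than a measurement.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Estimate the KV cache size for a single request.

    Every token stores one key vector and one value vector per layer,
    hence the factor of 2. The defaults describe Llama 3.1 8B with a
    16-bit cache (an assumption for illustration, not an Ollama report).
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


for ctx in (4096, 8192, 16384, 32768):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"num_ctx={ctx:>6}: ~{gib:.1f} GiB")
```

Under these assumptions every 4,096 tokens of context costs 0.5 GiB, which is why doubling num_ctx is never free even when the weights themselves fit in VRAM.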


RAM by Model Tier

The table below shows the minimum and comfortable system RAM for each model size when running Cursor simultaneously. “Minimum” means no swap, no disk thrash, but tight — KV cache is constrained and you will not be able to use long context windows. “Comfortable” means you can run Docker, a browser, a large Cursor project, and a 32K context window without hitting the ceiling.

| Model size | Quant | VRAM footprint | System RAM minimum | System RAM comfortable |
|---|---|---|---|---|
| 7B (e.g. Qwen 2.5 7B, Llama 3.1 8B) | Q4_K_M | ~4.5–5GB | 16GB | 32GB |
| 14B (e.g. Qwen 2.5 14B, Phi-4 14B) | Q4_K_M | ~8.5–9GB | 32GB | 64GB |
| 32B (e.g. Qwen 2.5 32B, DeepSeek R1 32B) | Q4_K_M | ~19–22GB | 32GB | 64GB |
| 70B (e.g. Llama 3.3 70B, Qwen 2.5 72B) | Q4_K_M | 40GB+ (multi-GPU or CPU offload) | 64GB | 128GB |

A few clarifications on these numbers:

7B on 16GB is workable, not comfortable. The OS plus Cursor takes roughly 3–4GB, Ollama’s model buffer takes another 1–2GB of system RAM even with the weights in VRAM, and the KV cache for anything beyond an 8K context window will push into the remaining headroom. You will function, but you will not be running Docker at the same time.

14B needs 32GB as a true minimum. The model’s VRAM footprint is under 9GB, but the combination of OS, Cursor (~1.5GB for a typical project), Ollama system buffers, and even a modest 16K context window consumes the rest of a 16GB system.

32B at 32GB is an inference-only minimum. If your workstation exists purely to run Cursor as a thin local client and nothing else, 32GB clears the bar for a 32B model with a short context. Add a Docker container, a browser, or a 32K context window and you are swapping. Any serious development work at this model tier requires 64GB.

70B requires 64GB+ as a floor, not a target. These models only run on dual-GPU setups or with heavy CPU offloading. When CPU offloading, model layers that do not fit on the GPU must sit in system RAM. At 40GB of model weights plus OS plus Cursor plus KV cache, 64GB will be under pressure. 128GB is the working configuration for power users running 70B locally.
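If you want to check a machine against these tiers in a script, a lookup is enough. The sketch below copies the minimum and comfortable figures from the table above; the dictionary, function name, and rating labels are made up for illustration.

```python
# (minimum_gb, comfortable_gb) of system RAM per model tier, from the table above.
RAM_TIERS = {
    "7B": (16, 32),
    "14B": (32, 64),
    "32B": (32, 64),
    "70B": (64, 128),
}


def rate_system_ram(model_tier: str, installed_gb: int) -> str:
    """Rate installed system RAM against the tier's two thresholds."""
    minimum, comfortable = RAM_TIERS[model_tier]
    if installed_gb < minimum:
        return "below minimum"
    if installed_gb < comfortable:
        return "minimum"
    return "comfortable"


for tier, installed in (("14B", 16), ("32B", 32), ("70B", 128)):
    print(f"{tier} on {installed}GB: {rate_system_ram(tier, installed)}")
```

A rating of “minimum” carries the same caveats as the table: it assumes a short context window and little else running next to Cursor.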


The Context Window Multiplier

This is the number most tutorials skip, and it is responsible for a large fraction of “why is my local model so slow” complaints.

Ollama’s default context window (num_ctx) is 4,096 tokens. This is fine for short coding exchanges. It is not fine for:

  • Agentic tasks where Cursor’s composer sends file contents as context
  • Multi-file refactors where the full diff is in the prompt
  • Long back-and-forth conversations without clearing context

Here is what increasing num_ctx does to memory:

| num_ctx setting | Additional VRAM/RAM per request (7B model) | Additional VRAM/RAM per request (32B model) |
|---|---|---|
| 4K (default) | baseline | baseline |
| 8K | +0.3–0.5GB | +0.8–1.2GB |
| 16K | +0.8–1.2GB | +2.0–3.0GB |
| 32K | +2.0–2.5GB | +5.0–7.0GB |
| 64K | +5.0–6.5GB | +12–15GB |

The formula from the Ollama documentation is blunt: required memory scales directly with OLLAMA_NUM_PARALLEL × OLLAMA_CONTEXT_LENGTH. At 32K context with a 32B model, the KV cache adds roughly 5–7GB over the 4K baseline, on top of 19–22GB of Q4_K_M weights. If your GPU has 24GB of VRAM, that combination does not fit alongside the runtime's own overhead, so part of it spills into system RAM.

How to set num_ctx in Ollama:

Via environment variable (sets the global default):

OLLAMA_CONTEXT_LENGTH=16384 ollama serve

Via Modelfile parameter (per model):

FROM qwen2.5:14b
PARAMETER num_ctx 16384

Via API request parameter (per request):

{
  "model": "qwen2.5:14b",
  "options": { "num_ctx": 16384 }
}
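From a script, the per-request override might look like the sketch below. It assumes Ollama is listening on its default local address (http://localhost:11434) and uses the /api/generate endpoint with streaming turned off; the helper names and the example prompt are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def build_generate_request(model: str, prompt: str, num_ctx: int) -> urllib.request.Request:
    """Build a non-streaming /api/generate request with a per-request num_ctx."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 8192) -> str:
    """Send the request and return the generated text (needs a running Ollama server)."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())["response"]


# Example call, left commented out because it requires a running server:
# print(generate("qwen2.5:14b", "Summarize this function.", num_ctx=16384))
```

Keeping the request builder separate from the network call makes it easy to confirm the payload carries the num_ctx you expect before anything is sent.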

The recommendation for Cursor + Ollama setups: start at num_ctx 8192 unless you are actively working with long context tasks. 8K is enough for most single-file edits and short agent runs. Move to 16K when doing multi-file work. Only push to 32K if your hardware headroom supports it and you are sending large file trees as context.


Real Scenarios with RAM Estimates

Solo developer, small project, 14B model

Setup: Qwen 2.5 14B Q4_K_M on a single RTX 4070 Ti Super (16GB VRAM), Cursor open on a Python service (~300 files), browser with five tabs.

RAM consumption:

  • OS (Windows 11 or Ubuntu): 3GB
  • Cursor + extension host: 1.2GB
  • Ollama system buffers: 1GB
  • Browser: 0.8GB
  • KV cache at 8K context: ~0.5GB spill into system RAM

Total: ~6.5GB system RAM active load

Verdict: 32GB DDR5 is comfortable. 16GB covers this ~6.5GB load, but the remaining margin shrinks quickly once you add Docker or raise num_ctx. This is the most common sweet spot for indie developers who want a real quality upgrade from 7B without going to 32B.

Power user, large monorepo, 32B model

Setup: Qwen 2.5 32B Q4_K_M, split across a 24GB GPU (RTX 4090) and system RAM via CPU offload for the remaining layers. Cursor open on a TypeScript monorepo (~8,000 files), ESLint server active, Docker running two containers.

RAM consumption:

  • OS: 4GB
  • Cursor + multiple language servers: 3–4GB
  • CPU-offloaded model layers: 4–6GB
  • KV cache at 16K context: 3GB
  • Docker: 3–4GB
  • Browser: 1GB

Total: ~18–22GB system RAM active

Verdict: 32GB is at the edge and will cause swap under peak load. 64GB removes the ceiling entirely. This is the definitive “why does my 32B local model feel sluggish on a 32GB machine” case — it is not VRAM, it is system RAM pressure from everything else competing with CPU offload.

Team lead: Cursor + Docker + local 14B model

Setup: Cursor with a large codebase, Docker running a local Postgres, Redis, and a dev API server, Ollama serving a 14B model for Cursor, browser with multiple tabs, Slack.

RAM consumption:

  • OS: 4GB
  • Cursor: 2.5GB
  • Docker (three containers): 4–6GB
  • Ollama 14B (VRAM-resident weights, system overhead): 1.5GB
  • KV cache spill at moderate context: 0.5–1.5GB
  • Browser + Slack: 1.5GB

Total: ~14–17GB active

Verdict: 32GB is stable, 64GB is genuinely recommended for peace of mind. With 32GB you will see periodic slowdowns when all workloads peak simultaneously. 64GB removes that friction entirely, and with current DDR5 pricing this is the configuration most senior developers should be targeting in 2026.
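The scenario totals are plain range arithmetic, and it is worth redoing them with your own numbers. The sketch below sums the low and high ends of the team-lead scenario above and reports the headroom left on a given machine; the structure and function names are illustrative.

```python
# (low_gb, high_gb) per component, taken from the team-lead scenario above.
components = {
    "OS": (4.0, 4.0),
    "Cursor": (2.5, 2.5),
    "Docker (three containers)": (4.0, 6.0),
    "Ollama 14B system overhead": (1.5, 1.5),
    "KV cache spill": (0.5, 1.5),
    "Browser + Slack": (1.5, 1.5),
}


def total_range(parts: dict) -> tuple:
    """Sum the low ends and the high ends separately."""
    low = sum(lo for lo, _ in parts.values())
    high = sum(hi for _, hi in parts.values())
    return low, high


def headroom(installed_gb: float, parts: dict) -> tuple:
    """Return (worst_case, best_case) free RAM for an installed capacity."""
    low, high = total_range(parts)
    return installed_gb - high, installed_gb - low


low, high = total_range(components)
print(f"Active load: {low:.0f}-{high:.0f} GB")
for installed in (32, 64):
    worst, best = headroom(installed, components)
    print(f"{installed} GB installed: {worst:.0f}-{best:.0f} GB headroom")
```

These are steady-state figures. The periodic slowdowns described above come from peaks, so the worst-case headroom is the number to plan around.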


What Happens When You Run Out of RAM

The degradation follows a clear sequence:

  1. KV cache spills from VRAM to system RAM. Token generation slows by roughly 2–4×. You notice latency but it is not yet catastrophic.
  2. System RAM fills. The OS begins swapping KV cache data to disk. Token generation drops to 3–8 tokens/second. Cursor’s autocomplete timeout fires before the model responds.
  3. Full swap thrash. The system spends more time moving pages than generating tokens. You see <2 tokens/second. At this point, local inference is functionally broken — it is slower than a cloud API on a poor mobile connection.
  4. Cursor stutters on unrelated tasks. Cursor’s own rendering and extension host start competing for the remaining system RAM. The editor becomes unresponsive.

The transition from stage 1 to stage 3 happens faster than most people expect. Going from “slightly over the comfortable limit” to “system is thrashing” can take under a minute of sustained generation with a long context window.
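One way to see which stage you are in is to measure generation speed from Ollama's own response metadata. The sketch below assumes a non-streaming API response, which reports eval_count (tokens generated) and eval_duration (nanoseconds). The thresholds are the rough figures from the list above, and the sample values are made up. A low rate is only a hint, since CPU-only or heavily offloaded setups can be this slow without any swapping.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from a non-streaming Ollama response.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def likely_stage(tps: float) -> str:
    """Map a rate onto the rough stage thresholds listed above."""
    if tps < 2:
        return "stage 3: full swap thrash"
    if tps <= 8:
        return "stage 2: system RAM full, swapping to disk"
    return "stage 1 or healthy"


sample = {"eval_count": 120, "eval_duration": 30_000_000_000}  # made-up numbers
rate = tokens_per_second(sample)
print(f"{rate:.1f} tokens/s -> {likely_stage(rate)}")
```

Logging this rate across a session makes the slide from stage 1 toward stage 3 visible before the editor itself starts to stutter.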

Detection: Run ollama ps and check the PROCESSOR column, which shows what share of the loaded model sits on the GPU versus the CPU. If it reads 100% GPU but generation is still slow, the bottleneck is the KV cache. If it shows a CPU/GPU split, you are CPU-offloading and system RAM is the constraint.
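If you want to automate that check, the PROCESSOR value is easy to parse. The sketch below handles the three forms described in the Ollama FAQ ("100% GPU", "100% CPU", and a split such as "48%/52% CPU/GPU"). The function names and messages are illustrative, and it expects the PROCESSOR value as a string rather than running ollama ps itself.

```python
import re


def gpu_share(processor: str) -> int:
    """Return the GPU percentage from an `ollama ps` PROCESSOR value."""
    percents = [int(p) for p in re.findall(r"(\d+)%", processor)]
    labels = re.findall(r"CPU|GPU", processor)
    if not labels or len(percents) != len(labels):
        raise ValueError(f"unrecognized PROCESSOR value: {processor!r}")
    # Pair each label with its percentage; a CPU-only model has no GPU entry.
    return dict(zip(labels, percents)).get("GPU", 0)


def diagnose(processor: str) -> str:
    """Turn the GPU share into the diagnosis described above."""
    share = gpu_share(processor)
    if share == 100:
        return "fully on GPU: if it is still slow, suspect the KV cache"
    return f"{100 - share}% of the model is in system RAM: RAM is the constraint"


for value in ("100% GPU", "48%/52% CPU/GPU", "100% CPU"):
    print(f"{value:>16} -> {diagnose(value)}")
```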


RAM Speed: Less Important Than You Think

DDR4 vs DDR5 bandwidth is a common upgrade question. The honest answer: for local LLM inference, capacity beats speed. Every doubling of capacity (16→32GB, 32→64GB) eliminates a class of performance problem. DDR5’s bandwidth advantage (roughly 20–30% over DDR4-3600 in synthetic benchmarks) only matters in specific scenarios:

When RAM speed matters for LLM inference:

  • CPU-offloaded inference on large models (70B with partial GPU): DDR5-6000 can improve token throughput by 20–23% over DDR4-3200 in this specific workload
  • Any setup where the CPU processes offloaded model layers on every token, so system memory bandwidth becomes the bottleneck

When RAM speed does not matter:

  • The model is fully VRAM-resident (most setups up to 32B on a 24GB GPU): system RAM is not in the critical path
  • The KV cache fits in VRAM: no system RAM access during generation

The practical conclusion: if you are choosing between 32GB DDR5-5600 and 64GB DDR4-3600, buy the 64GB DDR4. More capacity is the correct upgrade for LLM inference in almost every real scenario.

For new builds in 2026: Intel LGA1851 and AMD AM5 both require DDR5. If you are building from scratch, DDR5-6000 in a 2×32GB or 2×48GB configuration is the right call. But if you are upgrading an existing DDR4 system, buy capacity, not speed.


macOS Unified Memory

Apple Silicon changes the math because CPU, GPU, and Neural Engine share a single memory pool. There is no separate VRAM — the same RAM that holds your Cursor project also feeds the model’s forward pass.

Unified memory allocation by total RAM:

| Apple Silicon config | Effective LLM budget (after OS overhead) | What fits comfortably |
|---|---|---|
| 8GB | ~4–5GB | Llama 3.2 3B only; not suitable for daily Cursor + local model use |
| 16GB | ~12–13GB | 7B model (Q4_K_M); workable, cannot run alongside memory-heavy apps |
| 24GB | ~20GB | 14B model (Q4_K_M); comfortable for most development setups |
| 36GB | ~30GB | 32B model (Q4_K_M); the Mac sweet spot |
| 64–96GB | ~50–80GB | 70B model (Q4_K_M); M3 Max / M4 Max / M4 Ultra territory |

macOS allocates approximately 75% of total unified memory as GPU-accessible by default, with the remainder reserved for system and CPU processes. Ollama on Apple Silicon now uses the MLX backend, which takes full advantage of this unified architecture and eliminates the PCIe bandwidth penalty that hurts discrete GPU setups when models partially spill into system RAM.

The critical caveat: Apple Silicon memory is soldered. Whichever configuration you choose, from 8GB through 96GB in the table above, is fixed at purchase time, and there is no upgrade path. That makes the “comfortable” tier the right spec target; the “minimum” tier leaves no room to grow as models improve.


The Upgrade Decision Guide

16GB → 32GB

Make this upgrade if:

  • You are running any 14B or larger model alongside Cursor
  • You run Docker containers as part of your development workflow
  • You are seeing swap usage during coding sessions

This is the most impactful upgrade for developers on 16GB who are setting up Cursor + Ollama for the first time.

32GB → 64GB

Make this upgrade if:

  • You work on large codebases (5,000+ files in Cursor)
  • You want to run 32B models without CPU offload pressure
  • You run Docker + browser + Cursor + Ollama simultaneously and feel the contention
  • You plan to move to 70B models with CPU offload

64GB → 128GB

Make this upgrade if:

  • You are running 70B models with significant CPU offload layers
  • You run multiple concurrent Ollama instances or use OLLAMA_NUM_PARALLEL > 1
  • Local inference is a primary tool, not a secondary one

Sources

  1. Ollama FAQ — context length, KV cache quantization, and memory scaling
  2. Ollama RAM & VRAM for Every Model — LocalAIMaster
  3. Ollama System Requirements: CPU, GPU, RAM Guide — LocalAIMaster
  4. KV Cache Memory: Calculating GPU Requirements for LLM Inference — Michael Brenndoerfer
  5. Why Ollama and llama.cpp Crawl When Models Spill into RAM — PopularAI
  6. Cursor RAM Usage Reports — Cursor Community Forum
  7. Your Mac’s RAM is its GPU: How Much Unified Memory for Local AI? — SolidAITech
  8. DDR5-6000 RAM for Local LLM Builds: Is It Worth It in 2026? — CraftRigs

For the full hardware picture — GPU selection, VRAM tiers, and component lists — see the Cursor + Local Llama hardware tiers guide. For a benchmark-focused comparison of local model performance on consumer GPUs, read the RTX 5060 Ti local LLM benchmark.

For system RAM sizing across all local AI workloads (not just coding), the runaihome.com article on system RAM for local LLMs covers the full picture.

Verified May 13 2026. Ollama version referenced: 0.6.x. Model VRAM figures based on Q4_K_M quantization; actual usage varies by architecture (GQA, flash attention) and context settings.
