GLM 5.2 as your Cursor and Cline backend in 2026: MIT-licensed open-weight coding model, the config that works, and the honest cost math


TL;DR: GLM 5.2 is Z.ai’s 743B-parameter MoE coding model, released mid-June 2026 under an MIT license with open weights. At roughly $1.40/M input via the Z.ai API it slots into Cursor, Cline, and Continue.dev through a standard OpenAI-compatible endpoint in about ten minutes. It tops the open-weight field on long-horizon agentic coding — but the Cursor base-URL override has a sharp edge you need to know before you flip it on.

| | GLM 5.2 (API) | DeepSeek V4-Flash | Claude Fable 5 |
|---|---|---|---|
| Best for | Long-horizon agentic Cline/Cursor work | Cheapest agent backend | Top-tier reasoning, polish |
| Price (input / output per M) | ~$1.40 / ~$4.40 | $0.14 / $0.435 | $10 / $50 |
| License | MIT (self-host free) | MIT (self-host free) | Proprietary (API only) |
| Context window | 1M | 1M | 200K |
| Params | 743B total / 39B active (MoE) | MoE (cloud) | proprietary |
| The catch | Cursor BYOK override breaks built-in models | Thinking mode breaks Cline if left on | 7× the price of GLM for daily loops |

Honest take: If you run agentic loops in Cline or Cursor all day and the Claude bill is starting to sting, GLM 5.2 is the open-weight model to switch to right now — it’s the strongest agent backend you can legally self-host, and the API is a fraction of Claude’s price. Use Cline (not Cursor) for the cleanest setup, because Cursor’s single global base-URL override is still a trap. Self-hosting only pays off above very heavy team usage — the break-even math is at the bottom.

What landed in June 2026

Z.ai (the company formerly known as Zhipu AI) shipped GLM 5.2 in mid-June 2026, with the open weights going live on Hugging Face around June 16. The model is a 743-billion-parameter mixture-of-experts transformer with roughly 39 billion active parameters per token, routed across 256 experts. The headline spec is the context window: a native 1,048,576-token (1M) window with a 131,072-token max output, both substantial jumps over GLM 5.1.

The two things that make it worth a fresh look are the license and the position. The weights are MIT-licensed — about as permissive as it gets, which means you can self-host inside a commercial product, run it air-gapped, or fine-tune it without a lawyer in the loop. And on the benchmarks that coverage has reported, it’s the top open-weight model for long-horizon coding.

On SWE-bench Pro, the harder variant that tests whether a model can resolve real-world repository issues, GLM 5.2 scored 62.1, ahead of GPT-5.5 at 58.6 and its own predecessor GLM 5.1 at 58.4. On Terminal-Bench 2.1 (autonomous terminal-based coding) it reported 81.0, four points behind Claude Opus 4.8’s 85.0. On the classic SWE-bench Verified it lands around 77.8, trailing the proprietary frontier (Claude Opus and GPT-5.x sit in the low 80s) but leading every other open-weight model.

One honest caveat: Z.ai launched without a full official benchmark table, so several of these figures come from independent coverage and the model card rather than a single first-party page. Treat the agentic numbers as “best open-weight, near-frontier,” not as gospel to the decimal. The thing you actually feel day to day — that it holds engineering context across a 30-step Cline run without losing the thread — is real, and it’s the reason to care.

Three ways to run it

You have three paths, and they map to different goals and budgets:

  1. Z.ai API (api.z.ai) — fastest, zero hardware, OpenAI-compatible. Pay-as-you-go at roughly $1.40/M input and $4.40/M output, with cached input billed around $0.26/M. This is the default for almost everyone.
  2. GLM Coding Plan — a flat subscription that routes GLM 5.2 to coding tools through a dedicated endpoint. Promotional tiers run around $10/month (Lite), $30/month (Pro), and $80/month (Max); the published list prices are higher and discounted by billing cycle, so verify the number on the checkout page the day you buy.
  3. Self-hosted via vLLM or SGLang — the MIT payoff. At FP8 the weights are roughly 744 GB, which is an 8×H200 (or larger) node, not a workstation. This only makes sense at real scale or under a hard data-residency requirement.

A quick smoke test against the API confirms your key and the model name before you touch any editor config:

$ curl -s https://api.z.ai/api/paas/v4/chat/completions \
  -H "Authorization: Bearer $ZAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2",
    "messages": [{"role":"user","content":"Write a Python function that returns the nth Fibonacci number iteratively."}]
  }' | python3 -c "import sys,json; print(json.load(sys.stdin)['choices'][0]['message']['content'])"

def fib(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

Verified against the Z.ai API on June 20, 2026. Two endpoints matter and they are easy to confuse: the general endpoint is https://api.z.ai/api/paas/v4, and if you bought the GLM Coding Plan you must use the coding endpoint https://api.z.ai/api/coding/paas/v4 instead. Point a Coding Plan key at the general URL (or vice versa) and you’ll get 401s that look like a bad key.
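If you'd rather script the same smoke test, here is a minimal Python sketch using only the standard library. The two base URLs and the model name come from this article; the plan keys ("payg", "coding") and the helper name build_request are my own labels, not anything Z.ai defines.

```python
import json
import urllib.request

# Base URLs quoted in this article. The dictionary keys are illustrative labels.
BASE_URLS = {
    "payg": "https://api.z.ai/api/paas/v4",
    "coding": "https://api.z.ai/api/coding/paas/v4",
}


def build_request(plan: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for the given plan."""
    base = BASE_URLS[plan]  # an unknown plan raises KeyError on purpose
    body = {
        "model": "glm-5.2",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Building the request separately from sending it lets you print req.full_url and confirm which endpoint you are about to hit. To actually send it, pass the result to urllib.request.urlopen and read choices[0].message.content from the JSON reply, same as the curl version.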

Cline: the cleanest setup

Cline treats GLM 5.2 as a plain OpenAI-compatible provider, and because each provider config is self-contained, there’s no collision with your other models. This is the path I’d recommend first.

Open the Cline settings (the gear icon in the Cline panel), then:

  1. API Provider → OpenAI Compatible
  2. Base URL → https://api.z.ai/api/paas/v4 (or the /coding/paas/v4 URL if you’re on the Coding Plan)
  3. API Key → your Z.ai key
  4. Model ID → glm-5.2

That’s it. Set the context window in Cline’s model settings to match what you actually want to pay for — the 1M window is available, but every agentic step re-sends the running context, so a sprawling window on a long Plan/Act loop is how you turn a $5 task into a $40 one. For most repo work, capping the model context at 128K–256K is the sane default; reserve the full 1M for genuine whole-repo reasoning.

The real-world gotcha here isn’t GLM-specific but it bites Z.ai users hard: some agent frontends append a spurious /v1 to whatever base URL you give them. If you paste https://api.z.ai/api/paas/v4 and the tool silently calls https://api.z.ai/api/paas/v4/v1/chat/completions, you get a 404 that the UI may report as a model-switch error rather than a bad URL. If your requests 404 immediately, check the resolved URL in the tool’s logs before you blame the key.
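To make that failure mode concrete, here is a small sketch that models how a frontend might resolve the URL and flags the doubled version segment. Both function names are hypothetical helpers for illustration; no tool exposes them, and the "v4 followed by v1" check is just a heuristic for this one endpoint layout.

```python
from urllib.parse import urlsplit


def resolved_chat_url(base_url: str, appends_v1: bool) -> str:
    """Return the URL a frontend would call, optionally modelling the /v1 bug."""
    base = base_url.rstrip("/")
    if appends_v1:
        base += "/v1"
    return base + "/chat/completions"


def has_spurious_v1(resolved_url: str) -> bool:
    """True when a /v1 segment directly follows Z.ai's own /v4 segment."""
    segments = urlsplit(resolved_url).path.strip("/").split("/")
    return any(a == "v4" and b == "v1" for a, b in zip(segments, segments[1:]))


good = resolved_chat_url("https://api.z.ai/api/paas/v4", appends_v1=False)
bad = resolved_chat_url("https://api.z.ai/api/paas/v4", appends_v1=True)
print(good, has_spurious_v1(good))
print(bad, has_spurious_v1(bad))
```

Paste the URL from your tool's request log into has_spurious_v1 and you'll know in one line whether the 404 is a routing problem rather than an auth one.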

Cursor: it works, but mind the override

Cursor can use GLM 5.2 through its custom-model path, and the steps are straightforward:

  1. Settings → Models → Add Custom Model, choose the OpenAI protocol.
  2. Enter the model name. Cursor has historically wanted the model name in uppercase in this field (people hit this with GLM-4.7), so if glm-5.2 is rejected, try GLM-5.2.
  3. Toggle on OpenAI API Key and paste your Z.ai key.
  4. Toggle on Override OpenAI Base URL and set it to https://api.z.ai/api/coding/paas/v4 (Coding Plan) or https://api.z.ai/api/paas/v4 (pay-as-you-go).

Here’s the sharp edge, and it’s a documented one: overriding the OpenAI base URL is global in Cursor, not per-model. The moment you point that override at Z.ai, your custom GLM model works — but Cursor’s own first-party models (the ones it proxies through its servers) stop working, because Cursor now routes everything through your override URL. You’re effectively in BYOK-only mode.

Practically, this means Cursor is an all-or-nothing switch for GLM today: great if you want GLM as your single backend, frustrating if you wanted to keep Cursor’s bundled Claude/GPT access alongside it. There’s an open feature request for per-model base URLs, but until it ships, the workaround most people use is OpenRouter — point the override at OpenRouter’s base URL once and select GLM 5.2 (and anything else) as OpenRouter models, so you’re not flipping the global switch every time you change models. If you want GLM and a mix of other models inside one editor without the override headache, that’s the reason to prefer Cline.

Continue.dev: chat and edit, skip FIM

Continue.dev (VS Code and JetBrains) wires GLM 5.2 in via config.yaml:

models:
  - name: GLM 5.2
    provider: openai
    model: glm-5.2
    apiBase: https://api.z.ai/api/paas/v4
    apiKey: YOUR_ZAI_KEY
    roles:
      - chat
      - edit
      - apply

Use GLM 5.2 for the chat, edit, and apply roles, not for autocomplete. Like most large MoE reasoning models, it’s built for multi-step agentic work, not the sub-200ms fill-in-the-middle latency that inline tab completion needs. Pair it with a small, fast FIM model in the autocomplete role — a 1.5B–7B local coding model via Ollama is the standard move. The Continue.dev + Ollama setup guide covers wiring a local FIM model alongside a cloud chat model in the same config.

A problem you’ll actually hit: runaway agent cost

The first time I let GLM 5.2 run an unbounded Cline “Act” loop on a medium repo with the context window left wide open, a single feature took ~$6 in API spend — not because the per-token price is high, but because each of ~25 agent steps re-sent a growing context, and a few of those steps pulled large files into the window.

Three fixes, in order of impact:

  • Cap the model context in your tool’s settings (128K is plenty for most single-feature work). The 1M window is a capability, not a default you should leave on.
  • Lean on prompt caching. GLM 5.2’s cached-input rate (~$0.26/M) is roughly a fifth of the fresh-input rate, and agentic loops re-send near-identical context every step — exactly the pattern caching is built for. Keep your system prompt and rules stable so they stay cached.
  • Use Plan mode first. Let the model write the plan, approve it, then run Act. An approved plan cuts the number of exploratory steps, and exploratory steps are where the token bill balloons.
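A back-of-envelope model shows why the first two fixes matter. The prices are the ones quoted in this article; the step count, starting context, growth per step, and output size are made-up illustrative numbers, and the caching rule (everything sent on the previous step is billed at the cached rate) is a simplifying assumption, not Z.ai's documented billing behavior.

```python
INPUT_PER_M = 1.40    # fresh input, USD per million tokens (article figure)
CACHED_PER_M = 0.26   # cached input, USD per million tokens (article figure)
OUTPUT_PER_M = 4.40   # output, USD per million tokens (article figure)


def loop_cost(steps, start_ctx, growth, out_per_step, cap=None, cached=False):
    """Estimate USD spend for an agent loop that re-sends its context each step.

    Assumption: with caching on, tokens already sent on the previous step are
    billed at the cached rate and only the newly added tokens count as fresh.
    """
    total = 0.0
    prev_ctx = 0
    for step in range(steps):
        ctx = start_ctx + step * growth      # context grows linearly per step
        if cap is not None:
            ctx = min(ctx, cap)              # the tool trims to the cap
        reused = min(prev_ctx, ctx) if cached else 0
        fresh = ctx - reused
        total += fresh * INPUT_PER_M / 1e6
        total += reused * CACHED_PER_M / 1e6
        total += out_per_step * OUTPUT_PER_M / 1e6
        prev_ctx = ctx
    return total


# Hypothetical run: 25 steps, 20K starting context, +12K tokens per step.
uncapped = loop_cost(25, 20_000, 12_000, 1_500)
capped = loop_cost(25, 20_000, 12_000, 1_500, cap=128_000)
with_cache = loop_cost(25, 20_000, 12_000, 1_500, cached=True)
print(f"uncapped ${uncapped:.2f}, capped ${capped:.2f}, cached ${with_cache:.2f}")
```

With those inputs the model lands at about $5.90 uncapped, about $3.89 with a 128K cap, and about $1.58 with caching. Almost all of the spend is re-sent input rather than output, which is why trimming or caching the context moves the bill far more than shortening the model's replies.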

This is the same dynamic that bit GitHub Copilot users after its June 2026 token-billing switch: the model isn’t expensive, the loop is. GLM’s low per-token price gives you more headroom, but it doesn’t repeal the math.

Self-hosting: when the MIT license actually pays off

The MIT weights are the headline, but self-hosting GLM 5.2 is a server-class undertaking, not a homelab one. At FP8 the weights are roughly 744 GB, which needs an 8×H200 node (about 1,128 GB aggregate VRAM, leaving room for KV cache and overhead). A typical vLLM launch looks like:

vllm serve zai-org/GLM-5.2-FP8 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --kv-cache-dtype fp8_e5m2 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":5}'

At 1M context the FP8 KV-cache flag isn’t optional — it’s what makes the long window fit on 8×H200 at all. An AWQ INT4 quant roughly halves the footprint to ~372 GB, letting it fit on a 4×H200 node if you can accept the quality hit.
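The sizing arithmetic is simple enough to sketch. The parameter count is this article's figure and 141 GB is the H200's memory per GPU; the 20% headroom is an assumed placeholder for KV cache and runtime overhead, and a real 1M-context deployment may need considerably more.

```python
import math

PARAMS_B = 743   # total parameters in billions (article figure)
H200_GB = 141    # memory per H200 GPU, in GB


def weights_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB: parameters times bytes per parameter."""
    return params_b * bits_per_param / 8


def min_gpus(params_b, bits_per_param, gpu_gb=H200_GB, headroom=0.2):
    """Smallest GPU count whose memory covers weights plus a headroom fraction."""
    needed = weights_gb(params_b, bits_per_param) * (1 + headroom)
    return math.ceil(needed / gpu_gb)


fp8_gb = weights_gb(PARAMS_B, 8)     # 8 bits per parameter
int4_gb = weights_gb(PARAMS_B, 4)    # 4 bits per parameter
print(fp8_gb, min_gpus(PARAMS_B, 8))
print(int4_gb, min_gpus(PARAMS_B, 4))
```

The raw count for FP8 comes out at 7 GPUs, but tensor-parallel sizes generally need to divide the model evenly, so in practice you round up to the 8-GPU node. INT4 lands on 4.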

The cost reality: an 8×H100 node rents for roughly $16–24/hour on-demand, and a 4×H200 setup runs around $7+/hour on spot. (At 80 GB per GPU, 8×H100 totals 640 GB, which is below the 744 GB FP8 footprint, so read that figure as a price reference for an 8-GPU node rather than a box that fits the FP8 weights.) Even the $7/hour low end works out to just over $5,000/month if you keep it warm 24/7 (about 730 hours), and the 8-GPU rate is more than double that. Against the API at ~$1.40/M input, $5,000 buys roughly 3.6 billion input tokens, so you’d need to be burning on the order of billions of tokens a month before owning the inference beats renting it. Unless you are one of the largest teams or have a hard “code cannot leave our network” mandate, the API or the Coding Plan is the correct answer, and the MIT license is mostly insurance and fine-tuning freedom rather than a daily-driver cost play. If you do want to size a box, the runaihome.com GLM 5.2 hardware guide has the VRAM-by-quant breakdown, and for rentable multi-GPU nodes RunPod is the usual starting point.
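Here is the break-even arithmetic as a sketch you can plug your own rates into. It assumes a 730-hour month, prices every token at the API's input rate, and ignores whether the node could actually serve that volume, so treat the result as a floor rather than a forecast.

```python
HOURS_PER_MONTH = 730  # average month: 8,760 hours per year / 12


def monthly_node_cost(hourly_usd: float) -> float:
    """USD per month to keep a rented node warm around the clock."""
    return hourly_usd * HOURS_PER_MONTH


def break_even_tokens_m(hourly_usd: float, api_price_per_m: float) -> float:
    """Millions of tokens per month at which API spend equals node rental."""
    return monthly_node_cost(hourly_usd) / api_price_per_m


low_end = break_even_tokens_m(7.0, 1.40)     # 4-GPU spot rate from the article
eight_gpu = break_even_tokens_m(16.0, 1.40)  # low end of the 8-GPU on-demand range
print(f"{low_end:,.0f}M and {eight_gpu:,.0f}M tokens per month")
```

At $7/hour the node costs $5,110 a month and breaks even around 3.65 billion tokens; at $16/hour the bar rises to roughly 8.3 billion. Output tokens bill at a higher API rate, which pulls the real break-even down somewhat, but not by an order of magnitude.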

How it stacks up

Against the other open-weight backends worth wiring in this year, GLM 5.2 wins on agentic reasoning and loses on raw cheapness:

| | GLM 5.2 | DeepSeek V4-Flash | Kimi K2.7 Code | Codestral 2 |
|---|---|---|---|---|
| Strength | Long-horizon agentic loops | Cheapest tokens | 1T-param raw capability | FIM autocomplete |
| Input price / M | ~$1.40 | $0.14 | varies by host | $0.30 |
| Context | 1M | 1M | 256K | 256K |
| Self-host floor | 8×H200 (744 GB FP8) | cloud-first | very large | single 16 GB GPU |
| License | MIT | MIT | Modified MIT | Apache 2.0 |

If pure cost-per-token is the only thing you care about, DeepSeek V4-Flash is roughly 10× cheaper on input and good enough for a lot of agent work. If you want the best inline autocomplete you can run locally, Codestral 2 fits a single 16 GB card. GLM 5.2’s lane is the middle-to-high end: you’re running long, stateful agent sessions where staying on-track across 20+ steps matters more than shaving fractions of a cent, and you want frontier-adjacent quality without Claude Fable 5’s price tag.
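To see what those price gaps mean for a real bill, here is a sketch that prices one workload across the three models with published input and output rates in this article. The 100M-input / 5M-output monthly volume is an invented example; swap in your own numbers.

```python
# (input, output) USD per million tokens, taken from this article's first table.
PRICES = {
    "GLM 5.2": (1.40, 4.40),
    "DeepSeek V4-Flash": (0.14, 0.435),
    "Claude Fable 5": (10.00, 50.00),
}


def workload_cost(model: str, input_m: float, output_m: float) -> float:
    """USD for a workload measured in millions of input and output tokens."""
    input_price, output_price = PRICES[model]
    return input_m * input_price + output_m * output_price


# Hypothetical month: 100M input tokens, 5M output tokens.
costs = {model: workload_cost(model, 100, 5) for model in PRICES}
for model, usd in costs.items():
    print(f"{model}: ${usd:,.2f}")
```

For that input-heavy mix the totals come to about $162 for GLM 5.2, about $16 for DeepSeek V4-Flash, and $1,250 for Claude Fable 5. Agent loops skew heavily toward input tokens, so the input price dominates the comparison.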

FAQ

Is GLM 5.2 actually free? The weights are MIT-licensed, so running them on your own hardware is free of license cost (not free of GPU cost). The hosted API and the GLM Coding Plan are paid.

Can I use the 1M context for free on the API? The 1M window is available, but you pay per token, and re-sending a large context every agent step gets expensive fast. Cap your tool’s context window unless you genuinely need whole-repo reasoning.

Why does Cursor stop using its own models after I set the base URL? Cursor’s base-URL override is global, not per-model. Pointing it at Z.ai routes all requests there, which disables Cursor’s proxied first-party models. Use OpenRouter as the override target, or use Cline instead, to avoid the all-or-nothing switch.

Which endpoint do I use — /paas/v4 or /coding/paas/v4? Pay-as-you-go API keys use https://api.z.ai/api/paas/v4. GLM Coding Plan subscriptions use https://api.z.ai/api/coding/paas/v4. Crossing them produces auth errors.

Should I self-host it? Almost certainly not, unless you have a hard data-residency rule or you’re burning billions of tokens a month. The API is far cheaper than keeping an 8×H200 node warm.

Is there a data-privacy concern with the Z.ai API? Z.ai is a China-based provider, and some coverage flags data-handling considerations for sensitive or regulated code sent to the hosted API. If that matters for your work, that’s exactly the case where the MIT weights and self-hosting earn their keep.

Last updated June 20, 2026. Pricing, benchmark figures, and features change frequently; Z.ai launched GLM 5.2 without a full first-party benchmark table, so verify current numbers on the official pricing and model pages before purchasing.
