AI Code Review Setup with Reviewdog and Local LLM


Automated code review that posts inline PR comments sounds like a Copilot feature requiring a $39/month seat. It is not. With Reviewdog, a roughly 60-line Python script, and a self-hosted Ollama instance, you can run the same pattern on every pull request with no per-seat fee and no diff leaving your infrastructure.

This is not a “someday when the tooling matures” guide. Reviewdog is production-ready, has been used at scale in large open-source projects, and the Ollama API is stable enough to build CI pipelines on. What you are building here is a working PR review system, not a proof of concept.

Honest take upfront: This setup makes strong economic sense for teams of four or more. For solo developers, the operational overhead (keeping a GPU box online, maintaining the Ollama install) probably exceeds the value. The article covers both scenarios so you can make that call for your situation.


What Reviewdog Is

Reviewdog is an open-source tool written in Go that runs any linter, parses its output, and posts the results as inline review comments on a GitHub or GitLab PR. It was built to eliminate the friction of checking CI logs and cross-referencing them with your diff manually.

The key design insight is that Reviewdog does not care what the linter is. You pipe any tool’s output through Reviewdog’s adapter format and it handles the GitHub API calls, comment deduplication, and PR annotation. That means the same runner can post comments from golangci-lint, ruff, eslint, or — importantly — a custom script that calls a local LLM.
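To make the "any tool" point concrete, here is a minimal sketch of a custom check that emits Reviewdog's rdjsonl format. The script name todo_check.py and the function todo_diagnostics are hypothetical, chosen only for illustration; the check itself (flagging TODO comments) is a stand-in for whatever logic you want to run.

```python
import json
import sys


def todo_diagnostics(path: str, text: str) -> list[dict]:
    """Return one rdjsonl-style diagnostic per line that contains 'TODO'."""
    diagnostics = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        index = line.find("TODO")
        if index == -1:
            continue
        diagnostics.append({
            "message": "Unresolved TODO comment",
            "severity": "INFO",
            "location": {
                "path": path,
                # rdjsonl line and column positions are 1-based
                "range": {"start": {"line": lineno, "column": index + 1}},
            },
        })
    return diagnostics


if __name__ == "__main__":
    # Hypothetical usage:
    #   python3 todo_check.py src/app.py | reviewdog -f=rdjsonl -reporter=github-pr-review
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            for diagnostic in todo_diagnostics(path, handle.read()):
                print(json.dumps(diagnostic))
```

Each printed line is one self-contained JSON object, which is why the same pipe works for a five-line script and for an LLM reviewer alike.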

Reviewdog supports several reporter modes:

  • github-pr-review: posts inline comments on the diff
  • github-pr-check: posts a GitHub Checks annotation (cleaner for CI)
  • local: prints to stdout, useful for local development

The project has over 8,000 GitHub stars and integrations with more than 40 linters maintained in the reviewdog/action-* family of GitHub Actions.


Why Pair It with a Local LLM

Static linters are pattern matchers. They catch undefined variables, unused imports, and style violations. What they do not catch:

  • Logic that is technically valid but wrong for the context (“this condition will never be true given how fetchUser works upstream”)
  • Security smells that require understanding data flow (“this SQL is parameterized but the table name is interpolated”)
  • Naming that is confusing relative to the surrounding codebase
  • Code that duplicates logic that already exists elsewhere in the repo

LLMs can catch each of these, though not every time. The challenge with cloud LLMs is cost at team scale. If you run 50 PRs a month with diffs averaging 200 lines, a GPT-4o call per review runs to roughly $15–25/month at current pricing. Not catastrophic, but it compounds across teams and across tools. A local LLM eliminates that per-call cost once the hardware is paid for.

The other driver is privacy. If you work on proprietary code, every diff you send to OpenAI or Anthropic’s API is subject to their data policies. A local Ollama instance running on your own hardware or a self-hosted VM means the code never leaves your network. That is often a hard requirement for enterprise and regulated-industry teams.

See also: Cline + Local LLM Privacy-First Setup in 2026 for a deeper treatment of the privacy argument and how local inference fits into a dev workflow beyond just code review.


Stack Overview

The complete stack has three layers:

Layer 1 — Static analysis (Reviewdog native)

  • golangci-lint for Go
  • ruff for Python
  • eslint for TypeScript/JavaScript

These run fast (under 30 seconds), catch deterministic issues, and do not require GPU compute. Reviewdog handles posting their output as PR comments natively.

Layer 2 — LLM diff review (custom script)

A Python script that calls a local Ollama endpoint with the full git diff and returns structured comments in Reviewdog’s rdjsonl format. This is the part most setups skip over. Details below.

Layer 3 — Orchestration

A GitHub Actions workflow that wires layers 1 and 2 together on every PR. The workflow checks out the repo, installs tools, runs both layers, and posts results via the GITHUB_TOKEN.


Model Choice: Why Reasoning Quality Beats Speed Here

The instinct for CI is to pick the fastest model. For code review, that instinct is wrong.

A fast small model (Llama 3.2 3B, Phi-3 mini) produces code review comments that are syntactically plausible but often wrong or generic. “This function could be more efficient” on a function that is already O(1) is worse than no comment — it trains developers to ignore the reviewer.

For code review specifically, you want a model with enough context capacity and reasoning depth to understand what the code is trying to do and identify where it falls short. The two models that hold up at local inference scale in 2026:

Qwen2.5-Coder 32B — 32B parameters, requires ~20 GB VRAM in Q4 quantization. Strong code understanding, low hallucination rate on logic errors. If you have an RTX 4090 or a dual-GPU setup, this is the pick.

DeepSeek-Coder-V2 Lite (16B) — runs in ~10 GB VRAM, slightly more aggressive in flagging issues (occasional false positives). Good tradeoff if you are on a single 16 GB card.

For detailed model-to-VRAM mapping, see: Best Local AI Models by VRAM on runaihome.com.

Models to avoid for this task: anything at 7B or below. Models in the 3B–7B range produce generic comments that add noise without value. If your hardware cannot run at least a 14B model at Q4, consider whether the LLM review layer adds enough to justify running it, or use the static analysis layer only and save the LLM review for local pre-commit use.
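For a quick sanity check on whether a model fits your card, a rule-of-thumb estimate can help. The sketch below is an assumption-laden approximation, not a measurement: the ~4.5 bits per weight for Q4-style quantization and the flat 2 GB allowance for KV cache and runtime buffers are both assumed values, and the function name estimate_vram_gb is hypothetical. Real usage varies with context length and runtime.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: weight storage plus a flat allowance for cache and buffers."""
    # 1e9 parameters * bits per weight / 8 bits per byte = gigabytes of weights
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


# Assumed ~4.5 bits/weight for Q4-style quantization; treat results as ballpark figures.
for label, params in [("32B model", 32), ("16B model", 16)]:
    print(f"{label}: roughly {estimate_vram_gb(params, 4.5):.0f} GB")
```

The estimates should land within a gigabyte or two of the figures quoted above; if a model is borderline for your card, test it with a realistic diff before committing to it in CI.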


The LLM Review Script

This script calls a local Ollama endpoint with the diff between origin/main and HEAD and emits Reviewdog-compatible rdjsonl output. Save it as .github/scripts/llm-review.py.

#!/usr/bin/env python3
"""LLM diff reviewer — emits rdjsonl for Reviewdog."""
import json
import subprocess
import sys
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "qwen2.5-coder:32b"
MAX_DIFF_LINES = 400  # keeps context within model limits

def get_diff() -> str:
    result = subprocess.run(
        ["git", "diff", "origin/main...HEAD", "--unified=3"],
        capture_output=True, text=True, check=True
    )
    lines = result.stdout.splitlines()
    return "\n".join(lines[:MAX_DIFF_LINES])

def ask_llm(diff: str) -> list[str]:
    prompt = f"""You are a senior code reviewer. Review the following git diff.
For each issue found, respond with a JSON object on its own line:
{{"path": "<file>", "line": <line_number>, "message": "<issue description>", "severity": "ERROR|WARNING|INFO"}}
Only output JSON lines. No explanations. No markdown.

Diff:
{diff}"""

    payload = json.dumps({"model": MODEL, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.loads(resp.read())
    return body.get("response", "").strip().splitlines()

def to_rdjsonl(lines: list[str]) -> None:
    for raw in lines:
        try:
            item = json.loads(raw)
            print(json.dumps({
                "message": item["message"],
                "severity": item.get("severity", "WARNING"),
                "location": {
                    "path": item["path"],
                    "range": {"start": {"line": int(item["line"]), "column": 1}}
                },
                "source": {"name": "llm-reviewer", "url": ""}
            }))
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            # Skip any line that is not a well-formed comment object
            continue

if __name__ == "__main__":
    diff = get_diff()
    if not diff.strip():
        sys.exit(0)
    raw_lines = ask_llm(diff)
    to_rdjsonl(raw_lines)

A few implementation notes:

  • MAX_DIFF_LINES = 400 is a practical cap. Most models handle 2,000–4,000 tokens of diff context reliably; beyond that quality degrades. If your PRs routinely exceed 400 lines, consider running the script per-file rather than against the full diff.
  • The stream: false flag on the Ollama request simplifies parsing. For large diffs, streaming with a timeout is safer — adjust if your model is slow to respond.
  • The script emits rdjsonl (Reviewdog JSON Lines) format. Each line is one comment. Reviewdog reads this from stdout.
  • Error handling is intentionally minimal here. For production, wrap the urlopen call in a retry loop with exponential backoff.
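Two of the notes above, per-file review and retry with backoff, can be sketched as small helpers. This is one possible shape under stated assumptions, not part of the original script: split_diff_by_file and with_retry are hypothetical names, the header parsing assumes ordinary paths that do not themselves contain " b/", and the retry defaults (3 attempts, 2-second base delay) are arbitrary.

```python
import time
import urllib.error


def split_diff_by_file(diff: str) -> dict[str, str]:
    """Split `git diff` output into one chunk per file, keyed by the new path."""
    chunks: dict[str, list[str]] = {}
    current = None
    for line in diff.splitlines():
        if line.startswith("diff --git "):
            # Header looks like: diff --git a/src/app.py b/src/app.py
            # Assumption: the path itself does not contain " b/".
            current = line.split(" b/", 1)[-1]
            chunks[current] = []
        if current is not None:
            chunks[current].append(line)
    return {path: "\n".join(lines) for path, lines in chunks.items()}


def with_retry(call, attempts: int = 3, base_delay: float = 2.0, sleep=time.sleep):
    """Run call(); on a network error wait base_delay * 2**n seconds, then retry."""
    for attempt in range(attempts):
        try:
            return call()
        except (urllib.error.URLError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)


# Hypothetical wiring into the script above:
# for path, chunk in split_diff_by_file(get_diff()).items():
#     to_rdjsonl(with_retry(lambda: ask_llm(chunk)))
```

If you review per file, the 400-line cap would need to move from get_diff() to each chunk; otherwise files late in a large diff are still cut off before they reach the model.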

GitHub Actions Workflow

Save this as .github/workflows/review.yml. It runs on every PR targeting main.

name: AI Code Review

on:
  pull_request:
    branches: [main]
    types: [opened, synchronize, reopened]

permissions:
  contents: read
  pull-requests: write
  checks: write

jobs:
  static-review:
    name: Static Analysis (Reviewdog)
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install Reviewdog
        uses: reviewdog/action-setup@v1
        with:
          reviewdog_version: latest

      - name: Run ruff (Python)
        uses: reviewdog/action-ruff@v1
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          reporter: github-pr-review
          level: warning
          ruff_flags: "--select=E,F,W,C,N"

      - name: Run eslint (JS/TS)
        uses: reviewdog/action-eslint@v1
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          reporter: github-pr-review
          eslint_flags: "src/**/*.{ts,tsx,js,jsx}"

  llm-review:
    name: LLM Review (Local)
    runs-on: self-hosted          # must be a runner with Ollama installed
    needs: static-review
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install Reviewdog
        uses: reviewdog/action-setup@v1
        with:
          reviewdog_version: latest

      - name: Pull model if not cached
        run: ollama pull qwen2.5-coder:32b

      - name: Run LLM diff review
        env:
          REVIEWDOG_GITHUB_API_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          python3 .github/scripts/llm-review.py \
            | reviewdog -f=rdjsonl \
              -name="llm-reviewer" \
              -reporter=github-pr-review \
              -level=warning \
              -filter-mode=diff_context

The critical detail here is runs-on: self-hosted on the llm-review job. GitHub’s hosted runners do not have a GPU, and even if they did, you would not want to install and load a 32B model per workflow run — the cold start alone would cost 10+ minutes. The self-hosted runner should be a machine with Ollama already running and the model already pulled. The ollama pull step is a no-op if the model is cached; it handles the case where the model was evicted or the runner was reprovisioned.

The static-review job runs on ubuntu-latest (hosted runner), so only the cheap, fast linting burns hosted-runner minutes. The GPU-dependent LLM step runs on your hardware.

Registering a self-hosted runner: In your GitHub repo, go to Settings → Actions → Runners → New self-hosted runner. Follow the install instructions for your OS. On the machine running Ollama, run the ./run.sh command from the GitHub instructions. The runner connects to GitHub over HTTPS — no inbound ports required.
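Before relying on the runner in CI, it can help to confirm that Ollama is reachable and the model is present. The sketch below queries Ollama's /api/tags endpoint, which lists locally available models. The function names model_available and check_ollama are hypothetical, and the 5-second timeout is an arbitrary assumption.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # lists locally available models


def model_available(tags_response: dict, model: str) -> bool:
    """True if `model` appears by name in a parsed /api/tags response."""
    return any(entry.get("name") == model for entry in tags_response.get("models", []))


def check_ollama(model: str = "qwen2.5-coder:32b", timeout: float = 5.0) -> bool:
    """Return False if Ollama is unreachable, returns bad JSON, or lacks the model."""
    try:
        with urllib.request.urlopen(OLLAMA_TAGS_URL, timeout=timeout) as resp:
            return model_available(json.loads(resp.read()), model)
    except (OSError, ValueError):
        # OSError covers connection failures and timeouts; ValueError covers bad JSON
        return False


if __name__ == "__main__":
    print("ready" if check_ollama() else "Ollama not reachable or model missing")
```

Running this once on the runner machine separates "the runner is misconfigured" from "the review script is broken" when the first workflow run fails.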


Reviewdog Configuration Snippet

For projects using a .reviewdog.yml config file (useful when you want to add custom linters beyond the managed GitHub Actions):

runner:
  llm-reviewer:
    cmd: python3 .github/scripts/llm-review.py
    format: rdjsonl
    level: warning
    reporter: github-pr-review

  golangci:
    cmd: golangci-lint run --out-format=line-number ./...
    errorformat:
      - "%f:%l:%c: %m"
    level: error
    reporter: github-pr-review

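This lets you run reviewdog -conf=.reviewdog.yml -reporter=local -diff="git diff origin/main" locally before pushing, getting the same LLM feedback in your terminal without waiting for CI. The local reporter needs the -diff flag so Reviewdog knows which changed lines to filter against.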


What the LLM Review Catches vs. Misses

After running this stack on a team’s codebase for a few months, the pattern of what gets flagged and what gets missed is fairly predictable.

The LLM is reliably useful for:

  • Naming clarity — when a variable name is misleading relative to what it actually holds
  • Missing null/error handling — “this function returns undefined in the empty-array case but the caller assumes a string”
  • Security patterns at the diff level — hardcoded secrets, SQL fragments that look parameterized but aren’t, JWT decode without verification
  • Dead code left in a PR — commented-out blocks, unused imports that the linter missed
  • Logic tautologies — conditions that are always true or always false given the surrounding context

The LLM regularly misses:

  • Repo-specific conventions it hasn’t been given context about — if your codebase has a pattern like “all database calls go through the repository layer,” the model won’t know to flag a PR that bypasses it
  • Performance issues that require understanding the full call graph — it sees the diff, not the runtime profile
  • Test coverage gaps — it can comment on tests it sees, but it doesn’t know what the existing test suite covers
  • Subtle threading bugs — unless the diff clearly shows a race, these are nearly invisible to single-diff analysis

The mitigation for the first category is prompt engineering: include a preamble in your system prompt describing the top 5 project conventions the model should enforce. Something like "This repo uses the repository pattern: database access must go through classes in /src/repositories. Flag any direct ORM calls in controllers." That preamble dramatically improves the signal-to-noise ratio.


The Copilot PR Review Alternative

GitHub Copilot’s PR summary and review feature (rolled out in 2025, now standard on the Business and Enterprise tiers) does roughly the same thing through the GitHub UI — it reads the diff and posts a summary plus inline suggestions. The Business tier runs $19/user/month; Enterprise is $39/user/month.

See the GitHub Copilot 2026 review for a full breakdown of what those tiers include.

The Copilot PR review is more polished out of the box. It understands repository context better than a diff-only approach because GitHub can feed it more of the codebase. For teams already on GitHub and already paying for Copilot, it is the easier path.

The Reviewdog + local LLM stack wins on:

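  • Cost at scale — a 5-person team on Copilot Business pays $95/month ($1,140/year) for seats that include PR review. A self-hosted runner with a capable GPU is a one-time purchase of roughly that annual amount (around $1,000 for a used RTX 4090), after which the remaining costs are electricity and maintenance.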
  • Privacy — every diff you send to Copilot’s backend is processed on Microsoft’s infrastructure. For teams with IP sensitivity or regulated-industry requirements, that is a blocker.
  • Customizability — you control the prompt, the model, the output format, and which files get reviewed.

The Reviewdog stack loses on:

  • Setup time — you are looking at 3–4 hours to get this working end-to-end the first time
  • Reliability — self-hosted runners go offline; your GPU box will need maintenance
  • Model quality ceiling — Copilot’s backend almost certainly runs something much larger than 32B. For nuanced architectural feedback, the cloud model wins.

For teams considering privacy-first local AI tooling more broadly, the same principles apply to the full development workflow: Cline + Local LLM Privacy-First Setup 2026 covers that angle in depth.


Custom Rules and Context Injection

One underused lever is feeding the model your project’s custom rules alongside the diff. This connects to the broader pattern of using rule files to shape AI behavior — see Cursor Custom Rules and Templates 2026 for how that pattern works in an IDE context.

In the LLM review script, you can extend the prompt construction to read from a .review-rules.md file in the repo root:

import pathlib

def get_rules() -> str:
    rules_path = pathlib.Path(".review-rules.md")
    if rules_path.exists():
        return rules_path.read_text()
    return ""

# In ask_llm(), prepend to prompt:
rules = get_rules()
if rules:
    prompt = f"Project conventions to enforce:\n{rules}\n\n" + prompt

A .review-rules.md might contain:

- All database queries must use the repository pattern in /src/repositories/
- Do not use console.log in production code (use the Logger class)
- Passwords and secrets must never appear in source files or logs
- Async functions must handle errors explicitly — no naked awaits

With this context, the model stops being a generic reviewer and starts behaving like a reviewer who knows your codebase.


The Honest ROI Calculation

For a solo developer: The setup cost is 3–4 hours. You need a machine powerful enough to run a 32B model (or accept a lower-quality 14B model). You need to maintain the self-hosted runner. The free static analysis layer (Reviewdog + ruff/eslint) is worth doing regardless — that part takes 30 minutes. The LLM layer, for a solo dev, is marginal. You already know your codebase; you are your own code reviewer. Use the LLM review locally via reviewdog -conf=.reviewdog.yml before pushing if you want AI feedback, but the full CI pipeline is overkill.

For a team of 4–8: The economics flip. At $19/user/month for Copilot Business, a 6-person team spends $1,368/year on seats. A used RTX 4090 costs around $1,000 and runs Qwen2.5-Coder 32B reliably, so the hardware pays for itself within the first year; after that, the remaining costs are electricity and maintenance. The privacy argument is usually the actual decision driver for teams in this range, though.

For teams above 10: Most teams at this size have more pressing problems than review tooling cost. The architecture argument (control over model, prompt, output format) matters more than the dollar figure. If your security posture requires code to stay on-premises, this is the only viable automated review stack.
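The break-even arithmetic for the team scenario above can be checked with a few lines. This is a sketch using the article's figures ($1,000 hardware, 6 seats at $19/month); the function name break_even_months is hypothetical, and the $15/month running cost in the second call is an assumed placeholder for electricity, not a measured value.

```python
import math


def break_even_months(hardware_cost: float, seats: int, price_per_seat: float,
                      monthly_running_cost: float = 0.0) -> int:
    """Whole months until avoided seat fees cover the hardware cost."""
    monthly_saving = seats * price_per_seat - monthly_running_cost
    if monthly_saving <= 0:
        raise ValueError("self-hosting never breaks even at these numbers")
    return math.ceil(hardware_cost / monthly_saving)


print(break_even_months(1000, seats=6, price_per_seat=19))  # 9
# Same team with an assumed $15/month for electricity:
print(break_even_months(1000, seats=6, price_per_seat=19, monthly_running_cost=15))  # 11
```

Plugging in your own seat count and hardware price shows quickly why the calculation rarely works for one or two developers: with two seats, the same $1,000 card takes over two years to pay back.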


Sources

  1. Reviewdog — official GitHub repository: https://github.com/reviewdog/reviewdog
  2. Ollama API documentation (generate endpoint): https://github.com/ollama/ollama/blob/main/docs/api.md
  3. Reviewdog GitHub Actions integrations: https://github.com/reviewdog
  4. Qwen2.5-Coder model card on HuggingFace: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct
  5. DeepSeek-Coder-V2 technical report and model details: https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct
  6. GitHub self-hosted runners documentation: https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/about-self-hosted-runners
  7. Best Local AI Models by VRAM (runaihome.com): https://runaihome.com/blog/best-local-ai-models-by-vram/

Last verified: May 13, 2026. Reviewdog v0.20, Ollama 0.5, Qwen2.5-Coder 32B Q4_K_M quantization.
