Test-Driven AI Coding: The Workflow That Actually Catches Bugs
The bug doesn’t come from code that fails to compile. It comes from code that compiles, passes all 24 tests, ships to production, and then silently returns wrong results for every input that wasn’t in the original prompt.
AI coding tools have a specific failure mode that raw code review doesn’t catch: tautological tests. When you ask Cursor, Copilot, or Cline to “add tests for this function,” the tool reads your implementation and writes tests that verify it. If the implementation is wrong, the tests confirm the wrong behavior. Coverage goes up; confidence goes up; bugs survive.
Test-Driven Development forces a different order. Tests come first — written against a spec, not an implementation. The implementation is written to make those tests pass. This constraint is exactly what transforms AI coding tools from confident bug generators into reliable collaborators.
The overhead is real. This article covers what actually works per tool, what discipline costs you in speed, and where TDD still leaves gaps even when you do it right.
The tautological test problem, quantified
When developers let AI generate tests after writing code, roughly 35% of the resulting tests are tautological — they pass because they mirror the implementation’s internal logic, not because they verify the correct behavior. Flip the order: write a spec, generate tests from the spec, then generate the implementation, and that rate drops to 5–10% (source: GitHub Copilot documentation on spec-driven workflows).
The mechanism is simple. A test for a calculate_discount(price, tier) function that was written by reading the implementation will assert whatever the implementation does. If the implementation has an off-by-one error on Gold tier discounts, the test passes with the wrong expected value. Neither coverage metrics nor CI pipelines catch this — the test suite is green, and the bug ships.
There’s a deeper variant: the test computes its expected output by calling the function under test. This is worse than tautological — it’s circular. Any input produces a “passing” test because the expected and actual values are generated by identical code paths.
The fix in both cases is the same: tests must precede implementation, and expected values must come from a spec, not the code.
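The circular variant and its fix can be sketched as below. The spec table, the integer-cents signature, and the helper names are hypothetical; the bug is planted on purpose so the difference is visible.

```python
# Hypothetical spec table: (price_cents, tier, expected_discount_cents).
SPEC_TABLE = [
    (10_000, "gold", 2_000),
    (10_000, "silver", 1_000),
    (10_000, "basic", 0),
]


def calculate_discount(price_cents: int, tier: str) -> int:
    """Implementation with a planted bug on the gold rate."""
    rates = {"gold": 19, "silver": 10}  # bug: the spec says gold is 20
    return price_cents * rates.get(tier, 0) // 100


def test_circular():
    # Expected and actual come from the same code path, so this cannot fail.
    for price, tier, _ in SPEC_TABLE:
        expected = calculate_discount(price, tier)
        assert calculate_discount(price, tier) == expected


def failing_spec_rows():
    """Return the (price, tier) rows where the implementation disagrees with the spec."""
    return [
        (price, tier)
        for price, tier, expected in SPEC_TABLE
        if calculate_discount(price, tier) != expected
    ]
```

Here `test_circular()` passes, while `failing_spec_rows()` reports the gold row. The expected values in the table were written down before the implementation existed, so they can disagree with it.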
The three-phase discipline (and why enforcement matters)
Red–Green–Refactor is the standard TDD loop. In AI-assisted development, each phase requires explicit enforcement because AI tools will collapse them if you don’t stop them.
Red — The agent writes tests only. No implementation changes. If you run the test suite at the end of this phase, everything should fail. If a test passes without implementation, the test is wrong.
Green — The agent writes the minimal implementation to pass the new tests. “Minimal” matters. Permitting large, speculative implementations here is how scope creeps and bugs enter.
Refactor — Structure improves; behavior doesn’t change. Tests stay green throughout. This is the phase where the agent can rename, extract, and reorganize safely.
Any agent that modifies production code during Red is violating the contract. With tools like Cursor and Cline, the enforcement mechanism is your rules file. With Aider, it’s the --auto-test flag. With Copilot in VS Code 2026, it’s the dedicated agent mode.
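Whichever tool you use, the Red-phase check itself is easy to script. The helper below is a rough sketch, not part of any of these tools: the name `suite_is_red` is made up, and it relies only on the convention that a test runner exits non-zero when something fails.

```python
import subprocess


def suite_is_red(test_cmd: list[str]) -> bool:
    """Return True when the test command exits non-zero.

    For pytest, a non-zero exit covers failing tests and also collection
    errors, such as importing a module that does not exist yet. Both are
    expected at the end of the Red phase.
    """
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    return result.returncode != 0


# Example call at the end of the Red phase (assumes pytest is installed
# and the new tests live in tests/test_feature.py):
#     suite_is_red(["pytest", "tests/test_feature.py"])
```

Point the command at the new test file rather than the whole suite. Otherwise an unrelated failure elsewhere can make the result look red while the new tests are in fact passing.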
Cursor: rules + Agent mode
Cursor’s Agent mode and .cursor/rules files give you the enforcement surface you need for TDD. The minimal setup:
Create .cursor/rules/tdd.mdc with the following content (for more on rule file structure, see Custom Cursor Rules: Templates That Actually Work):
---
description: TDD discipline — enforce phase separation
alwaysApply: true
---
PHASE RED: Write failing tests only. Do not touch production code in src/.
Stop and confirm when all new tests exist and are failing.
PHASE GREEN: Write minimum implementation to pass failing tests.
Do not refactor. Do not add features not covered by a failing test.
PHASE REFACTOR: Improve code structure only. No behavior changes.
All tests must stay green throughout.
NEVER compute expected test values by calling the function under test.
NEVER add tests after writing implementation — tests come first.
In practice, the workflow is:
- Open Cursor Agent (Cmd/Ctrl+Shift+I). Switch to Plan mode.
- Describe the feature as a spec — inputs, outputs, edge cases, what should fail. Do not describe implementation.
- Ask Agent to write tests in Plan mode. Review them before any code runs.
- Switch to Agent mode (not Plan). Ask it to implement until tests pass.
- Ask it to refactor — then run the suite one more time.
The step most developers skip is reviewing the tests before switching to implementation. That five-second check is where you catch circular assertions before they become production bugs.
GitHub Copilot / VS Code: dedicated TDD agents
VS Code’s Copilot introduced purpose-built TDD agents in 2026. The setup uses three agent files, one for each phase, with handoffs between them:
- .github/agents/TDD-red.agent.md: writes failing tests only, explicitly forbidden from touching implementation
- .github/agents/TDD-green.agent.md: writes minimal implementation, runs test suite automatically
- .github/agents/TDD-refactor.agent.md: refactors with tests running as a guard
Create these via Command Palette → Chat: New Custom Agent.
The handoff points between phases are manual checkpoints — you click to advance. This is intentional. The documentation notes: “Handoffs provide control points where you can assess each step, verify the AI’s work, and steer the agent.” Treating them as friction to click through defeats the purpose.
For simpler scenarios, the /tests slash command in Copilot Chat generates tests that match your project’s existing conventions (pytest fixture patterns, Jest describe/it structure) without configuration. The catch: /tests runs after your implementation exists. For real TDD, you need to use the agent workflow above and explicitly prompt: “Write tests for username validation that enforces these rules: [your spec]. Do not write any implementation code.”
Aider: automated red-green loop via --auto-test
Aider’s TDD integration is the tightest of any tool in the category because the test-run loop is baked into the architecture, not bolted on via a prompt. If you haven’t set Aider up yet, start with Aider with Local LLM via Ollama in 2026 — in particular the context window configuration, which affects how reliably Aider tracks failing tests.
aider --test-cmd "pytest tests/" --auto-test
With --auto-test enabled, Aider runs your test suite after every code change. If tests fail, it reads the failure output, reads the changed code, proposes a fix, and re-runs. This loop continues until tests pass or Aider gives up and asks you.
For a real TDD flow with Aider:
- Write your test file manually (or prompt Aider in a fresh session with no production code added to the chat).
- Verify the tests fail: pytest tests/ should show red.
- Start Aider with the test command: aider src/feature.py tests/test_feature.py --test-cmd "pytest tests/test_feature.py" --auto-test
- Prompt: “Implement feature.py to pass the tests. Do not modify the test file.”
Aider will iterate — often 2–4 rounds — until the suite is green. This is the closest any tool comes to automated TDD without human intervention at each phase.
The context window trap still applies: keep tests and implementation files small. If Aider is loading 3,000+ lines of context, it starts losing track of which tests are failing and why. One feature, one test file, one session.
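A quick pre-flight check can enforce that budget before you start a session. This is a rough sketch: the helper names are made up, and the 3,000-line limit is the rule of thumb from the paragraph above, not a documented Aider threshold.

```python
from pathlib import Path

MAX_CONTEXT_LINES = 3000  # rule of thumb from the text; tune for your model


def context_lines(paths) -> int:
    """Total number of lines across the files you plan to add to the session."""
    return sum(
        len(Path(p).read_text(encoding="utf-8").splitlines()) for p in paths
    )


def fits_budget(paths, limit: int = MAX_CONTEXT_LINES) -> bool:
    """True when the combined files stay under the line budget."""
    return context_lines(paths) < limit


# Example (paths are the ones used in the Aider command above):
#     fits_budget(["src/feature.py", "tests/test_feature.py"])
```

If the check fails, split the feature or the test file before starting the session.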
Cline: Plan mode for specs, Act mode for implementation
Cline’s Plan/Act toggle maps directly onto the Red/Green boundary.
Use Plan mode to write the spec and tests. Cline will explore your codebase, ask clarifying questions about edge cases, and produce a test file. Because Plan mode doesn’t execute code or write to files, you have a review window before any implementation runs.
Once the test file looks right and you’ve verified it fails:
- Switch to Act mode.
- Prompt: “Implement [feature] to pass the tests in tests/test_feature.py. Do not modify the test file. Run the tests after each attempt.”
Cline’s real-time terminal watching means it sees test failures as they happen and can self-correct without requiring you to copy-paste error messages. The community TDD starter template at ppeach/cline-starter-TDD provides pre-configured .clinerules for this workflow. For the broader Cline + local model setup context, see Cline + Local LLM Privacy-First Setup in 2026.
Add this to your .clinerules to prevent mode collapse:
TDD discipline:
- Never modify test files after the spec has been approved in Plan mode
- Treat a passing test suite as "done" — do not add features beyond what tests specify
- If asked to "add tests after the fact," refuse and ask the user to switch to Plan mode first
Writing specs AI won’t corrupt
The spec you write before tests determines everything. A spec that reads like a description of an obvious implementation will produce tests that merely restate that implementation.
The trick from the dev.to TDD-with-AI article: avoid solution keywords in your spec. The canonical example is FizzBuzz. If your spec says “return ‘Fizz’ for multiples of 3,” AI generates tests that hard-code ‘Fizz’. If your spec says “return the first marker for multiples of the first divisor,” the tests check behavior, not strings.
Practically:
- Specify inputs and expected outputs as tables, not prose.
- Include boundary cases explicitly: “what happens at exactly the tier threshold, not just above and below it.”
- Specify what should fail — not just happy paths.
- Never include implementation hints in the spec (“use a dictionary to map…” → “the function should return… [table]”).
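The checklist above can be sketched as a spec expressed as data. Everything here is a hypothetical example: the tier thresholds, the `tier_for` name, and the choice of ValueError for invalid input. The implementation is included only so the block runs; in a real Red phase only the tables and the tests would exist.

```python
# Spec as a table: (annual_spend, expected_tier). Boundary rows are explicit.
SPEC = [
    (0, "basic"),
    (999, "basic"),    # just below the silver threshold
    (1000, "silver"),  # exactly at the silver threshold
    (1001, "silver"),  # just above it
    (4999, "silver"),  # just below the gold threshold
    (5000, "gold"),    # exactly at the gold threshold
    (5001, "gold"),
]

# Spec for what should fail: negative spend is rejected with ValueError.
INVALID = [-1, -1000]


def tier_for(annual_spend: int) -> str:
    """Reference implementation, written after the tables above."""
    if annual_spend < 0:
        raise ValueError("annual_spend must be non-negative")
    if annual_spend >= 5000:
        return "gold"
    if annual_spend >= 1000:
        return "silver"
    return "basic"


def test_spec_table():
    for spend, expected in SPEC:
        assert tier_for(spend) == expected, f"spend={spend}"


def test_invalid_inputs_rejected():
    for spend in INVALID:
        try:
            tier_for(spend)
        except ValueError:
            continue
        raise AssertionError(f"spend={spend} should be rejected")
```

With pytest, the same tables map naturally onto `pytest.mark.parametrize` and `pytest.raises`. The plain loops here only keep the sketch dependency-free.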
What TDD still misses
TDD with AI is not a security review. It’s also not a substitute for the human-oversight discipline covered in the Vibe Coding Survival Guide — the two practices work together. AI-generated tests almost never cover:
- Auth bypass paths
- SQL injection or command injection in input handling
- Privilege escalation edge cases
- Business-logic fraud scenarios unique to your domain
These require adversarial thinking (“what is the worst thing a user could do?”), which AI doesn’t apply unless you explicitly prompt for it. Even then, AI generates the obvious cases (empty string, SQL single-quote injection) and misses attacks specific to your authorization model.
Security tests are human work. TDD with AI gets you correct behavior under normal conditions; a separate threat-modeling pass gets you resilience under adversarial conditions.
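For a sense of what such hand-written adversarial tests look like, here is a small sketch against a made-up authorization rule. The `User` and `Invoice` types, the tenant model, and `can_view_invoice` are all hypothetical; your own threat model decides which cases matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: int
    tenant_id: int
    role: str  # "member" or "admin"


@dataclass(frozen=True)
class Invoice:
    id: int
    tenant_id: int
    owner_id: int


def can_view_invoice(user: User, invoice: Invoice) -> bool:
    """Hypothetical rule: same tenant, and either the owner or an admin."""
    if user.tenant_id != invoice.tenant_id:
        return False
    return user.role == "admin" or user.id == invoice.owner_id


INVOICE = Invoice(id=1, tenant_id=10, owner_id=100)


def test_owner_can_view():
    # Happy path: the kind of case generated tests usually cover.
    assert can_view_invoice(User(100, 10, "member"), INVOICE)


def test_admin_from_other_tenant_is_denied():
    # Adversarial: an admin role must not cross tenant boundaries.
    assert not can_view_invoice(User(200, 20, "admin"), INVOICE)


def test_member_cannot_view_colleague_invoice():
    # Adversarial: same tenant, but neither the owner nor an admin.
    assert not can_view_invoice(User(101, 10, "member"), INVOICE)
```

The last two tests encode questions that came from thinking about who might abuse the rule, not from reading the function signature.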
Honest take
The setup cost is 20–30 minutes the first time you configure TDD rules for a project. After that, the per-feature overhead is one extra step: write a spec before prompting the agent.
Whether that tradeoff is worth it depends on what you’re building. For throwaway scripts and personal projects, the tautological test rate matters less — a bug you catch yourself in manual testing costs almost nothing. For production features, especially anything handling money, auth, or user data, the spec-first discipline is the only workflow where AI tools actually increase your confidence rather than inflate your coverage numbers while hiding bugs.
The tool that makes TDD easiest: Aider, because the test loop is automated at the tool level, not the prompt level. The tool with the most friction removed: VS Code Copilot’s three-agent model with phase separation. Cursor gives you the most control if you configure the rules correctly; Cline is the best option if you want Plan mode’s interactive spec review before any code runs.
All of them require that you actually check the tests before advancing phases. The checkpoints only work if you use them.
Sources
- Enforcing TDD in Agentic AI CLIs and IDEs — Medium
- My LLM Coding Workflow Going Into 2026 — Addy Osmani
- Rethinking TDD in the Age of AI Code Generation — DEV Community
- Set Up a Test-Driven Development Flow in VS Code — VS Code Docs
- GitHub for Beginners: TDD with GitHub Copilot — GitHub Blog
- Using TDD to Get Better AI-Generated Code — DEV Community
- Aider options reference: --test-cmd, --auto-test — aider.chat
- TDD Mode discussion — cline/cline GitHub Discussions #535
- Cline TDD Starter Template — ppeach/cline-starter-TDD
Last updated May 16, 2026. Tool features change frequently; verify current documentation before adopting any configuration shown here.