I built a small CLI tool over a week using Cline with strict test-first discipline. The tool: a config file linter that checks for common mistakes in Docker Compose files. About 1200 lines of TypeScript.
The interesting finding: TDD with Cline produces noticeably cleaner code than my normal Cline workflow does. The tests act as a forcing function on Cline’s tendency toward sprawl.
The discipline
The rule I held: every feature starts with tests. No exceptions.
The flow:
- Write a test that describes a behavior I want
- Verify the test fails (the behavior doesn’t exist yet)
- Ask Cline to make the test pass
- Review Cline’s code
- Refactor if needed
- Move on
The “ask Cline to make the test pass” step is where Cline operates. The other steps are mine — the test design, the failure verification, the review, the refactor.
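To make the loop concrete, here is roughly what a step-one test looked like. This is a sketch assuming a Vitest-style runner; `lintCompose` and the rule id are illustrative stand-ins, not the tool’s actual API.

```ts
// tests/image-tag.test.ts (illustrative)
import { describe, it, expect } from "vitest";
// This import fails at first: the behavior doesn't exist yet, which is the point.
import { lintCompose } from "../src/lint";

describe("image-without-tag rule", () => {
  it("flags a service whose image has no explicit tag", () => {
    const compose = [
      "services:",
      "  web:",
      "    image: nginx",
    ].join("\n");

    const issues = lintCompose(compose);

    expect(issues).toContainEqual(
      expect.objectContaining({ rule: "image-without-tag", service: "web" })
    );
  });
});
```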
Why this changes Cline’s behavior
Without explicit tests, Cline’s autonomous behavior tends toward “make it work somehow.” The agent may add edge case handling I didn’t ask for, defensive code that’s not needed, abstractions that anticipate future needs.
With explicit tests as the spec, Cline’s job is “make these specific assertions pass.” The output is more focused. The defensive code disappears. The unnecessary abstractions go away.
Specifically:
Less code per feature. Without TDD, Cline might write 80 lines of code with edge case handling. With TDD, Cline writes 30 lines that pass the 5 specific tests I wrote. If I want more cases handled, I write more tests.
Less premature abstraction. Cline doesn’t extract helper functions unless multiple tests share patterns. Without TDD, Cline sometimes invents abstractions for one usage.
Fewer comments. Cline tends to add comments that restate the code. With clear tests, those comments are unnecessary; the test names explain the intent.
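To make the “less code” point concrete, here is the size of implementation the tests tend to pull out of Cline. This is a hypothetical rule, not the tool’s real code; the types are inlined to keep the sketch self-contained.

```ts
// A rule sized to the tests that exist and nothing more (illustrative names and types).
interface Service { image?: string }
interface ComposeFile { services: Record<string, Service> }
interface Issue { rule: string; service: string; severity: "warning" | "error" }

export function imageWithoutTag(file: ComposeFile): Issue[] {
  return Object.entries(file.services)
    .filter(([, svc]) =>
      typeof svc.image === "string" &&
      !svc.image.includes(":")) // simplified: "has a colon" stands in for "has a tag"
    .map(([name]): Issue => ({
      rule: "image-without-tag",
      service: name,
      severity: "warning",
    }));
}
```

No config plumbing, no speculative options, no handling for inputs the tests never exercise. If I want more cases handled, I write more tests first.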
What I tested first
The CLI tool’s structure:
- A parser for Docker Compose files
- A set of lint rules
- A reporter that formats results
- A CLI driver that ties it together
For each, I wrote tests covering:
Parser:
- Parses valid Docker Compose 3.8
- Parses Docker Compose 2.x with backward compatibility
- Reports specific errors for common syntax issues
Lint rules (one test per rule):
- “Image without tag” rule
- “Privileged container” rule
- “Missing healthcheck” rule
- …etc, about 15 rules total
Reporter:
- Formats text output (default)
- Formats JSON output
- Sorts by severity
- Excludes resolved issues
CLI:
- Handles --help
- Reads file from argument
- Reports exit code 0 on no issues, 1 on warnings, 2 on errors
- Respects the --quiet flag
About 80 tests total. Took me 2 days to write them all (more than I’d budget for production work, but I wanted comprehensive coverage).
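As an illustration, the CLI exit-code tests looked roughly like this (Vitest-style again; `runCli` and the fixture paths are invented for the example, not the tool’s real entry point):

```ts
import { describe, it, expect } from "vitest";
import { runCli } from "../src/cli"; // hypothetical entry point that returns an exit code

describe("exit codes", () => {
  it("returns 0 when the file has no issues", async () => {
    expect(await runCli(["fixtures/clean.compose.yml"])).toBe(0);
  });

  it("returns 1 when only warnings are found", async () => {
    expect(await runCli(["fixtures/warnings-only.compose.yml"])).toBe(1);
  });

  it("returns 2 when at least one error is found", async () => {
    expect(await runCli(["fixtures/errors.compose.yml"])).toBe(2);
  });
});
```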
The implementation
With the tests written, Cline did the implementation over the next 4 days. The pattern:
> /add tests/parser.test.ts
> /add src/parser.ts (currently empty)
> implement the parser to make all tests in parser.test.ts pass.
> Use the js-yaml library. Do not modify the tests.
Cline read the test file, inferred the expected behavior from the assertions, and implemented the parser. About 30 minutes for the parser; tests passed on the first attempt.
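For a sense of what the tests drove, the parser ended up shaped roughly like the sketch below. This is my reconstruction assuming js-yaml’s `load`; the names (`parseCompose`, `ComposeFile`, `ParseError`) are illustrative rather than the tool’s actual exports.

```ts
// src/parser.ts -- a minimal sketch of the shape, not the real implementation
import yaml from "js-yaml";

export interface ComposeService {
  image?: string;
  privileged?: boolean;
  healthcheck?: unknown;
}

export interface ComposeFile {
  version?: string;
  services: Record<string, ComposeService>;
}

export class ParseError extends Error {}

export function parseCompose(source: string): ComposeFile {
  let doc: unknown;
  try {
    doc = yaml.load(source); // both 2.x and 3.x compose files are plain YAML to js-yaml
  } catch (err) {
    throw new ParseError(`invalid YAML: ${(err as Error).message}`);
  }
  if (typeof doc !== "object" || doc === null || !("services" in doc)) {
    throw new ParseError("compose file must define a 'services' mapping");
  }
  return doc as ComposeFile;
}
```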
Repeat for each component. Total Cline time: ~12 hours over 4 days. Most of my time was on test design, code review, and the occasional non-Cline work (project setup, dependency upgrades, etc.).
What surprised me
A few things I noticed:
Cline got better as the test suite grew. With 80 tests covering varied behaviors, Cline could pattern-match my testing style for new tests. When I asked for “tests for the new privilege escalation rule,” Cline produced tests in the same style as the existing ones.
Refactors were trivially safe. With high coverage, I refactored fearlessly. Asked Cline to extract a helper, ran tests, reverted if any failed. Most refactors were one-shot successes.
Bug fixes were structured. When I found a bug, I’d write a failing test first, then ask Cline to fix the underlying code. The discipline meant every bug fix had a regression test.
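As an example of the pattern (the bug here is invented for illustration; `lintCompose` is the same hypothetical API as in the earlier sketches):

```ts
// Regression test written before asking Cline for the fix.
import { it, expect } from "vitest";
import { lintCompose } from "../src/lint";

it("does not report image-without-tag for images pinned by digest", () => {
  const compose = [
    "services:",
    "  web:",
    "    image: nginx@sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
  ].join("\n");

  const issues = lintCompose(compose);

  expect(issues.filter((i) => i.rule === "image-without-tag")).toHaveLength(0);
});
```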
The codebase stayed clean. No “let me just add this here” creep. Each addition started as a test and went through the discipline.
What didn’t go as well
A few rough edges:
Test design is its own skill. Some tests I wrote were too specific (testing implementation details rather than behavior). When I refactored, those tests broke even though the behavior was still correct. Lesson: test behavior, not implementation.
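An illustration of the difference, with invented names: the first test couples to internal structure and breaks on refactor; the second pins the observable behavior.

```ts
import { it, expect } from "vitest";
import { lintCompose } from "../src/lint"; // hypothetical public API
import { RULES } from "../src/rules";      // hypothetical internal registry

// Too implementation-specific: fails if the registry is restructured,
// even when linting still works.
it("registers image-without-tag in the RULES array", () => {
  expect(RULES.map((r) => r.id)).toContain("image-without-tag");
});

// Behavior-focused: survives refactors as long as the output stays correct.
it("flags an untagged image", () => {
  const issues = lintCompose("services:\n  web:\n    image: nginx\n");
  expect(issues.some((i) => i.rule === "image-without-tag")).toBe(true);
});
```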
Cline sometimes wrote tests when I didn’t ask. I had to be explicit about “do not modify the tests.” Without that, Cline occasionally tweaked tests to match what it wrote, which defeats the discipline.
The test-writing phase was slower than expected. Writing 80 tests took longer than I planned, and my “1-week project” budget got tight; the upfront investment in tests was real.
Type system collisions. With strict TypeScript, some test patterns required type gymnastics that Cline didn’t always get right. Manual fixes required.
The cost
- Cline API spend: $18 over the week
- My time: ~25 hours (lower than typical for a 1200-line tool because tests carry the design weight)
- Test coverage: 95%
- Bugs reported after launch: 2 (both edge cases I hadn’t tested)
For comparison, my non-TDD Cline projects of similar size typically:
- Take ~30 hours
- Have 60-70% test coverage
- Have 5-10 bugs reported in the first month
The TDD version was faster, more thoroughly tested, and shipped fewer bugs. The investment in test discipline paid off across all dimensions.
What this taught me
The big lesson: TDD constrains AI tools toward better behavior. Without tests as a spec, AI tools’ tendency toward sprawl can produce verbose, over-abstracted code. With tests as a spec, the AI’s job is bounded; the output is constrained.
This isn’t surprising. TDD produces cleaner code without AI tools too. The interesting finding is that the effect is amplified with AI tools — the speed of generation makes the cleanliness benefit more valuable.
For greenfield work where requirements can be expressed as tests, TDD with Cline (or other AI tools) is genuinely better than non-TDD work. Faster overall, cleaner output, more thorough testing.
For exploratory work where requirements aren’t yet clear, TDD is still hard. Tests assume you know what you want. AI tools can help you figure out what you want; that work doesn’t fit TDD.
When I’d repeat this
I’d repeat this for projects where:
- Requirements can be expressed precisely
- The codebase will be maintained for a while
- Quality matters more than experiment speed
I’d skip it for projects where:
- Requirements are still being explored
- The code may be thrown away
- Speed matters more than maintainability
For the second category I’d still use AI tools, just without the TDD discipline.
The pragmatic adoption
For engineers wanting to try this:
- Pick a project where requirements are clear (small CLI tools, well-defined APIs, known data transformations)
- Write all your tests first, before any implementation
- Have AI fill in the implementations
- Add the rule “do not modify the tests” to every prompt (see the template after this list)
- Refactor freely once tests pass
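For the fourth point, a prompt template in the spirit of the parser prompt above (the paths are just an example):

> /add tests/rules.test.ts
> /add src/rules.ts (currently empty)
> implement the rules to make all tests in rules.test.ts pass.
> Do not modify the tests.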
The friction is in the second step. Writing comprehensive tests upfront feels slow. The payoff is everything that comes after: fast implementation, clean code, fearless refactoring, a low bug rate.
For the right kind of project, this is the most productive way I’ve found to use AI tools. The discipline cost is real; the productivity gain is larger.