I built a small CLI tool over a week using Cline with strict test-first discipline. The tool: a config file linter that checks for common mistakes in Docker Compose files. About 1200 lines of TypeScript.
The interesting finding: TDD with Cline produces noticeably cleaner code than my normal Cline workflow does. The tests act as a forcing function on Cline’s tendency toward sprawl.
The discipline
The rule I held: every feature starts with tests. No exceptions.
The flow:
- Write a test that describes a behavior I want
- Verify the test fails (the behavior doesn’t exist yet)
- Ask Cline to make the test pass
- Review Cline’s code
- Refactor if needed
- Move on
The “ask Cline to make the test pass” step is where Cline operates. The other steps are mine — the test design, the failure verification, the review, the refactor.
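To make the loop concrete, here is roughly what a step-one test looked like. This is a sketch assuming a Vitest-style runner; `lintCompose` and the rule id are illustrative stand-ins, not the tool’s actual API.

```ts
// tests/image-tag.test.ts (illustrative)
import { describe, it, expect } from "vitest";
// This import fails at first: the behavior doesn't exist yet, which is the point.
import { lintCompose } from "../src/lint";

describe("image-without-tag rule", () => {
  it("flags a service whose image has no explicit tag", () => {
    const compose = [
      "services:",
      "  web:",
      "    image: nginx",
    ].join("\n");

    const issues = lintCompose(compose);

    expect(issues).toContainEqual(
      expect.objectContaining({ rule: "image-without-tag", service: "web" })
    );
  });
});
```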
Why this changes Cline’s behavior
Without explicit tests, Cline’s autonomous behavior tends toward “make it work somehow.” The agent may add edge case handling I didn’t ask for, defensive code that’s not needed, abstractions that anticipate future needs.
With explicit tests as the spec, Cline’s job is “make these specific assertions pass.” The output is more focused. The defensive code disappears. The unnecessary abstractions go away.
Specifically:
Less code per feature. Without TDD, Cline might write 80 lines of code with edge case handling. With TDD, Cline writes 30 lines that pass the 5 specific tests I wrote. If I want more cases handled, I write more tests.
Less premature abstraction. Cline doesn’t extract helper functions unless multiple tests share patterns. Without TDD, Cline sometimes invents abstractions for one usage.
Fewer comments. Cline tends to add comments that restate the code. With clear tests, those comments are unnecessary; the test names explain the intent.
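To make the “less code” point concrete, here is the size of implementation the tests tend to pull out of Cline. This is a hypothetical rule, not the tool’s real code; the types are inlined to keep the sketch self-contained.

```ts
// A rule sized to the tests that exist and nothing more (illustrative names and types).
interface Service { image?: string }
interface ComposeFile { services: Record<string, Service> }
interface Issue { rule: string; service: string; severity: "warning" | "error" }

export function imageWithoutTag(file: ComposeFile): Issue[] {
  return Object.entries(file.services)
    .filter(([, svc]) =>
      typeof svc.image === "string" &&
      !svc.image.includes(":")) // simplified: "has a colon" stands in for "has a tag"
    .map(([name]): Issue => ({
      rule: "image-without-tag",
      service: name,
      severity: "warning",
    }));
}
```

No config plumbing, no speculative options, no handling for inputs the tests never exercise. If I want more cases handled, I write more tests first.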
What I tested first
The CLI tool’s structure:
- A parser for Docker Compose files
- A set of lint rules
- A reporter that formats results
- A CLI driver that ties it together
For each, I wrote tests covering:
Parser:
- Parses valid Docker Compose 3.8
- Parses Docker Compose 2.x with backward compatibility
- Reports specific errors for common syntax issues
Lint rules (one test per rule):
- “Image without tag” rule
- “Privileged container” rule
- “Missing healthcheck” rule
- …etc, about 15 rules total
Reporter:
- Formats text output (default)
- Formats JSON output
- Sorts by severity
- Excludes resolved issues
CLI:
- Handles --help
- Reads file from argument
- Reports exit code 0 on no issues, 1 on warnings, 2 on errors
- Respects the --quiet flag
About 80 tests total. Took me 2 days to write them all (more than I’d budget for production work, but I wanted comprehensive coverage).
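As an illustration, the CLI exit-code tests looked roughly like this (Vitest-style again; `runCli` and the fixture paths are invented for the example, not the tool’s real entry point):

```ts
import { describe, it, expect } from "vitest";
import { runCli } from "../src/cli"; // hypothetical entry point that returns an exit code

describe("exit codes", () => {
  it("returns 0 when the file has no issues", async () => {
    expect(await runCli(["fixtures/clean.compose.yml"])).toBe(0);
  });

  it("returns 1 when only warnings are found", async () => {
    expect(await runCli(["fixtures/warnings-only.compose.yml"])).toBe(1);
  });

  it("returns 2 when at least one error is found", async () => {
    expect(await runCli(["fixtures/errors.compose.yml"])).toBe(2);
  });
});
```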
The implementation
With the tests written, Cline did the implementation over the next 4 days. The pattern:
> /add tests/parser.test.ts
> /add src/parser.ts (currently empty)
> implement the parser to make all tests in parser.test.ts pass.
> Use the js-yaml library. Do not modify the tests.
Cline read the test file, inferred the expected behavior from the assertions, and implemented the parser. About 30 minutes for the parser; tests passed on the first attempt.
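For a sense of what the tests drove, the parser ended up shaped roughly like the sketch below. This is my reconstruction assuming js-yaml’s `load`; the names (`parseCompose`, `ComposeFile`, `ParseError`) are illustrative rather than the tool’s actual exports.

```ts
// src/parser.ts -- a minimal sketch of the shape, not the real implementation
import yaml from "js-yaml";

export interface ComposeService {
  image?: string;
  privileged?: boolean;
  healthcheck?: unknown;
}

export interface ComposeFile {
  version?: string;
  services: Record<string, ComposeService>;
}

export class ParseError extends Error {}

export function parseCompose(source: string): ComposeFile {
  let doc: unknown;
  try {
    doc = yaml.load(source); // both 2.x and 3.x compose files are plain YAML to js-yaml
  } catch (err) {
    throw new ParseError(`invalid YAML: ${(err as Error).message}`);
  }
  if (typeof doc !== "object" || doc === null || !("services" in doc)) {
    throw new ParseError("compose file must define a 'services' mapping");
  }
  return doc as ComposeFile;
}
```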
Repeat for each component. Total Cline time: ~12 hours over 4 days. Most of my time was on test design, code review, and the occasional non-Cline work (project setup, dependency upgrades, etc.).
What surprised me
A few things I noticed:
Cline got better as the test suite grew. With 80 tests covering varied behaviors, Cline could pattern-match my testing style for new tests. When I asked for “tests for the new privilege escalation rule,” Cline produced tests in the same style as the existing ones.
Refactors were trivially safe. With high coverage, I refactored fearlessly. Asked Cline to extract a helper, ran tests, reverted if any failed. Most refactors were one-shot successes.
Bug fixes were structured. When I found a bug, I’d write a failing test first, then ask Cline to fix the underlying code. The discipline meant every bug fix had a regression test.
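As an example of the pattern (the bug here is invented for illustration; `lintCompose` is the same hypothetical API as in the earlier sketches):

```ts
// Regression test written before asking Cline for the fix.
import { it, expect } from "vitest";
import { lintCompose } from "../src/lint";

it("does not report image-without-tag for images pinned by digest", () => {
  const compose = [
    "services:",
    "  web:",
    "    image: nginx@sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
  ].join("\n");

  const issues = lintCompose(compose);

  expect(issues.filter((i) => i.rule === "image-without-tag")).toHaveLength(0);
});
```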
The codebase stayed clean. No “let me just add this here” creep. Each addition started as a test and went through the discipline.
What didn’t go as well
A few rough edges:
Test design is its own skill. Some tests I wrote were too specific (testing implementation details rather than behavior). When I refactored, those tests broke even though the behavior was still correct. Lesson: test behavior, not implementation.
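An illustration of the difference, with invented names: the first test couples to internal structure and breaks on refactor; the second pins the observable behavior.

```ts
import { it, expect } from "vitest";
import { lintCompose } from "../src/lint"; // hypothetical public API
import { RULES } from "../src/rules";      // hypothetical internal registry

// Too implementation-specific: fails if the registry is restructured,
// even when linting still works.
it("registers image-without-tag in the RULES array", () => {
  expect(RULES.map((r) => r.id)).toContain("image-without-tag");
});

// Behavior-focused: survives refactors as long as the output stays correct.
it("flags an untagged image", () => {
  const issues = lintCompose("services:\n  web:\n    image: nginx\n");
  expect(issues.some((i) => i.rule === "image-without-tag")).toBe(true);
});
```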
Cline sometimes wrote tests when I didn’t ask. I had to be explicit about “do not modify the tests.” Without that, Cline occasionally tweaked tests to match what it wrote, which defeats the discipline.
The test-writing phase was slower than expected. Writing 80 tests took longer than I planned, and my “1-week project” budget got tight; the upfront investment in tests was real.
Type system collisions. With strict TypeScript, some test patterns required type gymnastics that Cline didn’t always get right. Manual fixes required.
The cost
- Cline API spend: $18 over the week
- My time: ~25 hours (lower than typical for a 1200-line tool because tests carry the design weight)
- Test coverage: 95%
- Bugs reported after launch: 2 (both edge cases I hadn’t tested)
For comparison, my non-TDD Cline projects of similar size typically:
- Take ~30 hours
- Have 60-70% test coverage
- Have 5-10 bugs reported in the first month
The TDD version was faster, more thoroughly tested, and shipped fewer bugs. The investment in test discipline paid off across all dimensions.
What this taught me
The big lesson: TDD constrains AI tools toward better behavior. Without tests as a spec, AI tools’ tendency toward sprawl can produce verbose, over-abstracted code. With tests as a spec, the AI’s job is bounded; the output is constrained.
This isn’t surprising. TDD produces cleaner code without AI tools too. The interesting finding is that the effect is amplified with AI tools — the speed of generation makes the cleanliness benefit more valuable.
For greenfield work where requirements can be expressed as tests, TDD with Cline (or other AI tools) is genuinely better than non-TDD work. Faster overall, cleaner output, more thorough testing.
For exploratory work where requirements aren’t yet clear, TDD is still hard. Tests assume you know what you want. AI tools can help you figure out what you want; that work doesn’t fit TDD.
When I’d repeat this
I’d repeat this for projects where:
- Requirements can be expressed precisely
- The codebase will be maintained for a while
- Quality matters more than experiment speed
I’d skip it for projects where:
- Requirements are still being explored
- The code may be thrown away
- Speed matters more than maintainability
For the second category I’d still use AI tools, just without the TDD discipline.
The pragmatic adoption
For engineers wanting to try this:
- Pick a project where requirements are clear (small CLI tools, well-defined APIs, known data transformations)
- Write all your tests first, before any implementation
- Have AI fill in the implementations
- Add the rule “do not modify the tests” to every prompt (see the template after this list)
- Refactor freely once tests pass
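For the fourth point, a prompt template in the spirit of the parser prompt above (the paths are just an example):

> /add tests/rules.test.ts
> /add src/rules.ts (currently empty)
> implement the rules to make all tests in rules.test.ts pass.
> Do not modify the tests.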
The friction is in the second step. Writing comprehensive tests upfront feels slow. The payoff is everything that comes after: fast implementation, clean code, fearless refactoring, a low bug rate.
For the right kind of project, this is the most productive way I’ve found to use AI tools. The discipline cost is real; the productivity gain is larger.