Tinker AI

Pairing writing-plans and executing-plans: the spec-to-merge pipeline

Published 2026-05-11 by Owner

The painful part of autonomous coding isn’t the code itself — it’s the gap between a vague instruction and a reliable outcome. Give Claude Code “implement the auth flow” and it will produce something. Whether that something matches what you wanted depends entirely on which architectural decisions it made in the first turn, without asking.

The writing-plans and executing-plans skills exist to close that gap. Together they impose a two-stage structure: first produce a detailed plan that can be reviewed and corrected, then execute it step-by-step with TDD discipline. The plan is the handoff contract. If it’s good, execution is nearly mechanical. If it’s weak, you catch the problem at review time instead of after ten minutes of misguided tool calls.

The handoff sequence

The pipeline has three stages, and the boundaries matter.

Stage one: brainstorming. Use the superpowers:brainstorming skill to explore the problem before touching any files. This is the spec-generation step. You’re not planning yet; you’re aligning on what to build, what the constraints are, and what the edge cases look like. The output of brainstorming is a spec — a document that describes the feature well enough that someone else could plan its implementation without asking you questions.

Stage two: writing-plans. Hand the spec to superpowers:writing-plans. This skill turns a spec into a task plan designed for step-by-step execution. The prompt is usually short: “Here is the spec. Produce a plan.” What comes back is a numbered list of tasks, each with the specific files to touch, the complete code to write, the exact commands to run, and the expected output of those commands.

Stage three: executing-plans. Hand the plan to superpowers:executing-plans. The skill works through the checklist in order, checking off each task as it completes, running the verification steps, and stopping for review at natural checkpoints. Alternatively, hand the plan to superpowers:subagent-driven-development, which dispatches a fresh subagent per task instead of running everything inline.

The critical rule: don’t skip stage two. The temptation is to go from spec straight to execution, treating the plan as overhead. Skipping it transfers all the architectural decision-making into the first execution turn, where there’s no review opportunity.

What a good plan looks like

The writing-plans skill produces plans with a specific structure. The quality of the output depends on the quality of the spec, but a well-formed plan has these properties regardless of the feature:

Exact file paths, not directions. Not “create a utility module” but src/lib/auth/token.ts. Not “update the schema” but src/db/schema.ts, line 47, adding a specific column. The executor shouldn’t have to make location decisions.

Complete code in every step. No placeholder comments, no “implement this function here”, no TODO blocks. If a step requires a function, the plan shows the complete function body. This sounds redundant — writing the code in the plan before writing it in the file — but it’s what lets the executor work without making design decisions.

Exact commands with expected output. Not “run the tests” but:

```bash
bun run test src/lib/auth/token.test.ts
# Expected:
# ✓ resolves valid token (12ms)
# ✓ rejects expired token (3ms)
# ✓ rejects tampered payload (4ms)
# Test Files  1 passed (1)
```

If the actual output differs from the expected output, something is wrong and execution should pause. The expected output is part of the verification contract.
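A minimal sketch of that contract, assuming expected lines are matched in order as substrings of the actual output (the matching rule here is an illustration, not the skill's actual implementation):

```typescript
// Sketch: treat the plan's expected output as an ordered checklist of
// substrings that must all appear in the actual command output.
function outputMatches(actual: string, expectedLines: string[]): boolean {
  const actualLines = actual.split('\n');
  let cursor = 0;
  for (const expected of expectedLines) {
    const idx = actualLines.findIndex(
      (line, i) => i >= cursor && line.includes(expected),
    );
    if (idx === -1) return false; // divergence: pause execution for review
    cursor = idx + 1;
  }
  return true;
}
```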

Bite-sized tasks. Each task should take 2–5 minutes in isolation. A task that covers “implement the full refresh-token flow” is too large; it bundles multiple decisions and makes rollback hard. The right granularity: one file created, one function added, one test suite passing.

Checkboxes for tracking. Each task is a markdown checkbox. Executing-plans uses these as state — checked means done and verified, unchecked means pending. The checkboxes also let you restart from the middle of a plan if execution is interrupted.
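The checkbox convention is simple enough to sketch. Assuming tasks are `- [ ]` / `- [x]` markdown lines (the `Task` shape below is illustrative, not the skill's internal format), restarting means finding the first unchecked task:

```typescript
// Illustrative sketch: parse markdown checkboxes as execution state.
interface Task {
  line: number;   // 1-based line number in the plan file
  done: boolean;  // [x] means done and verified
  text: string;
}

function parseTasks(planMarkdown: string): Task[] {
  const tasks: Task[] = [];
  planMarkdown.split('\n').forEach((line, i) => {
    const m = line.match(/^\s*- \[( |x)\] (.*)$/);
    if (m) tasks.push({ line: i + 1, done: m[1] === 'x', text: m[2] });
  });
  return tasks;
}

// Resume point after an interruption: the first unchecked task.
const nextTask = (plan: string): Task | undefined =>
  parseTasks(plan).find((t) => !t.done);
```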

A plan fragment that meets these criteria looks like this:

## Task 3: Add token validation utility

- [ ] Create `src/lib/auth/token.ts`:

```ts
import { createHmac } from 'crypto';

export function validateToken(token: string, secret: string): boolean {
  const [header, payload, sig] = token.split('.');
  if (!header || !payload || !sig) return false;
  const expected = createHmac('sha256', secret)
    .update(`${header}.${payload}`)
    .digest('base64url');
  return expected === sig;
}
```

- [ ] Create `src/lib/auth/token.test.ts`:

```ts
import { createHmac } from 'crypto';
import { describe, it, expect } from 'vitest';
import { validateToken } from './token';

// Test helper: build a signed token in the format validateToken expects.
function buildToken(payload: object, secret: string): string {
  const header = Buffer.from(JSON.stringify({ alg: 'HS256' })).toString('base64url');
  const body = Buffer.from(JSON.stringify(payload)).toString('base64url');
  const sig = createHmac('sha256', secret)
    .update(`${header}.${body}`)
    .digest('base64url');
  return `${header}.${body}.${sig}`;
}

describe('validateToken', () => {
  it('accepts a correctly signed token', () => {
    const token = buildToken({ sub: '123' }, 'secret');
    expect(validateToken(token, 'secret')).toBe(true);
  });
  it('rejects a tampered payload', () => {
    const token = buildToken({ sub: '123' }, 'secret');
    const [h, , s] = token.split('.');
    expect(validateToken(`${h}.tampered.${s}`, 'secret')).toBe(false);
  });
});
```

- [ ] Run verification:

```bash
bun run test src/lib/auth/token.test.ts
# Expected: 2 tests passing
```

Each step is discrete, reversible, and self-verifying. The executor doesn’t have to think — it has to read.
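The inline execution loop this enables can be sketched in a few lines. `PlanTask` and its fields are assumptions for illustration; the real skill works from the markdown plan itself:

```typescript
// Sketch of the inline execution loop: run tasks in order, verify each
// against its expected output, stop at the first divergence.
interface PlanTask {
  description: string;
  run: () => string;  // performs the step and returns command output
  expected: string;   // substring the output must contain
}

function executePlan(tasks: PlanTask[]): { completed: number; failed?: string } {
  let completed = 0;
  for (const task of tasks) {
    const output = task.run();
    if (!output.includes(task.expected)) {
      // Divergence from the verification contract: pause for review.
      return { completed, failed: task.description };
    }
    completed++; // a real executor would also check off the checkbox here
  }
  return { completed };
}
```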

Inline execution vs subagent-driven execution

After writing the plan, there are two execution paths.

executing-plans (inline): All tasks run in one Claude Code session. The executor maintains continuous context across the full plan. Good for tightly coupled tasks where an early decision affects how later code is written — for example, a new database schema and the query layer that uses it. The executor can see what it wrote in step 2 when it reaches step 7.

subagent-driven-development (subagent dispatch): Each task is handed to a fresh subagent that has no memory of previous tasks. Good for independent tasks that can be verified in isolation — a batch of new UI components, a set of unrelated API endpoints, a collection of test suites. Because each subagent starts clean, the tasks are also parallel-safe; if the plan is structured to avoid file conflicts, multiple subagents can work simultaneously.
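The parallel-safety condition — no two simultaneous tasks touch the same file — can be sketched as a batching pass over the plan. The `TaskSpec` shape is an assumption about how a plan lists files per task:

```typescript
// Sketch of the parallel-safety check: group tasks into batches whose
// file sets don't overlap, so each batch can be dispatched to subagents
// simultaneously without write conflicts.
interface TaskSpec { name: string; files: string[] }

function batchParallelSafe(tasks: TaskSpec[]): TaskSpec[][] {
  const batches: TaskSpec[][] = [];
  for (const task of tasks) {
    // Place the task in the first batch that touches none of its files.
    const batch = batches.find((b) =>
      b.every((t) => t.files.every((f) => !task.files.includes(f))),
    );
    if (batch) batch.push(task);
    else batches.push([task]);
  }
  return batches;
}
```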

The tradeoff is context versus isolation. Inline execution accumulates context that can become useful (the executor “remembers” the function it wrote three tasks ago and imports it correctly) but can also become harmful (the executor starts making implicit assumptions based on earlier choices that weren’t explicit in the plan). Subagent execution starts fresh each time, which means no drift — but the subagent also can’t benefit from knowing what the previous tasks produced.

The subagent-driven-development skill also includes a two-stage reviewer subagent. After the implementer subagent completes a task, the reviewer runs two passes: first, does this implementation satisfy the spec (spec compliance)? Second, is the code itself well-written (code quality)? The two passes are separate so the reviewer doesn’t conflate “the spec was ambiguous” with “the code is bad.” This mirrors how a good human code review works — product correctness and technical quality are distinct questions.
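The separation can be made concrete as two independent verdicts, neither of which can mask the other. The shapes below are illustrative assumptions, not the skill's actual schema:

```typescript
// Sketch of the two-pass review: spec compliance and code quality are
// evaluated independently, so the notes tell the implementer whether the
// problem is a spec gap or a code problem.
interface ReviewPass { pass: boolean; notes: string[] }
interface TaskReview {
  specCompliance: ReviewPass; // does the implementation satisfy the spec?
  codeQuality: ReviewPass;    // is the code itself well-written?
}

function reviewVerdict(review: TaskReview): 'approve' | 'revise' {
  // Either failing pass blocks the task.
  return review.specCompliance.pass && review.codeQuality.pass
    ? 'approve'
    : 'revise';
}
```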

A failure that clarified the model

The most expensive mistake in my experience with this pipeline came from a plan that was too vague.

The spec described a new authentication system. The plan included a task that read: “Implement the token refresh flow.” One task, no file paths, no code, no test structure specified — just the label.

The executor, encountering that task, had to make architectural decisions without guidance: where the refresh endpoint would live, whether refresh tokens would be stored in the database or Redis, how the rotation policy would work, what the error responses would look like. It picked reasonable defaults. They weren’t wrong exactly, but they were different from the rest of the codebase’s conventions, because the plan never told it to follow those conventions.

The result was a refresh flow that worked in isolation but sat awkwardly against the rest of the auth system. Refactoring it took longer than writing it correctly would have. The problem wasn’t the executor’s quality — it was that the plan had transferred an architectural decision (where to store refresh tokens) to the execution phase, where there’s no review opportunity.

The fix was straightforward: rewrite the task as three smaller tasks, each with the exact file paths, the complete implementation code, and the specific test expectations. The vague “implement the refresh flow” became:

  1. Add refresh_token column to src/db/schema.ts (specific migration code included)
  2. Create src/lib/auth/refresh.ts with the rotation logic (complete function body included)
  3. Add POST /api/auth/refresh endpoint in src/pages/api/auth/refresh.ts (complete handler included)

None of those tasks had room for architectural improvisation. The executor ran them correctly on the first pass.

The lesson: if any task in a plan could be interpreted multiple ways by a reasonable person, it’s too vague for execution. The executor’s job is to follow instructions, not fill gaps. Gap-filling is the planner’s job, which means the writing-plans stage needs to fill every gap before execution begins.
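That rule can even be checked mechanically before execution starts. A rough sketch, assuming two heuristics — a task needs at least one exact file path and one fenced code block (real vagueness is subtler than this):

```typescript
// Illustrative vagueness lint for plan tasks. Returns a list of problems;
// an empty list means the task passes these (deliberately crude) checks.
function lintTask(taskMarkdown: string): string[] {
  const problems: string[] = [];
  // Exact file path: look for something like src/lib/auth/token.ts.
  if (!/[\w./-]+\.\w+/.test(taskMarkdown)) {
    problems.push('no exact file path');
  }
  // Complete code: look for a fenced code block.
  if (!taskMarkdown.includes('```')) {
    problems.push('no code block');
  }
  return problems;
}
```

Run against the failure above, "Implement the token refresh flow" fails both checks; the three rewritten tasks pass.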

Where this leaves you

The spec-to-merge pipeline adds a step most people skip. Brainstorming → spec → plan → execution takes longer to start than just telling Claude Code to build the feature. For small, well-understood tasks, the overhead isn’t worth it.

For anything non-trivial — features that touch more than two files, require new architectural decisions, or need to interoperate with existing systems — the planning stage pays back its cost in the execution stage. An hour of careful spec-writing and plan review produces a two-hour execution run with no architectural surprises. Skipping it produces a two-hour execution run followed by a one-hour architectural rework.

The writing-plans skill can also be used as a sanity check without committing to the full pipeline. Hand it a spec, read the plan it produces, and see if the plan reveals gaps in the spec that you hadn’t noticed. Often the act of producing a plan surfaces ambiguities that weren’t visible in the spec. You don’t have to use executing-plans afterward — the planning pass alone is useful signal.

As these skills evolve, the gap between a good spec and a reliable execution run will narrow further. The current model already handles most well-specified tasks without intervention. The remaining failure modes are almost always traceable to a plan that was too vague in at least one step.