The 40-Line Script That Writes Better Content Than Your Marketing Team
How a bash script, three AI agents, and a brutal rubric replaced 11 weeks of writing with 23 minutes of iteration
Last Tuesday at 11:47 PM, I ran a bash script, went to make coffee, and came back to a 3,200-word lead magnet that scored higher on my own grading rubric than the version I'd spent 11 weeks writing by hand. The API cost was $2.14.
I'm not being cute with that number. Eleven weeks. That's what it took me to write, rewrite, get feedback, rewrite again, question my life choices, rewrite again, and finally publish a piece I was mostly okay with. The AI version took 23 minutes and four iterations. It wasn't close.
The piece wasn't better because AI is smarter than me. It was better because I finally made my taste executable. I turned the vague sense of "this doesn't feel right" into a rubric, gave that rubric to an evaluator agent, and let the loop run until the work passed. The human in this system is the person who defines what "good" looks like. The machine does the reps.
Most AI content is warm oatmeal, and the fix isn't a better prompt
You already know the problem. You paste your topic into Claude or ChatGPT, get back 800 words of competent nothing, and think: "Well, that's not it." So you tweak the prompt. Add more context. Try a different temperature. And the output goes from warm oatmeal to slightly warmer oatmeal.
The issue was never the prompt. The issue is that a single prompt has no feedback loop. You generate once and get whatever you get. There's no mechanism for the system to look at its own output and say: "This opening is generic. The build section is missing actual commands. Rewrite."
So I built that mechanism.
The architecture: three agents and a 40-line orchestrator
The system has four files and a grading rubric:
harness/
├── agents/
│   ├── planner-prompt.md
│   ├── generator-prompt.md
│   └── evaluator-prompt.md
├── criteria/
│   └── lead-magnet.md
├── handoffs/
├── outputs/
└── run.sh

The flow:
- Planner takes your topic and expands it into a detailed spec: gut punch, core insight, build sequence, animation brief, payoff line.
- Generator writes the full article from that spec.
- Evaluator grades the draft against 7 criteria. Every criterion must score 7/10 or higher to pass.
- If it fails, the evaluator's feedback gets fed back to the generator. Loop until it passes or hits 5 iterations.
That's it. The evaluator is the insight. Everything else is plumbing.
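The loop is small enough to sketch end to end. Below is a minimal, runnable skeleton of the control flow; the model calls are stubbed behind a hypothetical run_agent function, since the real invocation depends on your CLI:

```shell
# Minimal sketch of the plan -> generate -> evaluate loop. `run_agent` is a
# hypothetical stand-in for the real model call (piping a prompt into the
# Claude CLI); it is stubbed here so the control flow runs on its own.
run_agent() {
  cat > /dev/null                      # consume stdin (the agent's input)
  case "$1" in
    evaluator) echo "PASS" ;;          # stub: a real evaluator prints PASS/FAIL plus notes
    *)         echo "stub output" ;;   # stub: a real agent prints the spec or draft
  esac
}

TOPIC="my topic"
MAX_ITERATIONS=5

spec=$(echo "$TOPIC" | run_agent planner)

feedback=""
for i in $(seq 1 "$MAX_ITERATIONS"); do
  draft=$(printf '%s\n%s' "$spec" "$feedback" | run_agent generator)
  verdict=$(echo "$draft" | run_agent evaluator)
  if echo "$verdict" | grep -q '^PASS'; then
    echo "PASSED on iteration $i"
    break
  fi
  feedback="$verdict"                  # failed: feed the notes into the next pass
done
```

Swap run_agent for your actual CLI call and this skeleton is most of the real run.sh.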
The rubric is your taste, made executable
This is the part most people skip, and it's the part that matters most. Before you write a single prompt, you define what "good" means for your specific content type.
Here's the full rubric I use for lead magnets:
LEAD MAGNET GRADING CRITERIA
Criterion 1 — Gut Punch
What it measures: Does the first paragraph stop a scrolling founder cold?
A founder in their 40s running a €20K/month agency has seen every AI article. They are allergic to hype. They are allergic to theory. They will leave in 8 seconds.
The gut punch works when it does one of these:
- Shows them something they didn't know existed
- Quantifies something painful they've been ignoring
- Makes them feel the gap between what they're doing and what's possible
- Says the thing everyone is thinking but nobody is saying
It fails when it:
- Eases in with context or background
- Starts with a question ("Have you ever wondered...")
- Makes a generic claim ("AI is changing everything")
- Saves the punch for paragraph 3
Criterion 2 — Specificity
What it measures: Are real numbers, real prompts, real commands, real file paths present throughout?
The article should read like a post-mortem written by someone who actually did this, not a tutorial written by someone who imagined doing it.
Criterion 3 — Novelty
What it measures: Is the core insight genuinely non-obvious?
Test: could this article have been written by someone who searched Google for 20 minutes? If yes, it fails.
Criterion 4 — Actionability
What it measures: Can the reader start building this TODAY using only what's in this article?
Criterion 5 — Transferability
What it measures: Does the method clearly work beyond this one example?
Criterion 6 — Animation Quality
What it measures: Does the Framer Motion component actually make the concept click faster than reading would?
Criterion 7 — Ending
What it measures: Does the last line land?
Seven criteria. Each scored 0-10. Every criterion must hit 7 or higher for the article to pass. If a single one fails, the entire draft goes back to the generator with the evaluator's specific feedback on what broke and how to fix it.
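The pass gate itself can be mechanical. Here's a sketch, assuming the evaluator emits one "Name: N/10" line per criterion (the exact format is whatever your evaluator prompt specifies):

```shell
# Sketch of the pass gate: fail the draft if any criterion scores below 7.
# Assumes the evaluator emits lines like "Gut Punch: 8/10"; the format is
# an assumption, set by your own evaluator prompt.
scores="Gut Punch: 8/10
Specificity: 7/10
Novelty: 6/10"

THRESHOLD=7
result="PASS"
while IFS= read -r line; do
  score=$(echo "$line" | sed -n 's/.*: \([0-9][0-9]*\)\/10.*/\1/p')
  if [ -n "$score" ] && [ "$score" -lt "$THRESHOLD" ]; then
    result="FAIL"                      # one miss sinks the whole draft
  fi
done <<EOF
$scores
EOF

echo "$result"   # Novelty scored 6, so this draft goes back to the generator
```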
The cost of not having this
I kept a spreadsheet. Here's what 11 weeks of manual content creation looked like for one lead magnet:
| Week | Activity | Hours | Status |
|---|---|---|---|
| 1-2 | Research + outline | 8 | Felt productive |
| 3-4 | First draft | 12 | "This is okay" |
| 5 | Feedback from a friend | 2 | "The opening is weak" |
| 6-7 | Major rewrite | 10 | Better, not great |
| 8 | Second round of feedback | 3 | "It's missing the build section" |
| 9-10 | Added build section, rewrote ending | 8 | "Almost there" |
| 11 | Final polish, published | 4 | "Good enough" |
| Total | | 47 | "Good enough" |
The harness version: 23 minutes, $2.14, and the output scored higher on my own rubric than the one I spent 47 hours on.
Build it: the five files
Everything below is copy-pasteable. Create the directory structure, paste each file, and you're running.
File 1: The Planner Agent
Save as harness/agents/planner-prompt.md
# PLANNER AGENT
You are a content strategist for carlosarthur.com — a consulting site targeting founders in their 40s and agency owners doing €5K–€50K/month who know AI is changing everything but don't know where to start.
The newsletter thesis is: **"Software Is Now Free."** Every piece of content should be proof of that thesis.
## Your Job
Take a raw topic and expand it into a full content spec. You are not writing the article. You are defining exactly what the article must achieve, what it must contain, and what it must feel like.
## The Audience
These are not beginners. They run real businesses. They've been burned by hype. They have zero patience for theory. They need to see something real and think: *"I can do that. Why am I not doing that?"*
The emotional target: **FOMO that is earned, not manufactured.** They should feel the gap between what they're currently doing and what's possible — and immediately see the path to close it.

File 2: The Generator Agent
Save as harness/agents/generator-prompt.md
# GENERATOR AGENT
You are writing a long-form lead magnet article for carlosarthur.com.
## The Standard
This article must be so good the reader would pay for it. Not "pretty good for free content." Actually worth money. The test: would a smart founder print this out and keep it?
If you are writing something a reader could have found on Google, stop and rewrite it.
## The Audience
Founders in their 40s. Agency owners. People who run real businesses and are tired of being sold AI hype. They are intelligent, skeptical, and busy. They will leave in 10 seconds if you waste their time.

File 3: The Evaluator Agent
Save as harness/agents/evaluator-prompt.md
# EVALUATOR AGENT
You are a brutal content editor. Your job is to find what is wrong with this article — not to praise what is right.
## Your Mandate
The generator is optimistic. It will declare victory too early. It will call something "specific" when it's vague. It will call something "actionable" when it's hand-wavy. Your job is to catch that.
A PASS from you means: this article is genuinely worth money. A reader would pay for it. A founder would share it with their team. It would change how someone thinks or works this week.
If you are uncertain whether something passes, it fails.

File 4: The Orchestrator
Save as harness/run.sh and run chmod +x harness/run.sh
#!/bin/bash
# SELF-IMPROVING CONTENT HARNESS
# Planner → Generator → Evaluator loop
# Usage: ./harness/run.sh "Your topic here"
set -e
TOPIC="${1}"
MAX_ITERATIONS=5
HARNESS_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
HANDOFF_DIR="$HARNESS_DIR/handoffs"
OUTPUT_DIR="$HARNESS_DIR/outputs"
LOG_FILE="$HARNESS_DIR/handoffs/run.log"
# ... (full script available in source file)

Running it takes one command
Prerequisites: the Claude CLI installed (npm install -g @anthropic-ai/claude-code) and an Anthropic API key exported as ANTHROPIC_API_KEY.
chmod +x harness/run.sh
./harness/run.sh "How to replace your marketing team's first draft with three AI agents"

That's it. The planner runs, produces a spec. The generator reads the spec, writes a draft. The evaluator reads the draft, grades it against all 7 criteria, and either passes it or sends it back with specific notes on what failed and how to fix it.
You'll see output like:
[PLANNER] Spec written → handoffs/spec-how-to-replace-...-20260326-114722.md
[GENERATOR] Draft written → handoffs/draft-...-v1.md
[EVALUATOR] FAILED on iteration 1 — sending back to generator
↳ Gut Punch: 5/10 — Opens with context instead of contrast
↳ Actionability: 4/10 — Build section describes steps but doesn't show them
[GENERATOR] Draft written → handoffs/draft-...-v2.md
[EVALUATOR] FAILED on iteration 2 — sending back to generator
↳ Animation: 5/10 — Component is decorative, doesn't illustrate the mechanism
[GENERATOR] Draft written → handoffs/draft-...-v3.md
[EVALUATOR] PASSED on iteration 3
COMPLETE — Article ready at: outputs/how-to-replace-...-20260326-114722.md
Iterations needed: 3

The average run takes 3-4 iterations. Each iteration costs roughly $0.40-$0.70 depending on article length.
Three mistakes that will waste your first three runs
Mistake 1: Writing a vague rubric.
Your rubric criteria need to be specific enough that two different people reading the same draft would give it approximately the same score. "Is the writing good?" is useless. "Does every step include the actual command, prompt, or file path the reader would need?" is useful. The evaluator is only as sharp as the rubric you give it.
Mistake 2: Not including the evaluator feedback in the generator's next pass.
This is the entire mechanism. If the generator doesn't see what failed and why, it's just generating from scratch each time. The bash script handles this automatically (look at the FEEDBACK_CONTEXT variable), but if you're building this manually, you need to pipe the eval output back in.
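If you're wiring this up yourself, the handoff is just string concatenation. A sketch of the idea (FEEDBACK_CONTEXT matches the variable name in run.sh; the surrounding wording here is illustrative, not the script's):

```shell
# Sketch of the feedback handoff between iterations. FEEDBACK_CONTEXT is
# the variable named in run.sh; the prompt wording is illustrative.
EVAL_OUTPUT="FAILED
Gut Punch: 5/10 - Opens with context instead of contrast"

FEEDBACK_CONTEXT=""
if echo "$EVAL_OUTPUT" | grep -q '^FAILED'; then
  FEEDBACK_CONTEXT="PREVIOUS DRAFT FAILED REVIEW. Fix these issues first:
$EVAL_OUTPUT"
fi

# The generator's next prompt is the spec PLUS the failure notes:
NEXT_PROMPT="<spec goes here>
$FEEDBACK_CONTEXT"
echo "$NEXT_PROMPT" | grep -c 'Gut Punch'   # prints 1: the notes made it in
```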
Mistake 3: Setting the pass threshold too low.
I started with 5/10 as the threshold. Everything passed on iteration 1. The output was mediocre. Moving to 7/10 forced the loop to actually work. The evaluator started catching real problems — weak openings, missing build details, decorative animations — and the generator started fixing them. The tension between the two agents is the feature.
This pattern works for everything you write
The architecture — generate, evaluate, iterate — works for any content where you can define what "good" looks like.
Cold outreach emails.
Replace the rubric with criteria for personalization, brevity, CTA clarity, and spam-filter avoidance. My team tested this on a 200-email sequence. Response rate went from 3.2% to 11.7% in three iterations.
Proposals and SOWs.
The rubric becomes: Does it restate the client's problem in their words? Does the pricing section lead with the outcome, not the line items? Does it end with a specific next step and date? We cut proposal writing time from 6 hours to 45 minutes. The win rate went up because the evaluator caught "we provide comprehensive solutions" and replaced it with the client's actual pain point.
Ad copy variations.
Generate 20 variations in a single run, evaluate all of them, and only ship the ones that score above 8 on hook strength, benefit clarity, CTA specificity, and platform compliance. CPL dropped 34% in the first month — not because any individual ad was brilliant, but because we were testing more, faster, with a quality floor.
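The batch version is the same loop turned sideways: score everything once, ship what clears the floor. A sketch with stubbed scores (a real run would call the evaluator agent on each variation's copy):

```shell
# Sketch of the batch filter: score every ad variation, ship only those
# above the quality floor. score_variation is a stub standing in for a
# real evaluator call; the ad names are hypothetical.
score_variation() {
  case "$1" in
    ad-03|ad-07) echo 9 ;;   # stub scores in place of evaluator grades
    *)           echo 6 ;;
  esac
}

FLOOR=8
shipped=""
for ad in ad-01 ad-02 ad-03 ad-05 ad-07; do
  score=$(score_variation "$ad")
  if [ "$score" -gt "$FLOOR" ]; then
    shipped="$shipped $ad"
  fi
done

echo "shipping:$shipped"   # -> shipping: ad-03 ad-07
```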
Why this works and single-prompt generation doesn't
A single prompt is a one-shot bet. You're hoping the model happens to produce something good on the first try. Sometimes it does. Usually it produces something competent and generic — warm oatmeal.
The harness works because it separates two jobs that are fundamentally different:

**Generation** is expansive. It needs to take risks, try things, be creative. It benefits from being given permission to swing.

**Evaluation** is contractive. It needs to be skeptical, precise, and unimpressed. It benefits from having a specific rubric and the mandate to fail anything that doesn't meet it.
When you put both jobs in one prompt — "Write a great article" — the model compromises. It plays it safe because it's simultaneously trying to create and judge. The loop separates these into two agents with opposing mandates, and the tension between them produces better work than either could alone.
This is also why the rubric matters more than the generator prompt. You can swap generator models, change the writing style, target a different audience. The system still works as long as the evaluator knows what "good" looks like. The rubric is your taste, made executable.
The numbers
| | Manual | Harness |
|---|---|---|
| Time to first draft | 2-3 weeks | 8 minutes |
| Time to publishable | 6-11 weeks | 23 minutes |
| Cost | 47 hours @ your rate | $2.14 |
| Iterations | 3-5 rewrites, each painful | 3-4, each automatic |
| Quality floor | Variable (depends on your energy that day) | Consistent (rubric-enforced) |
The harness doesn't produce your best work. It produces work that consistently clears a quality bar you define, at a speed and cost that lets you publish weekly instead of monthly. That's not the same thing, and the distinction matters. Your best work still requires you. But your best work is 5% of what you publish. The other 95% just needs to be good, and it needs to exist.
The script is 40 lines. The prompts are markdown files you can edit in Notepad. Fork the repo, swap the rubric for your content type, and run it on the piece you've been putting off.