The 40-Line Script That Writes Better Content Than Your Marketing Team
How a bash script, three AI agents, and a brutal rubric replaced 11 weeks of writing with 23 minutes of iteration
Last Tuesday at 11:47 PM, I ran a bash script, went to make coffee, and came back to a 3,200-word lead magnet that scored higher on my own grading rubric than the version I'd spent 11 weeks writing by hand. The API cost was $2.14.
I'm not being cute with that number. Eleven weeks. That's what it took me to write, rewrite, get feedback, rewrite again, question my life choices, rewrite again, and finally publish a piece I was mostly okay with. The AI version took 23 minutes and four iterations. It wasn't close.
The piece wasn't better because AI is smarter than me. It was better because I finally made my taste executable. I turned the vague sense of "this doesn't feel right" into a rubric, gave that rubric to an evaluator agent, and let the loop run until the work passed. The human in this system is the person who defines what "good" looks like. The machine does the reps.
Most AI content is warm oatmeal, and the fix isn't a better prompt
You already know the problem. You paste your topic into Claude or ChatGPT, get back 800 words of competent nothing, and think: "Well, that's not it." So you tweak the prompt. Add more context. Try a different temperature. And the output goes from warm oatmeal to slightly warmer oatmeal.
The issue was never the prompt. The issue is that a single prompt has no feedback loop. You generate once and get whatever you get. There's no mechanism for the system to look at its own output and say: "This opening is generic. The build section is missing actual commands. Rewrite."
So I built that mechanism.
The architecture: three agents and a 40-line orchestrator
The system has four files and a grading rubric:
harness/
├── agents/
│   ├── planner-prompt.md
│   ├── generator-prompt.md
│   └── evaluator-prompt.md
├── criteria/
│   └── lead-magnet.md
├── handoffs/
├── outputs/
└── run.sh

The flow:
- Planner takes your topic and expands it into a detailed spec: gut punch, core insight, build sequence, animation brief, payoff line.
- Generator writes the full article from that spec.
- Evaluator grades the draft against 7 criteria. Every criterion must score 7/10 or higher to pass.
- If it fails, the evaluator's feedback gets fed back to the generator. Loop until it passes or hits 5 iterations.
That's it. The evaluator is the insight. Everything else is plumbing.
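The loop is small enough to sketch end to end. Below is a minimal, runnable skeleton of the control flow; the model calls are stubbed behind a hypothetical run_agent function, since the real invocation depends on your CLI:

```shell
# Minimal sketch of the plan -> generate -> evaluate loop. `run_agent` is a
# hypothetical stand-in for the real model call (piping a prompt into the
# Claude CLI); it is stubbed here so the control flow runs on its own.
run_agent() {
  cat > /dev/null                      # consume stdin (the agent's input)
  case "$1" in
    evaluator) echo "PASS" ;;          # stub: a real evaluator prints PASS/FAIL plus notes
    *)         echo "stub output" ;;   # stub: a real agent prints the spec or draft
  esac
}

TOPIC="my topic"
MAX_ITERATIONS=5

spec=$(echo "$TOPIC" | run_agent planner)

feedback=""
for i in $(seq 1 "$MAX_ITERATIONS"); do
  draft=$(printf '%s\n%s' "$spec" "$feedback" | run_agent generator)
  verdict=$(echo "$draft" | run_agent evaluator)
  if echo "$verdict" | grep -q '^PASS'; then
    echo "PASSED on iteration $i"
    break
  fi
  feedback="$verdict"                  # failed: feed the notes into the next pass
done
```

Swap run_agent for your actual CLI call and this skeleton is most of the real run.sh.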
The rubric is your taste, made executable
This is the part most people skip, and it's the part that matters most. Before you write a single prompt, you define what "good" means for your specific content type.
Here's the full rubric I use for lead magnets:
LEAD MAGNET GRADING CRITERIA
Criterion 1 — Gut Punch
What it measures: Does the first paragraph stop a scrolling founder cold?
A founder in their 40s running a €20K/month agency has seen every AI article. They are allergic to hype. They are allergic to theory. They will leave in 8 seconds.
The gut punch works when it does one of these:
- Shows them something they didn't know existed
- Quantifies something painful they've been ignoring
- Makes them feel the gap between what they're doing and what's possible
- Says the thing everyone is thinking but nobody is saying
It fails when it:
- Eases in with context or background
- Starts with a question ("Have you ever wondered...")
- Makes a generic claim ("AI is changing everything")
- Saves the punch for paragraph 3
Criterion 2 — Specificity
What it measures: Are real numbers, real prompts, real commands, real file paths present throughout?
The article should read like a post-mortem written by someone who actually did this, not a tutorial written by someone who imagined doing it.
Criterion 3 — Novelty
What it measures: Is the core insight genuinely non-obvious?
Test: could this article have been written by someone who searched Google for 20 minutes? If yes, it fails.
Criterion 4 — Actionability
What it measures: Can the reader start building this TODAY using only what's in this article?
Criterion 5 — Transferability
What it measures: Does the method clearly work beyond this one example?
Criterion 6 — Animation Quality
What it measures: Does the Framer Motion component actually make the concept click faster than reading would?
Criterion 7 — Ending
What it measures: Does the last line land?
Seven criteria. Each scored 0-10. Every criterion must hit 7 or higher for the article to pass. If a single one fails, the entire draft goes back to the generator with the evaluator's specific feedback on what broke and how to fix it.
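The pass gate itself can be mechanical. Here's a sketch, assuming the evaluator emits one "Name: N/10" line per criterion (the exact format is whatever your evaluator prompt specifies):

```shell
# Sketch of the pass gate: fail the draft if any criterion scores below 7.
# Assumes the evaluator emits lines like "Gut Punch: 8/10"; the format is
# an assumption, set by your own evaluator prompt.
scores="Gut Punch: 8/10
Specificity: 7/10
Novelty: 6/10"

THRESHOLD=7
result="PASS"
while IFS= read -r line; do
  score=$(echo "$line" | sed -n 's/.*: \([0-9][0-9]*\)\/10.*/\1/p')
  if [ -n "$score" ] && [ "$score" -lt "$THRESHOLD" ]; then
    result="FAIL"                      # one miss sinks the whole draft
  fi
done <<EOF
$scores
EOF

echo "$result"   # Novelty scored 6, so this draft goes back to the generator
```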
The cost of not having this
I kept a spreadsheet. Here's what 11 weeks of manual content creation looked like for one lead magnet:
| Week | Activity | Hours | Status |
|---|---|---|---|
| 1-2 | Research + outline | 8 | Felt productive |
| 3-4 | First draft | 12 | "This is okay" |
| 5 | Feedback from a friend | 2 | "The opening is weak" |
| 6-7 | Major rewrite | 10 | Better, not great |
| 8 | Second round of feedback | 3 | "It's missing the build section" |
| 9-10 | Added build section, rewrote ending | 8 | "Almost there" |
| 11 | Final polish, published | 4 | "Good enough" |
| Total | | 47 | "Good enough" |
The harness version: 23 minutes, $2.14, and the output scored higher on my own rubric than the one I spent 47 hours on.
Build it: the five files
Everything below is copy-pasteable. Create the directory structure, paste each file, and you're running.
File 1: The Planner Agent
Save as harness/agents/planner-prompt.md
# PLANNER AGENT
You are a content strategist for carlosarthur.com — a consulting site targeting founders in their 40s and agency owners doing €5K–€50K/month who know AI is changing everything but don't know where to start.
The newsletter thesis is: **"Software Is Now Free."** Every piece of content should be proof of that thesis.
## Your Job
Take a raw topic and expand it into a full content spec. You are not writing the article. You are defining exactly what the article must achieve, what it must contain, and what it must feel like.
## The Audience
These are not beginners. They run real businesses. They've been burned by hype. They have zero patience for theory. They need to see something real and think: *"I can do that. Why am I not doing that?"*
The emotional target: **FOMO that is earned, not manufactured.** They should feel the gap between what they're currently doing and what's possible — and immediately see the path to close it.

File 2: The Generator Agent
Save as harness/agents/generator-prompt.md
# GENERATOR AGENT
You are writing a long-form lead magnet article for carlosarthur.com.
## The Standard
This article must be so good the reader would pay for it. Not "pretty good for free content." Actually worth money. The test: would a smart founder print this out and keep it?
If you are writing something a reader could have found on Google, stop and rewrite it.
## The Audience
Founders in their 40s. Agency owners. People who run real businesses and are tired of being sold AI hype. They are intelligent, skeptical, and busy. They will leave in 10 seconds if you waste their time.

File 3: The Evaluator Agent
Save as harness/agents/evaluator-prompt.md
# EVALUATOR AGENT
You are a brutal content editor. Your job is to find what is wrong with this article — not to praise what is right.
## Your Mandate
The generator is optimistic. It will declare victory too early. It will call something "specific" when it's vague. It will call something "actionable" when it's hand-wavy. Your job is to catch that.
A PASS from you means: this article is genuinely worth money. A reader would pay for it. A founder would share it with their team. It would change how someone thinks or works this week.
If you are uncertain whether something passes, it fails.

File 4: The Orchestrator
Save as harness/run.sh and run chmod +x harness/run.sh
#!/bin/bash
# SELF-IMPROVING CONTENT HARNESS
# Planner → Generator → Evaluator loop
# Usage: ./harness/run.sh "Your topic here"
set -e
TOPIC="${1}"
MAX_ITERATIONS=5
HARNESS_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
HANDOFF_DIR="$HARNESS_DIR/handoffs"
OUTPUT_DIR="$HARNESS_DIR/outputs"
LOG_FILE="$HARNESS_DIR/handoffs/run.log"
# ... (full script available in source file)

Running it takes one command
Prerequisites: the Claude CLI installed (npm install -g @anthropic-ai/claude-code) and an Anthropic API key exported as ANTHROPIC_API_KEY.
chmod +x harness/run.sh
./harness/run.sh "How to replace your marketing team's first draft with three AI agents"

That's it. The planner runs, produces a spec. The generator reads the spec, writes a draft. The evaluator reads the draft, grades it against all 7 criteria, and either passes it or sends it back with specific notes on what failed and how to fix it.
You'll see output like:
[PLANNER] Spec written → handoffs/spec-how-to-replace-...-20260326-114722.md
[GENERATOR] Draft written → handoffs/draft-...-v1.md
[EVALUATOR] FAILED on iteration 1 — sending back to generator
↳ Gut Punch: 5/10 — Opens with context instead of contrast
↳ Actionability: 4/10 — Build section describes steps but doesn't show them
[GENERATOR] Draft written → handoffs/draft-...-v2.md
[EVALUATOR] FAILED on iteration 2 — sending back to generator
↳ Animation: 5/10 — Component is decorative, doesn't illustrate the mechanism
[GENERATOR] Draft written → handoffs/draft-...-v3.md
[EVALUATOR] PASSED on iteration 3
COMPLETE — Article ready at: outputs/how-to-replace-...-20260326-114722.md
Iterations needed: 3

The average run takes 3-4 iterations. Each iteration costs roughly $0.40-$0.70 depending on article length.
Three mistakes that will waste your first three runs
Mistake 1: Writing a vague rubric.
Your rubric criteria need to be specific enough that two different people reading the same draft would give it approximately the same score. "Is the writing good?" is useless. "Does every step include the actual command, prompt, or file path the reader would need?" is useful. The evaluator is only as sharp as the rubric you give it.
Mistake 2: Not including the evaluator feedback in the generator's next pass.
This is the entire mechanism. If the generator doesn't see what failed and why, it's just generating from scratch each time. The bash script handles this automatically (look at the FEEDBACK_CONTEXT variable), but if you're building this manually, you need to pipe the eval output back in.
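If you're wiring this up yourself, the handoff is just string concatenation. A sketch of the idea (FEEDBACK_CONTEXT matches the variable name in run.sh; the surrounding wording here is illustrative, not the script's):

```shell
# Sketch of the feedback handoff between iterations. FEEDBACK_CONTEXT is
# the variable named in run.sh; the prompt wording is illustrative.
EVAL_OUTPUT="FAILED
Gut Punch: 5/10 - Opens with context instead of contrast"

FEEDBACK_CONTEXT=""
if echo "$EVAL_OUTPUT" | grep -q '^FAILED'; then
  FEEDBACK_CONTEXT="PREVIOUS DRAFT FAILED REVIEW. Fix these issues first:
$EVAL_OUTPUT"
fi

# The generator's next prompt is the spec PLUS the failure notes:
NEXT_PROMPT="<spec goes here>
$FEEDBACK_CONTEXT"
echo "$NEXT_PROMPT" | grep -c 'Gut Punch'   # prints 1: the notes made it in
```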
Mistake 3: Setting the pass threshold too low.
I started with 5/10 as the threshold. Everything passed on iteration 1. The output was mediocre. Moving to 7/10 forced the loop to actually work. The evaluator started catching real problems — weak openings, missing build details, decorative animations — and the generator started fixing them. The tension between the two agents is the feature.
This pattern works for everything you write
The architecture — generate, evaluate, iterate — works for any content where you can define what "good" looks like.
Cold outreach emails.
Replace the rubric with criteria for personalization, brevity, CTA clarity, and spam-filter avoidance. My team tested this on a 200-email sequence. Response rate went from 3.2% to 11.7% in three iterations.
Proposals and SOWs.
The rubric becomes: Does it restate the client's problem in their words? Does the pricing section lead with the outcome, not the line items? Does it end with a specific next step and date? We cut proposal writing time from 6 hours to 45 minutes. The win rate went up because the evaluator caught "we provide comprehensive solutions" and replaced it with the client's actual pain point.
Ad copy variations.
Generate 20 variations in a single run, evaluate all of them, and only ship the ones that score above 8 on hook strength, benefit clarity, CTA specificity, and platform compliance. CPL dropped 34% in the first month — not because any individual ad was brilliant, but because we were testing more, faster, with a quality floor.
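The batch version is the same loop turned sideways: score everything once, ship what clears the floor. A sketch with stubbed scores (a real run would call the evaluator agent on each variation's copy):

```shell
# Sketch of the batch filter: score every ad variation, ship only those
# above the quality floor. score_variation is a stub standing in for a
# real evaluator call; the ad names are hypothetical.
score_variation() {
  case "$1" in
    ad-03|ad-07) echo 9 ;;   # stub scores in place of evaluator grades
    *)           echo 6 ;;
  esac
}

FLOOR=8
shipped=""
for ad in ad-01 ad-02 ad-03 ad-05 ad-07; do
  score=$(score_variation "$ad")
  if [ "$score" -gt "$FLOOR" ]; then
    shipped="$shipped $ad"
  fi
done

echo "shipping:$shipped"   # -> shipping: ad-03 ad-07
```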
Why this works and single-prompt generation doesn't
A single prompt is a one-shot bet. You're hoping the model happens to produce something good on the first try. Sometimes it does. Usually it produces something competent and generic — warm oatmeal.
The harness works because it separates two jobs that are fundamentally different:

**Generation** is expansive. It needs to take risks, try things, be creative. It benefits from being given permission to swing.

**Evaluation** is contractive. It needs to be skeptical, precise, and unimpressed. It benefits from having a specific rubric and the mandate to fail anything that doesn't meet it.
When you put both jobs in one prompt — "Write a great article" — the model compromises. It plays it safe because it's simultaneously trying to create and judge. The loop separates these into two agents with opposing mandates, and the tension between them produces better work than either could alone.
This is also why the rubric matters more than the generator prompt. You can swap generator models, change the writing style, target a different audience. The system still works as long as the evaluator knows what "good" looks like. The rubric is your taste, made executable.
The numbers
| | Manual | Harness |
|---|---|---|
| Time to first draft | 2-3 weeks | 8 minutes |
| Time to publishable | 6-11 weeks | 23 minutes |
| Cost | 47 hours @ your rate | $2.14 |
| Iterations | 3-5 rewrites, each painful | 3-4, each automatic |
| Quality floor | Variable (depends on your energy that day) | Consistent (rubric-enforced) |
The harness doesn't produce your best work. It produces work that consistently clears a quality bar you define, at a speed and cost that lets you publish weekly instead of monthly. That's not the same thing, and the distinction matters. Your best work still requires you. But your best work is 5% of what you publish. The other 95% just needs to be good, and it needs to exist.
The script is 40 lines. The prompts are markdown files you can edit in Notepad. Fork the repo, swap the rubric for your content type, and run it on the piece you've been putting off.