I Blind-Tested 3 AI Models on My Own Client Documents. My Favorite Lost.

Key Takeaway: "Felt smarter" is brand bias. Public leaderboards rank models on other people's prompts; the only test that matters runs on your own documents with the labels hidden and the verdict rule locked before you look. When the labels came off mine, the expensive default had lost 77% of the picks, and the test cost about $2 to run.

Hi Reader,

For the past year I've paid for the most expensive AI model on the market and used it for everything.

I never tested that choice. It felt smarter, so it stayed.

Then I put it on trial, against my own client documents, with the labels hidden.

I picked against it 77% of the time.

How the default formed

The expensive model writes beautifully. Give it a messy board transcript and it returns prose you'd sign your name under. So every task went to it: meeting summaries, email analysis, client briefs, all of it.

Somewhere along the way, a habit dressed itself up as a decision.

I was choosing a model the way most people choose wine. By the label, and by the price.

What forced the test

Last month I wrote to you that the model is a commodity and the archive is the moat.

A careful reader could have asked one uncomfortable question: did you ever test that?

I hadn't. So I built the test.

The design

I pulled 22 real documents from my archive. Seven board and CFO meeting transcripts. Eight client email threads. The remaining seven were a mix, including the four documents I'd rank as the hardest in the set: long, messy, multilingual, full of half-finished decisions.

Every document was anonymized with a placeholder map before any model saw it.
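For readers who want to try this step, here is a minimal sketch of a placeholder map in Python. It is an illustration under stated assumptions, not the tooling used for this test: the function names, the placeholder format, and every name in the sample map are hypothetical.

```python
import re

def anonymize(text, placeholder_map):
    """Replace every sensitive term in text with its placeholder."""
    if not placeholder_map:
        return text
    # Longest terms first, so "Acme Holdings" is matched before "Acme".
    terms = sorted(placeholder_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    return pattern.sub(lambda match: placeholder_map[match.group(0)], text)

def deanonymize(text, placeholder_map):
    """Restore the original terms by applying the map in reverse."""
    reverse_map = {placeholder: term for term, placeholder in placeholder_map.items()}
    return anonymize(text, reverse_map)

# Hypothetical map: real client names on the left, placeholders on the right.
placeholder_map = {
    "Acme Holdings": "[CLIENT_1]",
    "Acme": "[CLIENT_1_SHORT]",
    "Jane Doe": "[PERSON_1]",
}

sample = "Jane Doe confirmed that Acme Holdings will pay Acme's invoice."
masked = anonymize(sample, placeholder_map)
print(masked)
```

Keeping the map in one place means the same substitution can be reversed later, so a model's answer can be read with the real names restored.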

Three models took the test. My expensive default. Its mid-tier sibling at a fraction of the price. And a wildcard model that costs less than half a cent per document.

Each document went through all three. The outputs came back labeled A, B and C, shuffled differently for every document, so I couldn't learn the pattern.
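The per-document shuffle can be sketched in a few lines of Python. This is one possible way to do it, not necessarily how this test was run; the model identifiers, the document ids, and the seed are assumptions made for the example.

```python
import random

MODELS = ["default", "mid_tier", "wildcard"]  # hypothetical identifiers
LABELS = ["A", "B", "C"]

def blind_labels(doc_ids, seed=None):
    """Return {doc_id: {label: model}} with a fresh shuffle for every document."""
    rng = random.Random(seed)
    key = {}
    for doc_id in doc_ids:
        order = MODELS[:]      # copy, so the master list is never reordered
        rng.shuffle(order)     # independent shuffle per document
        key[doc_id] = dict(zip(LABELS, order))
    return key

# 22 hypothetical document ids: doc_01 ... doc_22
doc_ids = [f"doc_{i:02d}" for i in range(1, 23)]
key = blind_labels(doc_ids, seed=2024)
```

The key is the only place where labels and models meet. Save it to a file and leave that file closed until every document is scored.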

And before reading a single output, I wrote down the verdict rule. What "winning" would mean, in numbers, locked in advance. No moving the goalposts after seeing who scored.
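To make "locked in advance" concrete, here is a hedged sketch of a verdict rule written as code before any output is read. The rule used in this test is not reproduced here; the two thresholds, the field names, and the example inputs below are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the thresholds cannot be edited after creation
class VerdictRule:
    min_best_pick_lead: int     # challenger best picks minus default best picks
    min_acceptable_rate: float  # share of documents where the challenger was acceptable

def verdict(rule, default_best, challenger_best, challenger_acceptable, total_docs):
    """Return 'switch' only if the challenger clears both thresholds."""
    lead = challenger_best - default_best
    acceptable_rate = challenger_acceptable / total_docs
    if lead >= rule.min_best_pick_lead and acceptable_rate >= rule.min_acceptable_rate:
        return "switch"
    return "stay"

# Hypothetical thresholds, written down before reading anything.
RULE = VerdictRule(min_best_pick_lead=3, min_acceptable_rate=0.9)

# Made-up counts, only to show the call.
example = verdict(RULE, default_best=4, challenger_best=10,
                  challenger_acceptable=19, total_docs=20)
print(example)
```

Writing the rule as a function has one advantage over a sentence in a notebook: the counts go in, a word comes out, and there is no room to reinterpret it afterwards.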

Reading blind

Then I read. Sixty-six outputs over a few sittings, picking the best answer for each document and marking which others were acceptable.
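A scorecard for this kind of reading can be very small. The sketch below assumes one row per document with a best label and a list of other acceptable labels, plus a label key of the form {doc_id: {label: model}}. The row layout, the toy data, and the choice to count a best pick as acceptable too are assumptions for illustration.

```python
from collections import Counter

def unblind(scorecard, key):
    """Translate blind labels back to models and count best and acceptable picks."""
    best = Counter()
    acceptable = Counter()
    for doc_id, row in scorecard.items():
        labels = key[doc_id]
        winner = labels[row["best"]]
        best[winner] += 1
        acceptable[winner] += 1  # assumption: the best pick is also acceptable
        for label in row["also_ok"]:
            acceptable[labels[label]] += 1
    return best, acceptable

# Toy data for two documents, not results from the real test.
key = {
    "doc_01": {"A": "mid_tier", "B": "default", "C": "wildcard"},
    "doc_02": {"A": "wildcard", "B": "mid_tier", "C": "default"},
}
scorecard = {
    "doc_01": {"best": "A", "also_ok": ["B"]},
    "doc_02": {"best": "B", "also_ok": []},
}

best, acceptable = unblind(scorecard, key)
print(best["mid_tier"], acceptable["default"])
```

The scorecard holds only labels, so it can be filled in without ever seeing a model name.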

Some answers felt premium. Confident, polished, expensive-sounding. I caught myself thinking "that's my model" more than once.

I was wrong about half the time I thought that.

The transcripts

The board transcripts were where the test earned its keep.

A two-hour meeting transcript is mostly noise. Somewhere inside it are three numbers that matter and one decision nobody wrote down. The model's whole job is to hear the thing in the transcript that nobody else is listening for.

That's what I scored. Which model heard it.

The labels come off

The mid-tier sibling took 13 of the 22 best picks. The expensive default took 5. The wildcard took 4.

On the four hardest documents, the mid-tier model went four for four.

The verdict rule I'd locked before reading a single output returned one word: switch.

Names, since I'm asking you for yours below: the default was Claude Opus. The sibling that beat it is Sonnet. The wildcard was DeepSeek.
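The 77% figure follows directly from the three counts above. A short check in Python, using only the numbers reported in this section (the dictionary keys are just labels chosen for the example):

```python
# Best-pick counts as reported above.
best_picks = {"Opus": 5, "Sonnet": 13, "DeepSeek": 4}

total = sum(best_picks.values())            # 22 documents
against_default = total - best_picks["Opus"]  # picks that went to another model
share_against = against_default / total

print(f"{against_default}/{total} = {share_against:.0%}")  # prints "17/22 = 77%"
```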

Why your benchmark is lying to you

Public leaderboards rank models on other people's prompts, averaged across tasks that aren't yours. They measure a model's talent for the average question. You don't pay for average questions. You pay for your documents.

Brand bias does the rest. The expensive model sounds smarter, so you never check whether it is. The polish is doing the work the accuracy should be doing.

The fix is structural, and it's old. Wine people solved it decades ago: hide the labels, decide the criteria first, then taste.

A question for you

Which model do you default to? And when did you last test it against your own work, blind?

Reply with your default. I read every reply.

Below the line

The exact test design, ready to copy: the six steps, the pre-registered verdict rule verbatim, the scorecard, and what the whole thing cost to run. Spoiler on that last one: about $2.

Get the full test design →

Alex the CFO