# Me vs. the machine
It has been pretty much a year since I started using Claude Code heavily for development tasks, and as I already documented, it changed my workflows significantly. It did improve productivity, but that judgment was mostly based on gut feeling, which is about as reliable as my estimations[1]. So I figured it was about time to test my belly's decision making in a scientifically absolutely bulletproof way[2].
## The experiment nobody asked for
There were a couple of questions that have been floating around in my mind for some time:
I very rarely build bouncing-ball simulations or Minecraft clones in my day job[3], so I needed a test case that is an inch closer to a real-world task. When I started working on a new "GraphQL-middleware plus React-UI" project, I saw my chance. The architecture is custom enough to see how well the different models handle a challenge, but close enough to their standard training data that they should get pretty far:
The fundamentals of the project already existed; this is no green-field experiment. The assignment was to add a user management page, based on a vibe-coded React mockup provided by our UX expert: invite new users by email, change their role, and remove them. Sounds easy, but as always the challenge is not the typing, but navigating the circumstances. On the one hand, the legacy API has a couple of blank spots, like a role property on the user object that is just a string and does not tell us which roles are available. On the other hand, the mockup is very aspirational and contains a lot of features that are not even supported by the target service yet. I wanted to know how the agent would deal with these kinds of obstacles, which in reality are slightly more common than the contrived examples we find on YouTube whenever a new model comes around.
## How to torture 6 AI agents (and myself)
Partially for the sake of this study, I vibed up a little tool[4] that provides an agent-agnostic CLI, allowing me to run the same workflow with different agents:
I put quite extensive documentation on coding standards into the repository: clear guidelines on which libraries to use for which task, preferences on where typing had to happen, and which tests to write. There were also already Effect-based GraphQL resolvers implemented, but no forms yet (that's important for later).
Then, I ran the same workflow with the following agent configurations:
I gave each agent configuration a chance to solve the task, including 2-3 feedback loops if they didn't get everything right on the first shot. I did not do a detailed code review at this stage, though, so as not to skew the numbers towards the manual implementation that I tackled right afterwards[5].
After the manual coding sprint, I did a rough review of each implementation to get a feeling for it and forced Claude to do a detailed comparison of the different versions, with specific focus on the following metrics:
Then I also asked Claude to estimate the remediation effort for each version to achieve architectural parity with my manual implementation. That one should be taken with a grain of salt, but it's an estimation, so it's probably as good as my own 🤷♂️.
## The damage report
You can look at the detailed report here, but I'll just summarize the most important findings.
### Opus 4.5 with Claude Code on Max Plan
It generated the biggest diff, but - somewhat surprisingly - that did not diminish the quality of the result too much. It came closest to the manual implementation, but failed to properly scope out a feature that the backend API does not support yet. It ate up around 20% of the weekly limit of my $100 Max plan, which comes down to roughly $4 of AI cost.
### GPT Codex 5.2 with OpenCode on OpenCode Zen
This was a fun one. Codex wrote an extremely short and hand-wavy plan (around 10% of the others') and the result stunned me at first. The UI was pixel-perfect, down to every little detail, and all the interactions worked. I even Slacked a colleague that the race was over - but then I saw that there were no network requests 🤯 It had completely skipped the backend/API part and put everything into React state 🤣. But after a polite nudge that this was not as production-ready as it claimed, it delivered a decent implementation of the backend as well. The total tracked token cost came down to $10.
### Opus 4.5 with OpenCode on amazee.ai
Outside of Claude Code, Anthropic's model seems to be just slightly less capable. The biggest mistake it made, compared to the "official" version, was to use regexes for form validation instead of Zod. The total AWS cost amounted to $20 though, which gives us an idea of how big a marketing expense Anthropic is booking for its own plans.
### Kimi 2.5 with OpenCode on OpenCode Zen
This was the star of the show, in my opinion. Kimi delivered an almost Opus-like result, without being a proprietary, locked-down model. It was very slow compared to the other models, but since the whole point is that I'm not staring at it until it's done, I don't care whether it takes 15 or 45 minutes. The token cost for the task was $7, which is above the subsidised Claude Max plan, but below even Codex in raw token cost.
### Minimax 2.1 with OpenCode on OpenCode Zen
Minimax is way cheaper than Kimi, which is exactly why I wanted to pitch the two against each other. Unfortunately, it was not able to live up to that potential. The implementation needed a lot more feedback and test-fix loops, which resulted in a $6 bill for a solution that in the end was not even 100% functional. A cheaper model does not automatically mean lower cost - just like with humans 😈. In the meantime, however, Minimax 2.5 was released and I have heard a lot of good things about it, so don't count them out.
### Mistral Devstral with OpenCode on La Plateforme
Devstral was unfortunately kind of a letdown. After spending $30 in tokens, the solution was still not even close to the competitors'. So much for our European option.
## What I learned (besides that I'm expensive)
A summary of the "estimated total cost"[6]:
| | Manual | Claude Max | GPT | Claude AWS | Kimi | MiniMax | Mistral |
|---|---|---|---|---|---|---|---|
| Dev time | 14h | 2.5h | 9.5h | 6.5h | 8.5h | 14.5h | 12h |
| Dev cost | $1,400 | $250 | $950 | $650 | $850 | $1,450 | $1,200 |
| AI cost (initial) | - | $4 | $10 | $20 | $7 | $6 | $30 |
| AI cost (cleanup) | - | ~$2 | ~$5 | ~$10 | ~$4 | ~$3 | ~$15 |
| Total | $1,400 | $256 | $965 | $680 | $861 | $1,459 | $1,245 |
| Savings vs. manual | - | 82% | 31% | 51% | 39% | -4% | 11% |
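The totals above are simple back-of-the-envelope arithmetic. A minimal TypeScript sketch of the calculation, using the $100/hour rate from footnote 6 (the `Run` shape and variable names are mine, not from any real tooling):

```typescript
// Cost model behind the comparison table.
// $100/hour is the dev rate Claude assumed (see footnote 6).
const HOURLY_RATE = 100;

interface Run {
  devHours: number;  // hands-on dev time, including remediation
  aiInitial: number; // tracked token cost of the first pass, USD
  aiCleanup: number; // estimated cleanup token cost, USD
}

function totalCost(run: Run): number {
  return run.devHours * HOURLY_RATE + run.aiInitial + run.aiCleanup;
}

// Savings relative to the fully manual implementation, in percent.
function savingsVsManual(run: Run, manualCost: number): number {
  return Math.round((1 - totalCost(run) / manualCost) * 100);
}

const manual = 14 * HOURLY_RATE; // $1,400
const claudeMax: Run = { devHours: 2.5, aiInitial: 4, aiCleanup: 2 };

console.log(totalCost(claudeMax));               // 256
console.log(savingsVsManual(claudeMax, manual)); // 82
```

As footnote 6 notes, the inference cost barely moves the needle here: the dev-hours term dominates the total at any plausible hourly rate.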
That both versions of Claude came out on top is probably no coincidence, since Claude was also the model doing the estimation (I should have anonymized the versions beforehand 🤦♂️). The development hours for remediation also feel a little high to me, but I don't have a great track record of correct estimations either, so I'll just leave them as they are. Despite the fuzzy result it was a fun ride, and I took away a couple of learnings. Some were already expected - now at least somewhat proven - others surprised me. Let's return to the initial questions:
But there is one more answer to a question I did not ask:
Skills and documentation are way less important than existing code. All models did a good job creating Effect-based GraphQL resolvers (which were already present in the codebase), while they all completely ignored my documented instructions to use react-hook-form and Zod for form handling. This means that the real strength of agentic development is not the shiny one-shot vibe prompt everybody seems to be chasing. Those might work, but without proper guidance the agent will spiral into a mess. Ironically, we should do the exact opposite: craft really good initial projects that check all the boxes in terms of quality, so the genie has something worth replicating over and over. And that's what we will still need engineers for.
## Footnotes
1. Which are not and never will be. [↩]
2. With the statistically highly significant control group of me, myself and I. [↩]
3. If I had a night job other than dealing with insomniac infants, it wouldn't be that either. [↩]
4. Which turned out to be very useful outside the study as well. Once polished and publishable, it will get its own blog post. [↩]
5. It's always easier when you have seen every mistake first. The classic "We rewrote our app in [hot new framework] and it's much better!" bias. [↩]
6. The $100 per development hour is a value Claude came up with itself, and its correctness depends on the viewpoint. But it is clear that the inference cost is largely irrelevant in this calculation, even if the hourly rate is significantly lower. [↩]