Me vs. the machine

2026-02-16

It has been pretty much a year since I started using Claude Code heavily for development tasks, and as I already documented, it changed my workflows significantly. It also seemed to improve productivity, but that impression was mostly based on gut feeling, which is about as reliable as my estimations[1]. So I figured it is about time to test my belly's decision-making in a scientifically absolutely bulletproof way[2].

The experiment nobody asked for#

There were a couple of questions that have been floating around in my mind for some time:

  • Is it really faster than writing code myself?
  • What is the quality difference?
  • What is the effective cost difference between manual and agentic coding?
  • How much would my process cost outside of a Claude Max plan?
  • Does a good feedback loop allow us to use a cheaper/less capable model?
  • Will I run a coding model in my basement in the future?

I very rarely build bouncing-ball simulations or Minecraft clones in my day job[3], so I needed a test case that is at least an inch closer to a real-world task. When I started working on a new "GraphQL-middleware plus React-UI" project, I saw my chance. The architecture is custom enough to see how well the different models handle a challenge, but close enough to their standard training data that they should get pretty far:

  • A custom GraphQL schema that unifies legacy APIs
  • A GraphQL server exposing that schema, resolvers are built using Effect
  • A React client application to interact with the whole thing

The fundamentals of the project already existed. This is no green-field experiment. The assignment was to add a user management page, based on a vibe-coded React mockup provided by our UX expert. Invite new users by email, change their role and remove them. Sounds easy, but as always the challenge is not the typing, but navigating the circumstances. On the one hand, the legacy API has a couple of blank spots, like a role property on the user object that is just a string and does not tell us which roles are available. On the other hand, the mockup is very aspirational, and contains a lot of features that are currently not even supported by the target service. I wanted to know how the agent would deal with these kinds of obstacles, which in reality are slightly more common than the contrived examples we find on YouTube whenever a new model comes around.

How to torture 6 AI agents (and myself)#

Partly for the sake of this study, I vibed up a little tool[4] that provides an agent-agnostic CLI, allowing me to use the same workflow with different agents:

  1. I create a TODO.md file at the repository root.
  2. I call the tool in plan mode. It will read my TODO-prompt, explore the project and flesh out a plan.
  3. I review the plan and provide feedback.
  4. The agent updates the plan and we iterate until it's done.
  5. I call the tool in build mode and it will iterate in a Ralph-Loop through all items.
  6. After each TODO-item, it will run the full test suite and feed errors back into the agent to fix them. Repeat until all pass.
  7. I come back to a green project that maybe does what I expect. Or not. Then I do explorative testing and code reviews.

I put quite extensive documentation on coding standards into the repository: clear guidelines on which libraries to use for which task, preferences on where typing had to happen, and which tests to write. There were also already Effect-based GraphQL resolvers implemented, but no forms yet (that's important for later).

Then, I ran the same workflow with the following agent configurations:

  • Opus 4.5 with Claude Code on Max Plan: The setup I primarily used for a long time. It serves on the one hand as the bar for the others to clear, and on the other hand as a way to find out how much Anthropic is subsidising it.
  • GPT Codex 5.2 with OpenCode on OpenCode Zen: Mainly to test the biggest competitor, since I have not been using Codex myself so far.
  • Opus 4.5 with OpenCode on amazee.ai: To get the "raw" price for calculating the difference to Claude Max.
  • Kimi 2.5 with OpenCode on OpenCode Zen: Open source model that could potentially run locally.
  • Minimax 2.1 with OpenCode on OpenCode Zen: Other open source model that could potentially run locally.
  • Mistral Devstral with OpenCode on La Plateforme: European option.
  • Philipp 43 with Neovim on Coffee: To establish a "gold standard" and compare all others cost-wise.

I gave each agent configuration a chance to solve the task, including 2-3 feedback loops if they didn't get everything right on the first shot. I did not do a detailed code review at this stage, though, so as not to skew the numbers towards my manual implementation, which I tackled right after[5].

After the manual coding sprint, I did a rough review of each implementation to get a feeling for it and forced Claude to do a detailed comparison of the different versions, with specific focus on the following metrics:

  • features implemented
  • test structure
  • library use
  • typing approach
  • optimistic ui patterns

Then I also asked Claude to estimate the remediation effort for each version to achieve architectural parity with my manual implementation. That one should be taken with a grain of salt, but it's an estimation, so it's probably as good as my own 🤷‍♂️.

The damage report#

You can look at the detailed report here, but I'll just summarize the most important findings.

Opus 4.5 with Claude Code on Max Plan#

Generated the biggest diff, but that - somewhat surprisingly - did not diminish the quality of the result too much. It came closest to the manual implementation, but failed to properly scope out a feature that the backend API does not support yet. It ate up around 20% of the weekly limit of my $100 Max plan, which comes down to roughly $4 of AI cost.

GPT Codex 5.2 with OpenCode on OpenCode Zen#

This was a fun one. Codex wrote an extremely short and hand-wavy plan (around 10% the length of the others), and the result stunned me at first. The UI was pixel-perfect, down to each little detail. And all the interactions worked. I even slacked a colleague that the race was over, but then I saw that there were no network requests 🤯 It had completely skipped the backend/API part and put everything into React state 🤣. But - after a polite nudge that this was not as production-ready as it claimed - it delivered a decent implementation of the backend as well. The total tracked token cost came down to $10.

Opus 4.5 with OpenCode on amazee.ai#

Outside of Claude Code, Anthropic's model seems to be just slightly less capable. The biggest mistake it made, compared to the "official" version, was using regexes for form validation instead of Zod. The total AWS cost added up to $20 though, which gives us an idea of how much Anthropic is accounting for as marketing expenses.

Kimi 2.5 with OpenCode on OpenCode Zen#

This was the star of the show in my opinion. Kimi delivered an almost Opus-like result, without being a proprietary, locked-down model. It was very slow compared to the other models, but since the whole point is that I'm not staring at it until it's done, I do not care whether it takes 15 or 45 minutes. The token cost for the task was $7, which is above the subsidised Claude Max plan, but even below Codex in raw token cost.

Minimax 2.1 with OpenCode on OpenCode Zen#

Minimax is way cheaper than Kimi, which is why I specifically wanted to pit the two against each other. Unfortunately, it was not able to live up to that potential. The implementation needed a lot more feedback and test-fix loops, which resulted in a $6 bill for a solution that in the end was not even 100% functional. A cheaper model does not automatically mean lower cost. Just like with humans 😈. But in the meantime, Minimax 2.5 was released and I have heard a lot of good things about it. So don't count them out.

Mistral Devstral with OpenCode on La Plateforme#

Devstral was unfortunately kind of a letdown. After spending $30 in tokens, the solution was still not even close to the competitors. So much for our European solution.

What I learned (besides that I'm expensive)#

A summary of the "estimated total cost"[6]:

|                   | Manual | Claude Max | GPT  | Claude AWS | Kimi | MiniMax | Mistral |
|-------------------|--------|------------|------|------------|------|---------|---------|
| Dev time          | 14h    | 2.5h       | 9.5h | 6.5h       | 8.5h | 14.5h   | 12h     |
| Dev cost          | $1,400 | $250       | $950 | $650       | $850 | $1,450  | $1,200  |
| AI cost (initial) | -      | $4         | $10  | $20        | $7   | $6      | $30     |
| AI cost (cleanup) | -      | ~$2        | ~$5  | ~$10       | ~$4  | ~$3     | ~$15    |
| Total             | $1,400 | $256       | $965 | $680      | $861 | $1,459  | $1,245  |
| vs Manual         | -      | 82%        | 31%  | 51%        | 39%  | -4%     | 11%     |

That both versions of Claude came out on top is probably no coincidence, since this was also the model doing the estimation (I should have anonymized the versions beforehand 🤦‍♂️). Also, the development hours estimated for remediation feel a little high to me, but I don't have a great track record of correct estimations either, so I'll just leave them as they are. Despite the fuzzy result it was a fun ride, and I took away a couple of learnings. Some were already expected - now at least somewhat proven - others surprised me. Let's return to the initial questions:

  • Is it really faster than writing code myself?
    Yes, it is.
  • What is the quality difference?
    With proper guardrails, the quality even tends to be better, since I have more time for polishing.
  • What is the effective cost difference between manual and agentic coding?
    Depending on the task it can be pretty drastic. Up to 80% is no joke.
  • How much would my process cost outside of a Claude Max plan?
    Five times as much, given I paid per token for the same model. But open source models change that equation.
  • Does a good feedback loop allow us to use a cheaper/less capable model?
    Unfortunately not. I theorized that - given there is a loop that just re-runs the agent until the result is good - we could trade longer execution times for smaller, less capable models. But in the end the token consumption increases so much that it isn't actually cheaper.
  • Will I run a coding model in my basement in the future?
    Potentially. My fingers are itching to order a Framework Desktop, but who knows what the world will look like in a year. And an investment with a projected three-year amortization period is just too long right now. We are living in crazy times.

But there is one more answer to a question I did not ask:

Skills and documentation are way less important than existing code. All models did a good job creating Effect-based GraphQL resolvers (which were already present in the codebase), while they also all completely ignored my documented instructions to use react-hook-form and Zod for form handling. This means that the real strength of agentic development is not in the shiny one-shot-vibe-prompt everybody seems to be chasing. Those might work, but without proper guidance the agent will spiral into a mess. Ironically we should do the exact opposite. Craft really good initial projects that check all boxes in terms of quality, so the genie has something worthy to replicate over and over. And that's what we will still need engineers for.