AI Agent Case Study

Qualitative comparison of 6 AI-agent implementations against the manual (human) gold standard. All implement the same feature: a team management page at /ai/team backed by an OpenAPI-based backend.

Overview

	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
Commits	36	40	14	18	20	30	16
Lines changed	+2072/−857	+6182/−18	+1690/−14	+2592/−12	+3313/−28	+3061/−24	+2014/−10
Files changed	37	44	12	30	34	33	22
TeamPage.tsx lines	867	389	1045	610	414	708	414
Test files (.test)	8	18	12	15	14	21	10
Story files	13	19	13	13	17	13	13
E2E specs	2	2	2	6	3	3	3
Escape hatches	16	34	25	25	36	28	25
AI cost	–	$4	$10	$20	$7	$6	$30
Dev time	14h	2–3h	8–11h	5–8h	7–10h	12–17h	10–14h

1. Features Implemented

All experiments implement the core CRUD: list members, invite (create user), change role, remove user. Differences emerge in scope and detail.

Feature	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
List members	Y	Y	Y	Y	Y	Y	Y
Invite member	Y	Y	Y	Y	Y	Y	Y
Change role	Y	Y	Y	Y	Y	Y	Y
Remove member	Y	Y	Y	Y	Y	Y	Y
Current user indicator	Y	Y	Y	Y	TODO	Y	Y
Self-removal prevention	Y	Y	Y	–	–	–	–
Last admin protection	–	–	Y	–	–	–	–
Manage user limits	–	Y	–	–	–	–	–
Delete user account	–	–	–	–	–	Y	–
Role permission warnings	Y	Y	Y	–	–	–	–
Form validation	zod	zod	regex	regex	regex	regex	regex
Missing API doc	Y	Y	Y	–	–	–	Y
Loading skeletons	Y	Y	Y	Y	Y	Y	Y
Empty state	Y	Y	Y	Y	Y	Y	Y
Toast notifications	Y	Y	Y	Y	Y	Y	Y

Observations

Claude Max over-delivered by implementing user limits management (a feature not present in the task spec's API, using mock data). Scope creep but shows initiative.
GPT uniquely implemented last-admin protection (preventing removal of the only admin).
Kimi left isCurrentUser as a TODO with a hardcoded false, missing a key feature.
MiniMax added a "delete user account" mutation beyond what was asked.
Only Manual, Claude Max, and GPT implemented role permission impact warnings in the change role modal.
Only Manual and Claude Max used Zod for form validation; all others used raw regex.
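
The self-removal and last-admin guards mentioned above can be sketched as a single pure function. This is a minimal illustration, not code from any of the implementations: the Member shape and the canRemoveMember name are hypothetical, and the role names are taken from the generated schema casing quoted later in this document.

```typescript
// Hypothetical shapes for illustration; the real types come from generated code.
type Role = 'Admin' | 'KeyCreator' | 'ReadOnly';

interface Member {
  id: number;
  role: Role;
}

type RemovalCheck =
  | { allowed: true }
  | { allowed: false; reason: 'self-removal' | 'last-admin' };

// Decides whether `targetId` may be removed by `currentUserId`.
function canRemoveMember(
  members: readonly Member[],
  currentUserId: number,
  targetId: number,
): RemovalCheck {
  // Self-removal prevention: a user cannot remove themselves.
  if (targetId === currentUserId) {
    return { allowed: false, reason: 'self-removal' };
  }
  const target = members.find((m) => m.id === targetId);
  const adminCount = members.filter((m) => m.role === 'Admin').length;
  // Last-admin protection: never remove the only remaining admin.
  if (target?.role === 'Admin' && adminCount <= 1) {
    return { allowed: false, reason: 'last-admin' };
  }
  return { allowed: true };
}
```

Keeping the rule in a pure function means the same check could disable the remove button in the UI and be covered by a plain unit test.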

2. Test Structure

Aspect	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
Resolver unit tests	Y (Effect)	Y (Effect)	Y (Effect)	Y (Effect)	Y (Effect)	Y (Effect)	Y (Effect)
Component unit tests	–	Y (hook)	Y (optimistic)	Y (visual)	–	Y (matrix)	Y
Storybook play tests	Y (25+)	Y (extensive)	Y (3)	Y	Y (13)	Y	Y
E2E (Playwright)	2 specs	2 specs	2 specs	6 specs	3 specs	3 specs	3 specs
Deferred promise pattern	–	–	Y	–	–	–	–
Test isolation	Effect Ref	Effect Ref	mock	Effect Ref	Effect Ref	Effect Ref	Effect Ref

Observations

Manual relies heavily on Storybook play tests (25+) as the primary UI test layer, with resolver unit tests via Effect. No separate component unit tests.
Claude Max has the second-highest test file count (18, behind MiniMax's 21) and added dedicated tests for the custom useOptimisticMutation hook (10+ tests). It also has the most story files (19).
GPT used an interesting deferred promise pattern in component tests to control async timing of mutations.
Claude AWS stands out with 6 E2E specs covering add/remove/update-role flows with error scenarios. Also has visual regression tests.
MiniMax has the most test files overall (21) but includes unusual "feature matrix" and "gap analysis" tests that verify prototype requirements rather than behavior.
Kimi and Mistral have adequate but less distinctive test coverage.
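
For readers unfamiliar with the deferred promise pattern attributed to GPT above, here is a minimal sketch of the general technique. It is not GPT's actual test code: createDeferred and runMutation are hypothetical names, and the log array stands in for whatever a component would render.

```typescript
// A deferred exposes resolve/reject so the test decides when the promise settles.
interface Deferred<T> {
  promise: Promise<T>;
  resolve: (value: T) => void;
  reject: (reason: Error) => void;
}

function createDeferred<T>(): Deferred<T> {
  let resolve!: (value: T) => void;
  let reject!: (reason: Error) => void;
  const promise = new Promise<T>((res, rej) => {
    resolve = res;
    reject = rej;
  });
  return { promise, resolve, reject };
}

// Hypothetical mutation wrapper that records the states a UI would show.
async function runMutation(
  mutate: () => Promise<string>,
  log: string[],
): Promise<void> {
  log.push('pending');
  try {
    const result = await mutate();
    log.push(`success:${result}`);
  } catch {
    log.push('rolled-back');
  }
}
```

A test can start runMutation with a deferred's promise, assert on the intermediate 'pending' state, then call resolve or reject and assert on the final state, without relying on timers.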

3. Library Use

All experiments share the same base stack: React 19, TanStack Query, Effect, Waku, @base-ui/react, Tailwind CSS, lucide-react, @amazeelabs/codegen-operation-ids.

Library Choice	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
Form handling	react-hook-form + zod	useState	useState	useState	useState	useState	useState
Form validation	@hookform + zod	zod (manual)	regex	regex	regex	regex	regex
Optimistic UI	RQ cache	custom hook	local state	RQ cache	RQ cache	local state	local state
State management	RQ cache	RQ cache	local state	RQ cache	RQ cache	local + record	local + record
Additional deps	–	–	–	custom atoms	–	zustand (unused)	–

Observations

Manual is the only implementation using react-hook-form with zod for schema-validated forms. Provides proper form state management (dirty tracking, field-level errors, reset behavior) vs the raw useState approach used by all AI agents.
Claude Max created a reusable useOptimisticMutation hook (120 lines) with generics, abstracting the optimistic update pattern. Well-engineered but only used in this one feature.
GPT, MiniMax, and Mistral used local component state for optimistic updates rather than React Query cache manipulation. This bypasses React Query's cache invalidation guarantees.
MiniMax installed zustand as a dependency but never used it.
MiniMax and Mistral used hardcoded setTimeout(1000) delays to simulate mutations instead of actual API integration in some flows.

4. Typing Completeness

All experiments use the same strict TypeScript config: strict: true, noUncheckedIndexedAccess: true, exactOptionalPropertyTypes: true.

Aspect	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
`any` in user code	0	0	0	0	0	0	0
`@ts-ignore`	0	0	0	0	0	0	0
`@ts-expect-error`	0	2	0	0	0	2	0
Type assertions	0	0	1	3	2	0	0
Branded types	Y	Y	Y	Y	Y	Y	Y
Zod validation	Y	form only	–	–	–	–	–

Observations

All experiments achieve strong type safety. The strict TSConfig was provided as a baseline.
Manual has zero escape hatches in business logic. All suppressions are at external library boundaries.
Claude AWS uses unknown types with type guards for React Query mutation context (a defensible pattern but verbose).
Kimi uses the satisfies operator for optimistic context typing, which is a more modern TypeScript pattern.
Manual is the only one with Zod schemas wired into the form library, providing runtime + compile-time safety on user input. Claude Max also validates form input with Zod, but calls it manually.
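
The two context-typing approaches named above (satisfies, and unknown with a type guard) can be contrasted in a short sketch. The OptimisticContext shape is hypothetical and chosen only for illustration; neither snippet is taken from the Kimi or Claude AWS code. It assumes TypeScript 4.9 or later.

```typescript
type Role = 'Admin' | 'KeyCreator' | 'ReadOnly';

// Hypothetical context shape for an optimistic role change.
interface OptimisticContext {
  previousRoles: Record<number, Role>;
}

const ROLES: readonly string[] = ['Admin', 'KeyCreator', 'ReadOnly'];

// `satisfies` checks the literal against the type at compile time
// while keeping the literal's own, narrower inferred type.
const context = {
  previousRoles: { 1: 'Admin', 2: 'ReadOnly' },
} satisfies OptimisticContext;

// Type guard for a context that arrives typed as `unknown`.
// More verbose, but it also validates the value at runtime.
function isOptimisticContext(value: unknown): value is OptimisticContext {
  if (typeof value !== 'object' || value === null) return false;
  if (!('previousRoles' in value)) return false;
  const roles = value.previousRoles;
  if (typeof roles !== 'object' || roles === null) return false;
  return Object.values(roles).every(
    (r) => typeof r === 'string' && ROLES.includes(r),
  );
}
```

The trade-off is roughly the one the observations describe: satisfies costs nothing at runtime, while the guard is longer but catches malformed values that the compiler cannot see.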

5. Escape Hatches

Total eslint-disable / @ts-ignore / @ts-expect-error counts in non-generated source files:

	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
Total	16	34	25	25	36	28	25
In components/resolvers	3	3	3	3	13	11	3

Observations

Manual has the fewest escape hatches overall (16) and only 3 in feature-specific code.
Kimi (36 total, 13 in feature code) and MiniMax (28 total, 11 in feature code) have significantly more escape hatches in their team management code. Mostly @typescript-eslint/no-unnecessary-condition and no-unused-vars suppressions.
Claude Max has the second-highest total (34, behind Kimi's 36) but only 3 in feature code; the rest are in shared utilities and HTTP services, suggesting more aggressive infrastructure code.
All experiments share the same 3 baseline escapes in query.tsx for external library constraints.

6. Code Structure & Organisation

Component Decomposition

	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
Approach	Monolithic	Organism	Monolithic	Molecule	Molecule	Mono + Atom	Molecule
TeamPage.tsx	867 lines	389 lines	1045 lines	610 lines	414 lines	708 lines	414 lines
Separate modals	0	4	0	1	3	1	1
Separate row component	0	1	0	1	1	0	1
Custom atoms	0	0	0	3	0	0	0
Custom hooks	0	1	0	0	0	0	0
Component files changed	21	39	8	20	22	26	20

Observations

Manual and GPT take a monolithic approach: all modals and member row components are inline within TeamPage.tsx. The manual implementation is more cohesive despite its 867-line file.
Claude Max has the best decomposition: TeamPage is a 389-line orchestrator with 4 separate organism-level modals, a TeamMemberRow molecule, and a UserLimitsDisplay molecule. However, it also touched the most files (39 component files, 44 overall), suggesting a high blast radius.
Claude AWS created custom Button, Input, and Dialog atoms. While well-structured, these are unnecessary since the codebase already has @base-ui/react and Tailwind. A classic AI over-engineering pattern.
GPT has the most compact footprint (8 component files, 12 total files changed) but the largest single file (1045 lines). All logic crammed into one file.

Resolver Organization

	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
File structure	Single file	Per mutation	Single + tests	3 files	Per mutation	2 files	Per mutation
Transform layer	Y	Y	Y	Y	Y	Y	Y
Mock state	Effect Ref	Effect Ref	mock executors	Effect Ref	Effect Ref	Effect Ref	Effect Ref

Optimistic UI Architecture

Two distinct patterns emerged:

Pattern A: React Query Cache Manipulation

onMutate  → cancelQueries → snapshot → setQueryData → return context
onError   → rollback from context
onSettled → invalidateQueries

Canonical React Query optimistic update pattern. Keeps server state as source of truth.

Used by: Manual, Claude Max, Claude AWS, Kimi.
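
The snapshot, optimistic write, and rollback steps of Pattern A can be illustrated without any library. The sketch below uses a hypothetical MemberCache class as a stand-in for the query cache; it is not the React Query API, and it omits the cancelQueries step, which has no equivalent in a synchronous toy cache.

```typescript
interface Member {
  id: string;
  email: string;
}

// Hypothetical stand-in for a query cache; NOT the React Query API.
class MemberCache {
  private members: Member[] = [];
  get(): Member[] {
    return this.members;
  }
  set(next: Member[]): void {
    this.members = next;
  }
}

interface MutationContext {
  snapshot: Member[];
}

// onMutate: snapshot the cache, then write the optimistic entry.
function onMutate(cache: MemberCache, email: string): MutationContext {
  const snapshot = cache.get();
  // Placeholder id '-' mirrors the manual implementation's approach.
  cache.set([...snapshot, { id: '-', email }]);
  return { snapshot };
}

// onError: roll back to the snapshot captured in onMutate.
function onError(cache: MemberCache, context: MutationContext): void {
  cache.set(context.snapshot);
}

// onSettled: replace the cache with what the server reports
// (stands in for invalidating and refetching the query).
function onSettled(cache: MemberCache, serverMembers: Member[]): void {
  cache.set(serverMembers);
}
```

Because every path ends in onSettled, the cache converges on server data whether the mutation succeeded or failed, which is the property Pattern B gives up.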

Pattern B: Local State Management

setMembers(prev => [...prev, optimisticMember])
mutation.mutateAsync()
onSuccess → replace temp member
onError   → restore from closure

Uses useState as source of truth. Simpler but bypasses React Query's cache invalidation, risking stale data.

Used by: GPT, MiniMax, Mistral.
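
For comparison, a minimal sketch of Pattern B follows. createLocalState is a hypothetical stand-in for a useState pair, and createUser is an assumed API function passed in by the caller; none of this is taken from the GPT, MiniMax, or Mistral code.

```typescript
interface Member {
  id: string;
  email: string;
}

// Hypothetical local-state holder standing in for a useState pair.
function createLocalState(initial: Member[]) {
  let members = initial;
  return {
    get: (): Member[] => members,
    set: (update: (prev: Member[]) => Member[]): void => {
      members = update(members);
    },
  };
}

async function inviteWithLocalState(
  state: ReturnType<typeof createLocalState>,
  email: string,
  createUser: (email: string) => Promise<Member>,
): Promise<void> {
  const previous = state.get(); // captured in the closure for rollback
  const temp: Member = { id: `temp-${email}`, email };
  state.set((prev) => [...prev, temp]);
  try {
    const created = await createUser(email);
    // onSuccess: swap the temporary member for the server's version.
    state.set((prev) => prev.map((m) => (m.id === temp.id ? created : m)));
  } catch {
    // onError: restore the list captured before the optimistic write.
    state.set(() => previous);
  }
}
```

Note that nothing here ever re-reads server state, and restoring `previous` on error also discards any other change made to the list while the request was in flight.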

Observations

Manual uses Pattern A with placeholder IDs (id: '-') and loading indicators on optimistic items.
Claude Max abstracted Pattern A into a reusable useOptimisticMutation hook with generic types.
MiniMax and Mistral use Pattern B with hardcoded 1-second delays instead of actual API calls in role change and remove flows, indicating incomplete integration.

7. Type Inference vs Manual Types

The codebase provides generated types from OpenAPI codegen (types.gen.ts, effect-schema.gen.ts, effect-service-interface.gen.ts) and GraphQL codegen (graphql.ts). Idiomatic usage infers types from these sources rather than manually re-declaring them.

Aspect	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
Resolver type inference	`S.Schema.Type`	`S.Schema.Type`	`S.Schema.Type`	`S.Schema.Type`	`S.Schema.Type`	`S.Schema.Type`	`S.Schema.Type`
Component types from codegen	Y	Partial	N	Partial	N	N	N
Manual type/interface in UI	0	1	2	1	1	0 (inline)	2
Role type source	Imported	Transform fn	Manual Record	Manual union	Manual union	Raw strings	Manual union
Inline object return types	–	–	–	–	–	–	Y
Type/schema drift risk	None	Low	High	Medium	High	High	High

Observations

Manual never re-declares types that exist in generated code. Resolver input types inferred via S.Schema.Type<typeof Schema>, return types imported from graphql.ts. Components import Role and User directly from generated code.
Claude Max infers properly in resolvers but defines a custom TeamMember interface in the component layer that duplicates fields from the generated User type. Low drift risk since the transform layer bridges the gap.
GPT manually declares MemberRole and TeamMember inside TeamPage.tsx with different shapes than generated types (e.g., id: string vs generated id: number). Also duplicates role mapping logic already in transforms.
Claude AWS defines MemberRole as a union type but then uses role: string | null in its component props, undermining the type.
Kimi defines MemberRole as 'ADMIN' | 'KEY_CREATOR' | 'READ_ONLY' in UPPER_SNAKE_CASE, which doesn't match the generated schema's 'Admin' | 'KeyCreator' | 'ReadOnly' casing. A naming mismatch that could cause runtime bugs.
MiniMax avoids explicit type declarations but falls back to raw role strings with a hardcoded 'MEMBER' default that doesn't exist in the schema.
Mistral uses inline object types as return types in transforms.ts (e.g., a return annotation such as `): { id: number; email: string; ... }`) instead of importing the generated type. These are not reusable and can drift from the source schema.
No AI agent used TypeScript utility types (Pick, Omit, Extract, ReturnType) to derive types from generated ones. All either imported directly or re-declared manually.
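
As an illustration of what deriving types from generated ones could look like, here is a minimal sketch. GeneratedUser is a hypothetical stand-in for a codegen type (its createdAt field is invented for the example); the real types live in the generated files named above.

```typescript
// Hypothetical stand-in for a generated type; not the real codegen output.
interface GeneratedUser {
  id: number;
  email: string;
  role: 'Admin' | 'KeyCreator' | 'ReadOnly';
  createdAt: string;
}

// Derive instead of re-declaring: field names and types follow the source.
type TeamMember = Pick<GeneratedUser, 'id' | 'email' | 'role'>;
type Role = GeneratedUser['role'];

function toTeamMember(user: GeneratedUser): TeamMember {
  return { id: user.id, email: user.email, role: user.role };
}

// Exhaustive label map: adding a role to GeneratedUser makes this
// object fail to compile until the new role is handled.
const roleLabels: Record<Role, string> = {
  Admin: 'Admin',
  KeyCreator: 'Key creator',
  ReadOnly: 'Read only',
};
```

With this approach, a change such as id switching from number to string, or a renamed role, surfaces as a compile error in the UI layer rather than as a silent mismatch.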

8. Cost & Total Cost Effectiveness

The manual implementation took 14 hours of net development time. Each AI agent produced a first draft requiring additional human effort (prompting rounds + manual adjustments) to reach parity with the manual baseline.

Raw AI Cost

	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
AI cost	–	$4	$10	$20	$7	$6	$30

Estimated Remediation to Reach Manual Parity

Each AI output has specific gaps versus the manual baseline. Estimates assume a senior developer familiar with the stack.

Gap category	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
Replace useState forms → RHF + zod	–	2–3h	2–3h	2–3h	2–3h	2–3h
Optimistic UI → RQ cache pattern	–	2–3h	–	–	3–4h	3–4h
Replace hardcoded setTimeout → API	–	–	–	–	2–3h	2–3h
Fix manual types → codegen imports	0.5h	1–2h	0.5h	1–2h	1h	1–2h
Add self-removal prevention	–	–	1h	1h	1h	1h
Add role permission warnings	–	–	1–2h	1–2h	1–2h	1–2h
Implement isCurrentUser (not TODO)	–	–	–	1h	–	–
Remove scope creep / unused deps	0.5h	–	1h	–	0.5h	–
Decompose monolith	–	2–3h	–	–	–	–
Clean up escape hatches	1h	–	–	1–2h	1–2h	–
Fix test quality issues	–	–	–	–	1–2h	–
Total remediation	2–3h	8–11h	5–8h	7–10h	12–17h	10–14h

Total Cost of Ownership

Assuming a senior developer rate of $100/h (loaded cost). Additional AI prompting costs estimated at ~50% of original AI spend per remediation round.

	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
Developer time	14h	2–3h	8–11h	5–8h	7–10h	12–17h	10–14h
Developer cost @$100/h	$1,400	$200–300	$800–1,100	$500–800	$700–1,000	$1,200–1,700	$1,000–1,400
AI cost (initial)	–	$4	$10	$20	$7	$6	$30
AI cost (remediation)	–	~$2	~$5	~$10	~$4	~$3	~$15
Total estimated	$1,400	$206–306	$815–1,115	$530–830	$711–1,011	$1,209–1,709	$1,045–1,445
vs Manual	baseline	78–85% savings	20–42% savings	41–62% savings	28–49% savings	−22% to 14%	−3% to 25%
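
The arithmetic behind the table can be reproduced with a short sketch. The $100/h rate and the ~50% remediation factor come from the text above; the function names are hypothetical, and rounding the remediation AI cost to whole dollars is an assumption that happens to match the table's values.

```typescript
interface Range {
  low: number;
  high: number;
}

const HOURLY_RATE = 100; // USD per hour, as stated in the text
const REMEDIATION_AI_FACTOR = 0.5; // ~50% of the initial AI spend

// Developer cost plus initial and remediation AI cost, in USD.
function totalCost(hours: Range, aiCost: number): Range {
  const aiTotal = aiCost + Math.round(aiCost * REMEDIATION_AI_FACTOR);
  return {
    low: hours.low * HOURLY_RATE + aiTotal,
    high: hours.high * HOURLY_RATE + aiTotal,
  };
}

// Savings in percent relative to the manual cost. The highest total
// yields the lowest savings, so the range ends are swapped.
function savingsVsManual(total: Range, manualCost: number): Range {
  const pct = (cost: number): number =>
    Math.round((1 - cost / manualCost) * 100);
  return { low: pct(total.high), high: pct(total.low) };
}

const claudeMax = totalCost({ low: 2, high: 3 }, 4);
// claudeMax is { low: 206, high: 306 }
const claudeMaxSavings = savingsVsManual(claudeMax, 1400);
// claudeMaxSavings is { low: 78, high: 85 }
```

Running the same functions with MiniMax's inputs (12–17h, $6) gives $1,209–1,709 and −22% to 14%, matching the table row.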

Midpoint of the total estimated cost range:

	Manual	Claude Max	GPT	Claude AWS	Kimi	MiniMax	Mistral
Midpoint total	$1,400	~$256	~$965	~$680	~$861	~$1,459	~$1,245

Observations

Claude Max is the clear winner: $4 AI cost + 2–3h remediation = ~$250 total, an ~80% cost reduction vs manual. Gaps are minor (type cleanup, scope creep removal).
Claude AWS is second-best in total cost (~$680 avg) despite high AI cost ($20), because the output is structurally sound and needs mostly feature additions rather than architectural rework.
GPT requires significant rework (monolith decomposition + optimistic UI rewrite) but no API integration fixes, landing mid-range.
Kimi would be competitive if not for the isCurrentUser TODO, role casing mismatch, and missing features that compound into 7–10h of fixes.
MiniMax and Mistral may cost more than manual development once remediation is factored in. The hardcoded setTimeout delays and local-state optimistic UI require near-complete architectural rework of mutation handling.
Mistral has the highest AI cost ($30) and the second-highest remediation hours (10–14h); MiniMax has the highest estimated total (~$1,459) because of its 12–17h of remediation.
At $100/h, only Claude Max delivers unambiguous ROI. Claude AWS, Kimi, and GPT offer moderate savings (roughly 20–62% across their ranges). MiniMax and Mistral are cost-neutral to net-negative.

Summary Assessment

Closest to Manual Quality

Claude Max

Best decomposition, second-most test files and most story files, proper RQ optimistic updates, custom hook abstraction. Over-delivered on scope and file count.

Claude AWS

Good structure, strongest E2E coverage (6 specs), proper React Query patterns. Created unnecessary custom atoms.

Kimi

Good decomposition, proper patterns, but left isCurrentUser unimplemented (TODO).

Furthest from Manual Quality

MiniMax / Mistral

Hardcoded delays instead of real API integration. Local state instead of RQ cache. More escape hatches in feature code.

GPT

Most compact changeset (12 files) but largest monolith (1045 lines). Good test patterns but poor decomposition and local state optimistic UI.

Key Differentiator: Form Handling

The manual implementation's use of react-hook-form + zod provides schema-validated input, form state management (dirty, touched, isSubmitting), proper reset behavior, and type-safe form data extraction.
No AI agent chose to use the form library despite it being available in package.json. All fell back to manual useState + onChange handlers.

Key Differentiator: Optimistic UI Pattern

The split between React Query cache-based (Pattern A) and local state-based (Pattern B) optimistic updates is the most significant architectural difference.
Pattern A is the idiomatic approach for React Query applications; Pattern B risks cache inconsistency.
Manual, Claude Max, Claude AWS, and Kimi got this right. GPT, MiniMax, and Mistral did not.