How development using AI agents compares to human developer speed
Recently I was lucky enough to run an experiment of developing a project using purely AI tools.
Instead of a team of developers it was only me, with AI tools of my choice. At the moment the experiment is 4 months old :) and still going; of that, I'd say at least 3 months and 1 week were pure effort spent on this project.
There were complications: I needed to touch other services within the company codebase, owned by other teams, some of them critical like payments, which ate a good piece of the overall timeline. But the core is a single service with public pages, pages hidden behind auth, an admin area, some internal APIs, workers and cron jobs. So, a pretty typical project.
At some point I started to doubt whether I was outputting results at the full possible speed. A lot of people say that with AI they built something x10 faster than they would have without it, but it hardly seems reasonable that this project would have taken 40 developer-months without AI agents. I changed approaches and AI tools during this experiment, and at the moment I have a setup where I'm fully dedicated to reviewing that gigantic AI output, and I still cannot see where the x2...3...4...etc. speedup would come from.
And while measuring development speed is something the industry hasn't solved well, I still want to compare the process before I started relying heavily on AI agents with my current approach, in terms of how much time it saves. Typical organizational approaches to measuring velocity won't work here, as I have no story points, hour estimates, or team. So I'll try to measure a single task, relying purely on my understanding of my own development process with AI agents and without.
Setting the stage
First of all, it is important to agree on the setup and on some simplifications, given how unpredictable the real world is. I will outline the task, describe the project setup, and then walk through the development process done by a human directly and the one done with AI agents.
The project
Imagine an existing project where the developer knows the codebase, setup and technologies used. It is a NodeJS backend with a PostgreSQL DB and a ReactJS frontend with MobX as the state manager. The code is written in TypeScript. There are around 30k LOC already in the project, the DB has existing data, and the major entities already exist. And to add a spice of the real world, the project depends on a couple of internal non-public libraries, including components of a UI design system that must be used. Build and deployment automation is known, exists, and can be used as is.
The task
Let's imagine that a developer needs to build a simple feature that touches all the components: a public-facing page where the user can search entities that live in one DB table. Search by a string query, with a couple of filters like min/max price and similar, and with paging.
The page has a reference design, but it is incomplete, missing details like mobile views and some interactions, so it specifies the target but still leaves gaps to be figured out during development.
Let's scope this feature to the simplest implementation: no strict performance measurements or extra UX features; for the sake of this article we'll even skip reflecting state in the URL.
Pretty much clear what needs to be changed / added:
- An API that takes user input as a request and returns paged data. The request with filters should be validated, and under the hood it should go to the DB with a valid and safe query (a small sketch of such a contract follows this list)
- Frontend, including a page on some application route and a MobX store that holds the state and acts as a mediator between rendering and the backend
- Integrating it all together, including due diligence tasks like error handling and testing
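To make the scope concrete, here is a minimal sketch of what that declarative part could look like. All names here (SearchRequest, SearchResponse, the price filters) are hypothetical and only illustrate the shape of the contract, not the actual project code.

```typescript
// Hypothetical contract for the search endpoint (names are illustrative).
export interface SearchRequest {
  query?: string;      // free-text search string
  minPrice?: number;   // "price greater than" style filter
  maxPrice?: number;   // "price lower than" style filter
  page: number;        // 1-based page index
  pageSize: number;    // items per page, capped server-side
}

export interface SearchResultItem {
  id: string;
  title: string;
  price: number;
}

export interface SearchResponse {
  items: SearchResultItem[];
  total: number;       // total matching rows, used for paging controls
  page: number;
  pageSize: number;
}
```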
The setup
Since it is an existing project, let's use fairly common setups for both manual development and AI agent based development:
- Tests exist: both integration and unit tests, for both frontend and backend code.
- There is linting and automated formatting.
- There is the ability to run the project locally with all the needed components.
- The frontend has a hot-reload feature, so the developer sees most visual changes right away.
For human development we use an IDE the developer is experienced and proficient with. Tests and formatting run automatically after every change, so feedback during development is as fast as possible and the developer focuses only on producing code.
For AI agent based development we are going to utilize the same setup, but with additions and modifications:
- AI agent guidelines exist and are polished, for instance a CLAUDE.md for Claude Code that has been tuned on previous tasks so the agent actually follows it
- The guidelines instruct the AI agent to run formatting / lint / build / test to validate the result, and any issues must be fixed. Whatever can be fixed automatically, e.g. formatting, is fixed automatically.
- A prepared allow list of the commands the AI agent can run without asking permission
So basically the AI agent setup is done in a way that a developer can give it a task and expect, at minimum, a result that runs right away. There are no limits on AI agent token usage and cost, but there are the hard limits of the current LLM context window, e.g. 200k tokens for Claude models.
Let's do an imaginary "manual" development, following the developer's actions
This part gets tricky. I have to make some approximations and assumptions here. The final goal is to measure time spent on development, but just throwing out a single number doesn't feel right, because it would be an expert estimate, and as I said at the beginning, I've started to doubt my expertise a bit.
So let's define categories of tasks that a developer does, with a baseline time for each. The baseline is for one unit of the category; we can apply multipliers later on.
Short Name | Estimate | Description | Example |
---|---|---|---|
Thinking | 10m | Planning the next step or initial brainstorming. | Sketching a feature plan |
Idle | 5m | Taking a break, such as a bio-pause or refreshing. | Grabbing a coffee, existential crisis about code quality, restroom break |
Declarative Coding | 5m | Writing code without runtime logic—mainly contracts or type/interface declarations for APIs. | Defining a TypeScript interface |
Runtime Coding | 15m | Implementing executable code and test cases that verify the implementation works correctly. | Writing a JS function with accompanying unit tests |
UI Coding | 15m | Creating purely UI-related code, such as HTML/CSS, and visually confirming its correctness. | Coding a button style in a React component |
Reading Documentation | 10m | Reviewing relevant documentation or component/library designs to inform development. | Reading API docs before using a library |
Refactoring | 10m | Improving existing code structure/quality without changing its behavior. | Renaming variables or reorganizing functions |
Looks like we have a decent list to start from, so let's move on to the manual development breakdown.
Human takes the lead
So let's imagine a human workflow
- Initial planning - As we start the task, the initial planning could be a bit more involved - so x2 (Thinking)
- Backend API design - We agree to start from the backend and do some declarative coding of the API: request, response, some sub-contracts x3 (Declarative Coding)
- API validation - Now we move to implementation: validation for the API, which is the runtime part x1 (Runtime Coding)
- Database filter logic - Implementation of the filters over the existing data in the DB. We need to write some SQL / ORM queries and test them right away; given that we have a couple of different filters plus paging, let's say x3 (Runtime Coding). A rough sketch of this step follows the workflow list.
- Code cleanup - Given that we made a bit of a mess getting everything to work, we spend some time on refactoring x1 (Refactoring)
- Break time - Huh, nice run, it works, let's do a small break x1 (Idle)
- Next step planning - Now let's think on the next step x1 (Thinking)
- Frontend route setup - With the next step defined, we start working on the frontend: add a new route for the page with some boilerplate and check that it works, which is UI coding x1 (UI Coding)
- API contracts for frontend - We need to fetch data, but first we need the shape of the API contracts and the state, which goes into Declarative Coding x2 (Declarative Coding)
- Store implementation - Now we can draft a basic store that holds the filter state, fetches data from the API, keeps that data as state, and reloads it when related state changes. That is Runtime Coding x3 (Runtime Coding)
- Basic data rendering - Let's bind the store to some UI components and render just the data first. But we need to find a proper component and understand how to use it, which is reading documentation x1, and then we can do UI coding x1. Boom! We have the first visual results, and it even works (shocking, I know) (Reading Documentation + UI Coding)
- Filter UI planning - Think about the UI approach for filters - what components we need, how to structure the search and filter interface x1 (Thinking)
- Design system research - Need to read documentation of the internal org UI design system library to find the right filter components and understand how to use them properly x3 (Reading Documentation)
- Filter components - Now we need to implement the actual search and filter UI components - input fields, dropdowns, etc. This requires some design decisions and component selection x3 (UI Coding)
- Store integration - We need to wire up the form inputs to the MobX store state, handling user interactions and triggering API calls on changes x2 (Runtime Coding)
- Backend bug discovery - During testing we discover some bugs in our backend filter logic - need to debug and figure out what's wrong x1 (Thinking)
- Backend fixes - Fix the backend issues we found during frontend testing x1 (Runtime Coding)
- UX improvements - Let's add proper loading states, error handling, and empty states to make the UX smooth x3 (Runtime Coding)
- Pagination controls - Time to implement pagination controls and wire them to the store x1 (UI Coding)
- Paging bug fix - Spotted a bug with paging implementation on backend during frontend testing, but it's an easy fix x1 (Runtime Coding)
- Long break - At some point in the implementation it's time to take a break after the exhausting frontend work, and to question some life choices x3 (Idle)
- Post-break planning - Come back and think about what's next x1 (Thinking)
- Responsive styling - Looking at the reference design, we need to style everything properly and make it responsive for mobile x3 (UI Coding)
- Performance optimization - We notice some performance issues with rapid filter changes, so let's add debouncing to the search input x1 (Runtime Coding)
- Component refactoring - Quick refactoring session to clean up component structure and extract some reusable pieces x1 (Refactoring)
- Error handling UI - Simulating errors on UI and adding error boundaries and proper styles for error messaging x2 (UI Coding)
- Final testing - Final manual testing across different screen sizes and edge cases, fixing small bugs discovered x2 (UI Coding)
- Code review prep - Code review prep - cleaning up console.logs, adding comments, ensuring code standards x1 (Refactoring)
- Final verification - Small thinking session to verify we haven't missed anything from requirements x1 (Thinking)
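As a reference point for the "API validation" and "Database filter logic" steps above, here is a rough sketch of the kind of runtime code the developer would be writing. The listings table, its columns, and the use of node-postgres are assumptions for illustration, not the actual project code.

```typescript
import { Pool } from 'pg';
import type { SearchRequest, SearchResponse } from './contracts';

const pool = new Pool(); // connection settings come from the environment

// Basic request validation: reject malformed input before touching the DB.
export function validateSearchRequest(input: SearchRequest): string[] {
  const errors: string[] = [];
  if (input.page < 1) errors.push('page must be >= 1');
  if (input.pageSize < 1 || input.pageSize > 100) errors.push('pageSize must be between 1 and 100');
  if (input.minPrice !== undefined && input.maxPrice !== undefined && input.minPrice > input.maxPrice) {
    errors.push('minPrice must not exceed maxPrice');
  }
  return errors;
}

// Filtered, paged query built with positional parameters to keep it safe from SQL injection.
export async function searchListings(req: SearchRequest): Promise<SearchResponse> {
  const conditions: string[] = [];
  const values: unknown[] = [];

  if (req.query) {
    values.push(`%${req.query}%`);
    conditions.push(`title ILIKE $${values.length}`);
  }
  if (req.minPrice !== undefined) {
    values.push(req.minPrice);
    conditions.push(`price >= $${values.length}`);
  }
  if (req.maxPrice !== undefined) {
    values.push(req.maxPrice);
    conditions.push(`price <= $${values.length}`);
  }

  const where = conditions.length ? `WHERE ${conditions.join(' AND ')}` : '';
  const offset = (req.page - 1) * req.pageSize;

  const { rows } = await pool.query(
    `SELECT id, title, price, COUNT(*) OVER() AS total
     FROM listings ${where}
     ORDER BY title
     LIMIT $${values.length + 1} OFFSET $${values.length + 2}`,
    [...values, req.pageSize, offset],
  );

  return {
    items: rows.map((r) => ({ id: r.id, title: r.title, price: Number(r.price) })),
    total: rows.length > 0 ? Number(rows[0].total) : 0,
    page: req.page,
    pageSize: req.pageSize,
  };
}
```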
Summary Table
Task Type | Time per Unit | Count | Total Time |
---|---|---|---|
Thinking | 10m | 7 | 70m |
Idle | 5m | 4 | 20m |
Declarative Coding | 5m | 5 | 25m |
Runtime Coding | 15m | 15 | 225m |
UI Coding | 15m | 13 | 195m |
Reading Documentation | 10m | 4 | 40m |
Refactoring | 10m | 3 | 30m |
Grand Total: 605 minutes = 10.08 hours or ~10 hours
Now let's do the same using an AI agent
As with the human, we'll need some approximation and simplification, but let's try to stay at least somewhat grounded in reality.
I will also describe this from my experience working with AI agents at the time of writing, which in my case means either JetBrains Junie or Claude Code. At the moment they are quite similar in output: Junie slightly outperforms Claude in quality, and Claude wins on developer experience and speed. With a setup tuned for AI coding agents, the typical flow is to throw a task at the agent, review the result, iterate with the AI again if it's too far off, and do some slight manual tuning at the end if it's mostly good. Don't forget that every AI agent loop ends with validation, such as running build / tests / formatting / lints, with possible additional loops to self-correct. Some activities also stay the same as in the manual flow, since AI agents are not endlessly powerful. To simplify, let's assume that every AI generation ends with a human review, and that review time depends on the size of the generated code: bigger output takes more time to review than smaller output.
Let's outline our interaction types during AI agent based development.
Short Name | Estimate | Description | Example |
---|---|---|---|
Thinking | 5m | Planning the next step or initial brainstorming. Less time than before, since some decisions are delegated to the AI. | Decomposing a feature. |
Quick and Scoped AI Generation | 10m | Typically a well-understood task with a scoped set of files known in advance. It might be a targeted bug fix or a very small decomposed piece of the feature. | Adding a typical validation to an API method. |
Full AI Generation | 20m | A task that might involve multiple changes, like adding new files and changing existing ones in multiple places. | Adding a new API method, involving API, tests, logic, DB changes, etc. |
AI Output Review | 8m | Carefully reviewing large AI-generated code blocks for correctness, integration issues, and adherence to project patterns. | Reviewing a complete API implementation with tests. |
UI Corrections | 10m | Fixing styles after AI-generated frontend work and checking the results in the browser. | Correcting element alignment or sizes. |
Reading Documentation | 10m | Reviewing relevant documentation or component/library designs to inform development. | Reading API docs before using a library |
Refactoring | 10m | Improving existing code structure/quality without changing its behavior. | Renaming variables or reorganizing functions |
A note on "Idle": it is not forgotten here :) as we are still human, but to simplify we assume we do other things while the AI generates... Now let's imagine how the same task would be done using AI agents, based on the activities above.
OK Claude, take this file...
- High-level decomposition - Thinking on high-level decomposition x2 (Thinking)
- API development - Throwing a description of the task at the AI agent, asking it to build the API x1 (Full AI Generation)
- API code review - Review the generated API code thoroughly x1 (AI Output Review)
- API fixes - After review, we see some issues that are better to fix right away so we do x2 (Quick and Scoped AI Generation)
- API refactoring - Do some refactoring to finish the API x1 (Refactoring)
- Next step planning - Now let's think on the next step x1 (Thinking)
- Store and client generation - Now we ask the AI agent to generate the API client for the UI and the store needed for the feature. Here experience kicks in: asking it to generate everything at once, components included, doesn't give stable results, so we only do the core prep, which counts as a full generation x1 (Full AI Generation). See the store sketch after this list.
- Store code review - Review the generated store and client code x1 (AI Output Review)
- Store preparation - Do some manual refactoring right away to shape the store for the next AI loop x1 (Refactoring)
- Documentation research - Now we need to read some documentation to supply the AI with correct instructions for the next step x1 (Reading Documentation)
- Data table generation - Now let's ask the AI to generate the visual for the data table, which is a full generation x1 (Full AI Generation). When it finishes it even works... a technological miracle, but there are still things to do. A sketch of this kind of component follows the list.
- Filter UI generation - Right away we ask it to add the filters to the UI, which is another full generation x1 (Full AI Generation). It is a bit harder for the AI than the previous one, as more small components need to be created, but it tackles it successfully, heating up the planet a little in the process.
- UI components review - Review the generated UI components and integration x1 (AI Output Review)
- Behavior fine-tuning - After the review we find a couple of things to fine-tune behavior-wise, so we do x3 (Quick and Scoped AI Generation)
- Visual adjustments - Now we need to tune the visuals, checking results in the browser; since we generated a lot at this stage, it is UI Corrections x4 (UI Corrections)
- Component refactoring - Some refactoring of components after all the work x1 (Refactoring)
- Corner case fixes - Further testing reveals some corner cases and performance issues, so we fix them using a couple of quick generation sessions x2 (Quick and Scoped AI Generation)
- Final verification - Small thinking session to verify we haven't missed anything from requirements x1 (Thinking)
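For a feel of what lands in review during the "Store and client generation" step, here is a condensed sketch of such a MobX store, including the debounced reload that also shows up in the manual workflow. The store shape and the fetchSearchResults client function are illustrative assumptions, not the project's real code.

```typescript
import { makeAutoObservable, reaction, runInAction } from 'mobx';
import type { SearchRequest, SearchResponse } from '../api/contracts';
import { fetchSearchResults } from '../api/client'; // hypothetical API client wrapper

export class SearchStore {
  query = '';
  minPrice: number | undefined = undefined;
  maxPrice: number | undefined = undefined;
  page = 1;
  pageSize = 20;

  result: SearchResponse | null = null;
  loading = false;
  error: string | null = null;

  constructor() {
    makeAutoObservable(this);
    // Reload whenever a filter or the page changes; `delay` debounces rapid keystrokes.
    reaction(
      () => [this.query, this.minPrice, this.maxPrice, this.page],
      () => void this.load(),
      { delay: 300 },
    );
  }

  setQuery(value: string) {
    this.query = value;
    this.page = 1; // a new search always starts from the first page
  }

  async load() {
    this.loading = true;
    this.error = null;
    const request: SearchRequest = {
      query: this.query || undefined,
      minPrice: this.minPrice,
      maxPrice: this.maxPrice,
      page: this.page,
      pageSize: this.pageSize,
    };
    try {
      const response = await fetchSearchResults(request);
      runInAction(() => {
        this.result = response;
      });
    } catch (e) {
      runInAction(() => {
        this.error = e instanceof Error ? e.message : 'Failed to load results';
      });
    } finally {
      runInAction(() => {
        this.loading = false;
      });
    }
  }
}
```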
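And since the next steps generate the data table and filters, here is a similarly condensed sketch of the kind of component the review covers. In the real project the internal design-system components would be used; plain HTML elements stand in for them here.

```tsx
import React from 'react';
import { observer } from 'mobx-react-lite';
import type { SearchStore } from '../stores/SearchStore';

// Hypothetical results table bound to the store from the previous sketch.
export const SearchResults = observer(({ store }: { store: SearchStore }) => {
  if (store.loading) return <p>Loading...</p>;
  if (store.error) return <p role="alert">{store.error}</p>;
  if (!store.result || store.result.items.length === 0) return <p>No results found</p>;

  return (
    <table>
      <thead>
        <tr>
          <th>Title</th>
          <th>Price</th>
        </tr>
      </thead>
      <tbody>
        {store.result.items.map((item) => (
          <tr key={item.id}>
            <td>{item.title}</td>
            <td>{item.price}</td>
          </tr>
        ))}
      </tbody>
    </table>
  );
});
```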
Summary
Task Type | Time per Unit | Count | Total Time |
---|---|---|---|
Thinking | 5m | 4 | 20m |
Quick and Scoped AI Generation | 10m | 7 | 70m |
Full AI Generation | 20m | 4 | 80m |
AI Output Review | 8m | 3 | 24m |
UI Corrections | 10m | 4 | 40m |
Reading Documentation | 10m | 1 | 10m |
Refactoring | 10m | 3 | 30m |
Grand Total: 274 minutes = 4.57 hours or ~4.5 hours
What does it tell us?
So we arrive at roughly a 2x speedup from using AI. But is it real?
First of all, let's agree to treat this as a thought experiment. To make it real we would need the same developer doing the same task in parallel realities, manually and with AI, so that no prior knowledge of the task implementation leaks from one run to the other. Which is hardly achievable.
The numbers I've used here are essentially educated guesses based on my experience - the 15 minutes for "Runtime Coding" or 8 minutes for "AI Output Review" aren't scientifically measured but rather approximations of how I perceive these activities. This makes the entire exercise more of a framework for thinking about AI impact rather than a proof of specific speedup claims.
Yes, at some scale you can measure the statistical behavior of similar tasks under similar conditions, as METR did in their study of AI's impact on the speed of experienced developers working in codebases they know well. Funny enough, that study showed that developers who know the codebase very well, like maintainers of OSS projects, actually got about 20% slower on average, while the thought experiment above suggests roughly a 2x speed increase.
This contradiction is telling. It suggests that the reality of working with AI tools is more complex than breaking development into discrete time boxes. The overhead of context management, the cognitive load of constantly reviewing AI output, and the integration challenges of AI-generated code aren't fully captured in my simplified model.
What this exercise does show is that for certain types of tasks, under ideal conditions, significant speedups are theoretically possible. And I want to emphasize that by "tasks" I mean real production features in long-term projects with existing codebases, technical debt, and maintainability requirements - not quick throwaway MVPs or greenfield prototypes where you can be loose with code quality. The constraints of working within established patterns, ensuring backward compatibility, and maintaining code that others will work on for years change the equation significantly.
The key insight isn't the specific 2x number, but rather understanding which categories of work benefit most from AI assistance and which don't.
From this breakdown, it's clear that AI shines most in:
- Boilerplate and repetitive coding tasks
- Initial implementation drafts that humans can refine
- Reducing the time spent on "typing" versus "thinking"
But it struggles with:
- Complex integration scenarios
- Subtle bugs that require deep system understanding
- Tasks requiring significant domain context that doesn't fit in the context window
- Long-term maintainability considerations that aren't immediately obvious
The probabilistic nature of AI output means that the time spent on any given task can vary wildly. Sometimes AI nails it perfectly, sometimes you spend hours debugging generated code that looked correct but had subtle issues. My model assumes the "happy path" more often than reality might deliver. And in production systems, the cost of getting it wrong is much higher than in throwaway projects.
Nevertheless, I still think that meaningful speedups on certain tasks are achievable. The question isn't whether AI can make us faster - it's learning to identify which tasks are suitable for AI assistance and which aren't. This becomes an important skill: knowing when to reach for AI tools and when to code manually, especially when working on systems that need to be maintained and extended over time.
To improve our chances of getting the theoretical benefits, we need all the good setup practices we've used for ages:
- Auto code formatting
- Linting with rules to find issues early
- Types
- Tests on different levels, like unit / integration / e2e / etc
- Task decomposition, spec creation
In addition to the above, considering AI-specific needs:
- Keep the codebase modular so it fits the context window effectively (microservices strike again)
- All the automation tooling that AI agents run must be context/token friendly, meaning the output should contain only meaningful info (no more 100500 unresolved linter warnings)
- Speed of feedback matters. Given how many mistakes AI can make and how often all the tools need to be re-run, feedback should come in seconds, not minutes
- AI-friendly guidelines for the projects themselves
- LLM consumable documentation for internal libraries, processes, etc.
So the conclusion is that with certain tasks, in properly configured environments friendly for LLMs, speedups are theoretically achievable. But it doesn't come for free - we need to invest, improve, or sometimes fully rebuild our processes to get the gains from AI-assisted development. And most importantly, we need to get better at recognizing which tasks are actually suitable for AI assistance versus which ones are better done the traditional way.
The 2x speedup isn't a guarantee - it's a possibility under ideal conditions for specific types of work in production-grade systems. The real value of this exercise is in developing a framework for thinking about when and how AI tools can actually help in real-world development scenarios, rather than assuming they'll magically make everything faster regardless of context.