
The Shipping Throughput Doctrine: Why 180 Percent More AI Code Only Produced 20 Percent More Shipped Software, And The 5-Question Audit That Closes The Gap
There is a number quietly circulating among engineering teams this week that should reshape how every operator thinks about AI productivity.
741 percent.
That is how much more code AI coding agents helped developers write in a new study of more than 100,000 GitHub developers run by MIT and Wharton (LeadDev).
And here is the part that should stop you cold.
That 741 percent code explosion only translated into a 20 percent increase in actual software releases (LeadDev).
Your AI is writing roughly 8 times as much raw code. Your business is shipping 1.2 times as much product.
That gap, which the researchers explain with the "weak-link hypothesis," is the most important productivity story of the year, and almost no operator is measuring it (LeadDev).
The teams who close the gap this quarter will pull permanently ahead. The teams who keep cheerleading raw output will quietly bleed margin while telling themselves the AI is "10x-ing" the business.
Here is the 5-question doctrine I am running with operators this week.
What Did The MIT And Wharton Study Actually Find About AI Coding Productivity?
The researchers, led by Mert Demirer at MIT, tracked three successive waves of AI tools across more than 100,000 developers on GitHub (Sarah Guo).
Wave one was autocomplete, the original GitHub Copilot.
Wave two was synchronous agents that code alongside developers, like Claude Code.
Wave three was asynchronous autonomous agents that go off and complete tasks on their own.
Each wave drove a bigger lift in code volume.
Autocomplete pushed coding activity, measured by commits, up 40 percent (LeadDev).
Sync agents pushed the gain to 140 percent.
Async agents pushed it to 180 percent in commits and 741 percent in raw lines of code (LeadDev).
Then the researchers walked up the org chart to ask what happened to that output.
Pull requests created only rose 65 percent.
Actual software releases only rose 20 percent (LeadDev).
This is the productivity paradox in one chart. The further you walk from the keyboard, the smaller the gain.
Everything downstream of code generation (code review, integration, testing, deployment, release management) still depends on human judgment, human attention, and human bandwidth.
The bottleneck did not get faster. It got worse.
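To make the funnel concrete, here is a minimal sketch that converts the reported percentage increases into multipliers over baseline. It assumes every figure is measured against the same pre-AI baseline, which the summary above implies but does not spell out. The names `REPORTED_INCREASE_PCT` and `to_multiplier` are mine, not the study's.

```python
# Reported percentage increases for async agents, as quoted above (LeadDev).
REPORTED_INCREASE_PCT = {
    "lines_of_code": 741,
    "commits": 180,
    "pull_requests": 65,
    "releases": 20,
}


def to_multiplier(increase_pct: float) -> float:
    """A rise of X percent means (1 + X / 100) times the baseline."""
    return 1 + increase_pct / 100


multipliers = {
    stage: to_multiplier(pct) for stage, pct in REPORTED_INCREASE_PCT.items()
}

for stage, multiplier in multipliers.items():
    print(f"{stage:>14}: {multiplier:.2f}x baseline")

# Ratio of the two ends of the funnel: raw code volume per shipped release.
loc_per_release = multipliers["lines_of_code"] / multipliers["releases"]
print(f"lines of code per release: {loc_per_release:.1f}x baseline")
```

On these numbers, each release now carries roughly 7 times as many lines of raw code as before. That ratio is a rough reading of the gap, not a figure from the study.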
Why Did More AI Code Not Mean More Shipped Software?
Three forces stacked on top of each other.
First, the review tax went up. New Relic's 2026 State of AI Coding report found that 78 percent of organizations report a measurable spike in production incidents directly tied to AI-generated code, and AI code introduces 1.7 times more critical runtime issues than peer-reviewed human code (New Relic).
So every PR now costs more human review minutes, not fewer.
Second, the churn rate exploded. Code churn, meaning lines deleted shortly after they were added, jumped 861 percent in teams that moved from low to high AI adoption (Goncalo Velosa).
Code is being written, merged, and then ripped back out at almost 10 times the previous rate.
Third, the quality gates collapsed. The 2026 Agentic Coding Trends Report found that 60 percent of enterprises are now shipping untested code as AI accelerates output (DEV Community).
You can see the math now.
More code, with more bugs, more churn, less testing, all funneling through the same human reviewers and release managers.
Of course shipped output only grew 20 percent. The wonder is that it grew at all.
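One way to see the math is a toy model of a serial pipeline, where shipped output is capped by the slowest stage. This is a sketch under loud assumptions: the stage names and weekly capacities are hypothetical numbers I picked for illustration, not data from any of the reports cited above, and the model holds downstream capacity constant even though the three forces above suggest it actually shrinks.

```python
# Hypothetical weekly capacity per stage, in changes per week.
# These figures are illustrative assumptions, not data from the study.
BASELINE = {"generation": 100, "review": 110, "testing": 120, "release": 115}


def shipped_per_week(capacity: dict) -> float:
    """A serial pipeline ships no faster than its slowest stage."""
    return min(capacity.values())


def choke_point(capacity: dict) -> str:
    """Name of the stage with the lowest capacity."""
    return min(capacity, key=capacity.get)


# Assume AI makes generation 5x faster and leaves every other stage alone.
with_ai = {**BASELINE, "generation": BASELINE["generation"] * 5}

# Assume a further investment doubles review capacity.
with_wider_review = {**with_ai, "review": with_ai["review"] * 2}

for label, capacity in [
    ("baseline", BASELINE),
    ("5x generation", with_ai),
    ("5x generation + 2x review", with_wider_review),
]:
    print(f"{label}: ships {shipped_per_week(capacity)}/week, "
          f"choke point is {choke_point(capacity)}")
```

In this toy setup, a 5x jump in generation lifts shipped output from 100 to 110 per week, and the choke point moves from generation to review. Doubling review only gets you to 115, because release becomes the next constraint.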
What Is The Shipping Throughput Doctrine?
I have spent the last six days walking operators through what to actually do with this data.
Out of those conversations came what I am now calling The Shipping Throughput Doctrine.
It is built on a single principle.
Stop measuring AI productivity at the keyboard. Start measuring it at the deployment.
The unit of value in your business is not lines of code, generated drafts, or completed prompts. It is shipped output that customers receive, use, and pay for.
Code is an input. Releases are the output.
Email drafts are an input. Sent campaigns that convert are the output.
Ad concepts are an input. Live, tested ads that hit ROAS are the output.
The doctrine forces you to redraw the scoreboard.
You measure how much customer-facing work actually crossed the finish line in a week, not how much your AI tools spun up.
And you audit every AI workflow against five questions before you celebrate any productivity claim.
What Are The Five Questions In The Shipping Throughput Doctrine?
Use these on every AI workflow in your business this week.
One. What is the shipped artifact for this workflow?
Not the draft. Not the suggestion. Not the completion. The thing that hits a real customer, a real campaign, a real release. If you cannot name the shipped artifact in one sentence, the workflow does not have a finish line and the AI gains will evaporate.
Two. Where is the human review choke point?
Every workflow has one. For code, it is the PR reviewer. For content, it is the editor. For ads, it is the creative approver. For sales, it is the call review. Map it. Time it. If your AI just made the input side 5x faster, your choke point is now the constraint and every additional dollar spent on AI inputs is wasted until you widen it.
Three. What is the rework rate?
What percentage of AI output gets discarded, rewritten, or rolled back? In code, that is the 861 percent churn signal (Goncalo Velosa). In copy, it is "I had to rewrite the whole thing." In ads, it is "the AI variants never beat the control." Track this number. A rework rate above 40 percent means your AI is not productive; it is busy.
Four. What is the new failure mode?
AI code introduces 1.7x more critical runtime issues (New Relic). AI copy introduces brand voice drift. AI ads introduce policy violations. AI customer service introduces hallucinated promises. Every workflow gets a new failure mode the moment AI enters it, and 78 percent of organizations are already feeling production incidents from it (New Relic). Name the new failure mode for each workflow, then design a check specifically for it before anything ships.
Five. What is the shipped-per-week trend?
Take the artifact from question one. Count how many shipped this week, last week, the week before. If that number is not climbing faster than your AI tool spend, you do not have a productivity win, you have a generation paradox.
These five questions are the entire doctrine.
Walk every workflow through them and you will see immediately which ones are compounding and which ones are quietly burning budget.
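If you want the five questions as something you can run rather than recite, here is a minimal sketch. Everything in it is an assumption layered on the doctrine: the `WorkflowAudit` class, its field names, and the sample numbers are hypothetical, and comparing first-to-last fractional growth is only one way to read "climbing faster than your AI tool spend." The 40 percent rework threshold is the one from question three.

```python
from dataclasses import dataclass

REWORK_THRESHOLD = 0.40  # the "busy, not productive" line from question three


def growth(series: list) -> float:
    """Fractional change from the first value in a series to the last."""
    first, last = series[0], series[-1]
    if first == 0:
        return float("inf") if last > 0 else 0.0
    return (last - first) / first


@dataclass
class WorkflowAudit:
    shipped_artifact: str      # Q1: one-sentence finish line
    choke_point: str           # Q2: the human review step
    ai_outputs: int            # Q3: items the AI produced
    ai_outputs_reworked: int   # Q3: discarded, rewritten, or rolled back
    failure_mode_check: str    # Q4: check designed for the new failure mode
    shipped_per_week: list     # Q5: weekly counts, oldest first
    ai_spend_per_week: list    # Q5: weekly tool spend, same weeks

    def rework_rate(self) -> float:
        if self.ai_outputs == 0:
            return 0.0
        return self.ai_outputs_reworked / self.ai_outputs

    def findings(self) -> list:
        problems = []
        if not self.shipped_artifact.strip():
            problems.append("Q1: no shipped artifact named")
        if not self.choke_point.strip():
            problems.append("Q2: choke point not mapped")
        if self.rework_rate() > REWORK_THRESHOLD:
            problems.append("Q3: rework rate above 40 percent")
        if not self.failure_mode_check.strip():
            problems.append("Q4: no check for the new failure mode")
        if growth(self.shipped_per_week) <= growth(self.ai_spend_per_week):
            problems.append("Q5: shipped output is not outpacing AI spend")
        return problems


# Hypothetical software workflow; every number here is made up.
audit = WorkflowAudit(
    shipped_artifact="Production release",
    choke_point="PR reviewer",
    ai_outputs=200,
    ai_outputs_reworked=90,
    failure_mode_check="",
    shipped_per_week=[10, 11, 12],
    ai_spend_per_week=[1000.0, 1500.0, 2000.0],
)

for problem in audit.findings():
    print(problem)
```

This sample workflow passes questions one and two and fails the other three: 45 percent rework, no failure-mode check, and shipped output up 20 percent against spend up 100 percent.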
Why Does The 95 Percent Enterprise AI Failure Rate Confirm The Shipping Doctrine?
MIT's NANDA initiative found that 95 percent of enterprise generative AI initiatives failed to deliver measurable profit-and-loss impact (etcjournal).
This is the same paradox at the company level.
Tools were bought. Pilots were launched. Output was measured.
But shipped, paid-for, customer-impacting work did not move.
The same NANDA work found that tools purchased from outside vendors succeeded 67 percent of the time, while internally built tools succeeded only one-third as often (etcjournal).
The difference was not the model.
It was whether the organization had done the unglamorous work of redesigning the process around the tool. Defining the handoffs. Building the supervisory structure. Closing the shipping gap.
Same lesson, different altitude. The model gets the input cheap. The shipped output is still all process.
Where Should Operators Start This Week?
Pick the one workflow that has consumed the most AI budget this quarter and run it through the five questions tonight.
For most of the operators on my calendar this week, that workflow falls in one of three buckets.
Software development. Your engineers are writing 180 percent more code. Your release cadence is probably up 10 to 25 percent. The doctrine fix is to instrument shipped releases per engineer per week as a board-level metric, and to add a paid AI code review pass between PR and merge so the choke point gets wider, not narrower.
Marketing and content. Your team is drafting 5 to 10x more variants. Your published volume is probably up 20 to 40 percent. The doctrine fix is to count published, performance-tested assets per week as the only number on the wall, and to add a one-page Brand Voice Doctrine the AI must pass before anything reaches scheduling.
Sales and outreach. Your team is sending 3 to 5x more personalized touches. Your booked calls are probably up 30 to 50 percent. The doctrine fix is to count booked, qualified, showed calls per rep per week as the only number, and to add a reply-quality scorecard the AI variants must beat before they enter the cadence.
In every workflow, the doctrine adds one bottleneck-widening investment and one new failure-mode check. That is the whole recipe.
If you want help running The Shipping Throughput Doctrine across your business this quarter (mapping your workflows, sizing the gap between generation and shipped output, and building choke-point fixes that turn the 180 percent into revenue), book a one-on-one AI Implementation Session here.
We will pick the three highest-impact workflows, run the 5 questions live, and hand you a 30-day plan with shipped-output targets.
TL;DR
- A new MIT and Wharton study of more than 100,000 GitHub developers found AI coding agents drove a 180 percent increase in commits and a 741 percent increase in lines of code, but only a 20 percent increase in actual software releases (LeadDev).
- The "weak-link" bottleneck is everything after code generation: code review, integration, testing, deployment (LeadDev).
- 78 percent of organizations report production incidents directly tied to AI code, and AI code introduces 1.7 times more critical runtime issues than human code (New Relic).
- Code churn jumped 861 percent in teams moving from low to high AI adoption (Goncalo Velosa).
- 60 percent of enterprises are now shipping untested code as AI accelerates output (DEV Community).
- MIT NANDA found 95 percent of enterprise GenAI initiatives fail to deliver measurable P&L impact, and that vendor-built tools succeed about three times as often as internally built ones (etcjournal).
- The Shipping Throughput Doctrine forces every workflow through 5 questions before counting AI as productive: name the shipped artifact, map the human choke point, track the rework rate, name the new failure mode, watch the shipped-per-week trend.
- Operators who close the generation-to-shipped gap this quarter will pull permanently ahead of competitors still cheerleading raw output.
FAQ
Is the 741 percent AI code increase real or marketing hype?
The 741 percent figure comes from the MIT and Wharton study of more than 100,000 GitHub developers tracked across three waves of AI tools, ending with asynchronous autonomous coding agents (LeadDev). It measures lines of code, which is a noisy metric, but the directionally consistent finding across commits, PRs, and releases makes it credible. The gap between that lift and the 20 percent shipped-release lift is the headline.
Does the Shipping Throughput Doctrine apply outside of software?
Yes, and that is the point. The same paradox shows up in marketing, sales, customer support, and operations. Anywhere AI lowers the cost of generating an input, the human-driven step downstream becomes the new bottleneck. The five questions work for any workflow because they translate the principle into a checklist any operator can run in an hour.
How fast can I expect to see shipped-output gains from this doctrine?
In my experience with operators running the doctrine, the first measurable lift in shipped-per-week metrics shows up inside 30 days, mostly from widening one choke point and adding one new failure-mode check. The compounding gain over 90 days typically runs 30 to 60 percent in shipped output per worker, which is dramatically larger than the 20 percent industry baseline because most teams are not running any choke-point investment at all.
Why does AI-generated code cause 1.7 times more critical runtime issues?
New Relic's 2026 report attributes this to AI code bypassing the implicit context that human-authored code typically captures, namely the unwritten conventions of the codebase, the upstream and downstream dependencies, and the failure modes that human reviewers internalize over years (New Relic). The AI writes plausible code that compiles and passes unit tests, then breaks in production where the real distributed-system surface area lives.
Is shadow AI making this problem worse?
Almost certainly. Harmonic Security analyzed nearly 2 million classified AI session minutes and found that two-thirds of activity on personal free-tier AI accounts is work-related (dentro.de/ai). That means a large portion of the 180 percent productivity lift is happening outside any review or governance system, which is exactly where the failure-mode tax accumulates fastest. Treat shadow AI workflows as workflows, give them a shipped-artifact definition, and pull them under the doctrine.
The teams who pull ahead this year will not be the ones with the fastest AI. They will be the ones who closed the gap between what their AI generates and what their business actually ships.
MIT just drew the gap on a chart. Now go close it.
