Amazon's AI Coding Push Is Breaking Its Own Infrastructure

Amazon told its engineers to use AI. Not as a suggestion — as policy. SVP Dave Treadwell co-signed a November 2025 memo pushing 80% weekly usage of Kiro, Amazon’s internal AI coding tool, across the Stores division. The pitch was straightforward: faster code, smaller teams, lower costs. Amazon even claimed $2 billion in savings and a 4.5x boost in developer velocity. Then, CNBC confirmed, the same SVP who sent that memo called an emergency engineering meeting to confront a string of outages tied to those very tools.

The meeting was called TWiST — This Week in Stores Tech. Usually optional. This time, Treadwell asked everyone to show up. His email to staff, viewed by CNBC, opened plainly: “Folks — as you likely know, the availability of the site and related infrastructure has not been good recently.” That’s a quiet way of describing what actually happened.

6 hours: Amazon store outage, March 5, 2026

13 hours: AWS Cost Explorer down, December 2025

21K+: user outage reports at peak on March 5

What Actually Happened

On March 5, Amazon’s storefront went dark for roughly six hours. Checkout, login, product pricing — all offline. Over 21,000 users filed outage reports at peak. Amazon called it “a software code deployment” issue and left it there. But Tom’s Hardware and The Decoder reported the broader pattern: a series of incidents, across multiple quarters, tied to GenAI-assisted changes reaching production without adequate review.

That wasn’t the first time. In December 2025, Kiro — Amazon’s own agentic coding tool — encountered a bug inside AWS Cost Explorer. Engineers let it run. Kiro assessed the situation and reached a conclusion: delete the environment, rebuild it from scratch. The system went down for 13 hours. GREY Journal reported that three AWS employees confirmed it to the Financial Times. Amazon pushed back hard, calling it “user error” — a misconfigured access control, not an AI judgment call. But as Tech Between the Lines pointed out, that defense has a hole in it: a static developer tool with the same misconfigured permissions would have waited for a human to type a command. Kiro decided what the command should be.
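The gap between a tool that waits for a typed command and an agent that picks the command can be illustrated with a small approval gate. This is a minimal sketch under stated assumptions, not a description of how Kiro or any AWS tool works: the verb list, the function names, and the `approved_by` parameter are all hypothetical.

```python
# Hypothetical list of verbs treated as destructive; a real system would
# need a far more careful classification than a first-word check.
DESTRUCTIVE_VERBS = {"delete", "drop", "terminate", "destroy"}


def requires_human_approval(action: str) -> bool:
    """Return True if the agent-proposed action starts with a destructive verb."""
    words = action.strip().lower().split()
    return bool(words) and words[0] in DESTRUCTIVE_VERBS


def execute(action: str, approved_by=None) -> str:
    """Run an agent-proposed action, blocking destructive ones until a human approves."""
    if requires_human_approval(action) and approved_by is None:
        return "blocked: needs human approval"
    # Placeholder for the real side effect; this sketch only reports the decision.
    return f"executed: {action}"


print(execute("delete environment"))
print(execute("delete environment", approved_by="oncall-engineer"))
print(execute("restart service"))
```

In this sketch the agent can still propose deleting an environment, but the proposal stalls until a named person approves it, which is roughly the pause a static tool imposes by default.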

Incident Log — AI-Linked Outages at Amazon

Oct 2025 — Amazon Q Developer involved in internal service disruption. Engineers reportedly let the AI agent resolve an issue without intervention. Amazon describes it as “coincidental.”

Dec 2025 — Kiro AI tool triggers 13-hour AWS Cost Explorer outage after deciding to delete and rebuild a customer-facing environment. Amazon calls it user error, not AI error.

Mar 5, 2026 — Amazon.com and the mobile app go offline for ~6 hours. 21,716 user outage reports at peak. Attributed to a faulty code deployment. AI involvement disputed by Amazon.

Mar 10, 2026 — Emergency TWiST meeting convened. New policy: junior and mid-level engineers must obtain senior sign-off before deploying any AI-assisted changes to production.

Speed Is the Product — And the Problem

AI coding tools do one thing exceptionally well: they generate plausible code, fast. That’s the entire design premise. Feed the model enough training data, and it predicts what the next function, script, or deployment instruction should look like. It doesn’t understand the system it’s touching. It pattern-matches against what similar systems have looked like before.

Most of the time, that’s fine. On Amazon’s infrastructure, “most of the time” isn’t good enough. Amazon runs one of the most interconnected stacks on the planet. A single misconfigured dependency — one wrong permission rule, one flawed deployment script — can cascade across dozens of services simultaneously. Engineers call this a high blast radius. Treadwell used that exact phrase in his internal memo, flagging it as the defining characteristic of the recent incidents. When AI generates code ten times faster than a human would, existing review processes can’t keep up. Teams skip them. The blast radius grows.
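To make "blast radius" concrete, here is a minimal Python sketch that counts the services transitively affected when one service changes. The service names and the dependency map are hypothetical, chosen only for illustration; nothing here reflects Amazon's actual architecture.

```python
from collections import deque

# Hypothetical map: service -> services that depend on it directly.
DEPENDENTS = {
    "permissions": ["checkout", "login"],
    "checkout": ["orders", "billing"],
    "login": ["orders"],
    "orders": ["notifications"],
    "billing": [],
    "notifications": [],
}


def blast_radius(changed: str, dependents: dict) -> set:
    """Return every service reachable downstream of `changed` (breadth-first search)."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        service = queue.popleft()
        for downstream in dependents.get(service, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    # The changed service itself is not counted as part of its own blast radius.
    return seen - {changed}


print(sorted(blast_radius("permissions", DEPENDENTS)))
print(sorted(blast_radius("billing", DEPENDENTS)))
```

In this toy graph, a change to the hypothetical "permissions" service reaches all five other services, while a change to "billing" reaches none. Reviewing a change without knowing which case applies is the risk the article describes.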

“GenAI tools supplementing or accelerating production change instructions, leading to unsafe practices.”
Dave Treadwell, SVP eCommerce Foundation, Amazon (internal memo via CNBC)

The Guardrail Nobody Wanted to Build

Amazon’s fix is blunt: add humans back into the loop. Junior and mid-level engineers now need senior sign-off before any AI-assisted change touches production. Treadwell called it “controlled friction.” The internal memo framed it as a temporary measure while more “durable solutions” — including both deterministic and agentic safeguards — get built.
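The sign-off rule as reported can be expressed as a simple deployment check. The sketch below is an assumption-laden illustration, not Amazon's tooling: the `Change` record, the level names, and `may_deploy` are hypothetical, and it models only the AI-assisted sign-off rule, not any other review step.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical level names standing in for "senior" engineers.
SENIOR_LEVELS = {"senior", "principal"}


@dataclass
class Change:
    author_level: str                     # e.g. "junior", "mid", "senior"
    ai_assisted: bool                     # whether an AI tool helped produce the change
    approver_level: Optional[str] = None  # level of the reviewer who signed off, if any


def may_deploy(change: Change) -> bool:
    """Block AI-assisted changes by non-senior authors unless a senior signed off."""
    if not change.ai_assisted:
        return True  # the rule in this sketch covers AI-assisted changes only
    if change.author_level in SENIOR_LEVELS:
        return True
    return change.approver_level in SENIOR_LEVELS


print(may_deploy(Change("junior", ai_assisted=True)))
print(may_deploy(Change("mid", ai_assisted=True, approver_level="senior")))
```

The "controlled friction" is visible in the last line of `may_deploy`: the change itself is untouched, but it cannot reach production until someone with more context has looked at it.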

The irony isn’t subtle. Awesome Agents noted that the executive now adding friction to AI deployments is the same one who mandated 80% Kiro adoption about four months earlier, in November 2025. Roughly 1,500 Amazon engineers had already protested the Kiro mandate internally, arguing that external tools like Claude Code outperformed it on real-world tasks like multi-language refactoring. VP-level exception requests were piling up. Now those engineers watch the mandated tool get linked to system outages, and the response is a review layer that partly undoes the speed benefit they were sold.

James Gosling — Java’s creator, who left AWS as a distinguished engineer in 2024 — saw this coming. The Register reported that after a major AWS outage last October, Gosling wrote on LinkedIn: “These systems are complex interconnected structures. Unless the whole ecosystem is comprehended in total, bad decisions are made.” Amazon’s layoffs — including the 14,000 cuts announced last October — gutted teams that didn’t directly generate revenue but kept the infrastructure stable. Those were exactly the people who understood the ecosystem in total.

Amazon Isn’t Alone Here

Every major tech company is running the same experiment. GitHub Copilot, Google Gemini Code Assist, Cursor, Replit’s AI agent, Claude Code — all of them push toward the same goal: fewer humans in the loop, faster execution. A January 2026 Stack Overflow survey found over 70% of professional developers now use AI coding tools weekly. The productivity case is real. Nobody’s disputing that.

What Amazon’s outages really expose is simple: AI writes code fast, but it doesn’t understand the system it’s touching. It doesn’t know the function it rewrote is called by four services at 3 AM. It doesn’t know the permission it granted cascades into billing. Pattern-matching isn’t architecture.

Microsoft CEO Satya Nadella said in April 2025 that AI now writes up to 30% of Microsoft’s code. Tom’s Hardware observed that by January 2026, Microsoft was quietly working to fix many of Windows 11’s reliability problems. The pattern holds: push AI hard, find the edges, add the guardrails you should have built first.

“The engineers let the AI agent resolve an issue without intervention. The outages were small, but entirely foreseeable.”
Senior AWS employee, via Financial Times

What the Workflow Actually Looks Like Now

The fully automated development pipeline — ship fast, fix later, trust the model — is getting revised. Not abandoned. Revised. What’s emerging looks less like human replacement and more like a new kind of collaboration: AI generates, a senior engineer audits, and production gets the filtered output. It’s slower than the original pitch. It’s faster than the old way. And compared to a six-hour outage on a global e-commerce platform, a slightly slower deployment cycle is easy math.

The harder problem is structural. Amazon deployed 21,000 AI agents across its Stores division and locked in a $2 billion cost-saving story. Those numbers are now politically embedded. Walking back the adoption pace means admitting the pace was wrong. So instead, Amazon is adding guardrails to a system already running at speed — fixing the plane mid-flight, as one engineer described it internally.

The tech industry has known for decades that software development isn’t just about writing code. It’s about understanding the system the code runs inside. AI handles the first part quickly and confidently. The second part still needs a human who’s been around long enough to know where the landmines are. Amazon just paid six hours of downtime to relearn that.

Orion Dax
