When Automation Becomes the Enemy: Lessons from the AWS DynamoDB Outage in October 2025

Last week, AWS dropped their postmortem for the massive October 19-20 outage—the one where DynamoDB went down and took out a good chunk of us-east-1 with it. I spent the weekend poring over this unusually transparent report, and figured I’d jot down some thoughts on what this incident says about running modern infrastructure.
The Irony of Advanced Automation
What hit me hardest wasn’t so much the failure itself, but the fact that AWS’s own high-end automation became the biggest enemy. The DynamoDB team built an elegant DNS management system—plenty of safety mechanisms, independent components, careful orchestration. And yet a rare race condition between two of its DNS Enactors (the report’s name for the components that apply DNS plans) turned that sophistication against itself, with the automation actually getting in the way of recovery.
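The report describes the mechanics better than I can, but the shape of the bug is worth internalizing, so here’s a deliberately toy Python model of the check-then-act race. Every name in it (DnsStore, apply_plan, gc_old_plans) is mine, not AWS’s: a delayed applier clobbers a newer plan, and an eager cleanup then deletes what it believes is stale.
```python
# Toy model of a check-then-act race between two plan appliers and a
# cleanup pass. Illustrative only; none of this mirrors AWS internals.

class DnsStore:
    def __init__(self):
        self.record = None        # live DNS answer for the endpoint
        self.applied_plan = None  # id of the plan the record came from

store = DnsStore()

def apply_plan(plan_id, addresses):
    # BUG: no check that plan_id is newer than store.applied_plan,
    # so a delayed applier can overwrite a newer plan with a stale one.
    store.record = addresses
    store.applied_plan = plan_id

def gc_old_plans(latest_plan_id):
    # Cleanup deletes anything older than the newest known plan,
    # including the live record if a stale plan just overwrote it.
    if store.applied_plan is not None and store.applied_plan < latest_plan_id:
        store.record = None
        store.applied_plan = None

# The interleaving that hurts:
apply_plan(2, ["10.0.0.2"])     # fast applier installs the new plan
apply_plan(1, ["10.0.0.1"])     # delayed applier wakes up with a stale plan
gc_old_plans(latest_plan_id=2)  # cleanup sees an "old" plan and removes it
print(store.record)             # None: an empty answer for a live endpoint
```
The textbook defense is a version check at apply time (compare-and-swap, so stale writers lose) plus a cleanup that refuses to touch whatever is currently serving traffic.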
Think about it. The very thing that’s supposed to guarantee high availability ends up blocking your way back to stability. Reminds me of an old distributed systems truth: complexity is the sworn enemy of reliability. Sometimes our cleverest solutions just hand us new flavors of trouble we never saw coming.
From my own years in large-scale ops, I’ve learned that automation without an escape hatch is basically a ticking time bomb. You need what I call a “break glass” protocol—a way for humans to step in and override the robots when things go sideways. AWS’s report mentions they had to intervene manually, but it’s obvious this wasn’t a muscle they’d exercised much. The recovery timeline tells the story.
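For what it’s worth, the pattern doesn’t have to be fancy. Here’s the minimal shape I mean, with a hypothetical file path standing in for whatever control channel you trust to survive the outage:
```python
# A minimal "break glass" check: every automated actuator consults a
# human-controlled switch before acting. A dumb file is deliberately
# low-tech so it still works when fancier control planes are down.

import os

KILL_SWITCH = "/etc/automation/break-glass"  # hypothetical path

def automation_enabled() -> bool:
    # Fail safe: if the switch file exists, a human has taken over.
    return not os.path.exists(KILL_SWITCH)

def apply_dns_change(change: str) -> None:
    if not automation_enabled():
        raise RuntimeError("break-glass engaged: refusing automated change")
    # ...the normal, fully automated path goes here...
    print(f"applying {change}")
```
The file is the boring part. The discipline is that every actuator checks it, and that you pull it during game days often enough that nobody hesitates to pull it during a real incident.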
Chain Reactions Nobody Saw Coming
Reading how a DNS hiccup in DynamoDB snowballed into a full-blown regional meltdown feels like a masterclass in cascading failure. First, DynamoDB tanks. Then DWFM (the DropletWorkflow Manager, which depends on DynamoDB) chokes, so EC2 can’t launch new instances. Next, Network Manager gets jammed up, unable to propagate configs, and starts queueing up tasks. NLB starts flapping because its health checks fail on instances that never got network configs. Finally, Lambda, ECS, EKS, and a whole parade of other services topple in turn.
What’s really worth noting is how the very act of recovery triggered new problems. When DWFM tried to reestablish leases with hundreds of thousands of droplets (the physical servers that host EC2 instances) all at once, it hit what the report calls a “congestion collapse”—basically a self-induced denial of service. I’ve seen this pattern in distributed-system recoveries before. Sometimes the cure really is worse than the disease.
The fix? Desperate, but kind of brilliant: they selectively rebooted DWFM hosts to clear the backlog and throttled incoming requests. It’s the distributed systems equivalent of “turn it off and on again,” but with surgical precision. Sometimes, the old tricks are the best tricks.
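If you’re building anything that reconnects en masse after an outage, the generic defenses look something like the sketch below. None of it reflects DWFM’s internals, and the rates and limits are placeholders:
```python
# Two generic guards against congestion collapse on mass reconnect:
# jittered backoff on the clients, a bounded admission rate on the server.

import random
import time

def jittered_backoff(attempt: int, base: float = 0.5, cap: float = 60.0) -> float:
    # "Full jitter": sleep a random amount up to an exponential ceiling,
    # so thousands of recovering hosts don't retry in lockstep.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

class AdmissionThrottle:
    # Token bucket: at most `rate` lease re-establishments per second,
    # with a small burst allowance, no matter how deep the backlog gets.
    def __init__(self, rate: float, burst: int):
        self.rate, self.capacity = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def try_admit(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should sleep jittered_backoff() and retry
```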
The Forgotten DNS Service
DNS outages are an especially nasty beast, since DNS is the bedrock of modern cloud architecture. We use it for service discovery, load balancing, traffic steering, failover—the works. But how many people ever stop to wonder what happens when DNS doesn’t just slow down or return errors, but returns an empty answer for a name that should exist?
In AWS’s world, even a single region’s DNS is wrangling hundreds of thousands of DynamoDB records. That’s par for the course at massive scale, but when it goes wrong, the blast radius is huge. The report mentions that even after the DNS records were restored at 2:25 AM, customers had to wait another 15 minutes for cache expiry before things came back. That’s the sinister side of DNS outages—even when you fix the root, you’re still at the mercy of distributed cache propagation.
I’ve started keeping a static IP fallback list for critical services in my own infrastructure work—not pretty, doesn’t scale like a dream, but when DNS bites it, sometimes those hardcoded numbers are your only lifeline. Sometimes you really do need a Plan B.
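In code, the lifeline looks something like this. The hostname and addresses are placeholders, and the real work is keeping the fallback table from going stale:
```python
# Resolve normally, but fall back to a hardcoded list when DNS returns
# nothing for a name that should exist. Entries here are placeholders.

import socket

FALLBACK = {
    "payments.internal.example.com": ["10.1.2.3", "10.1.2.4"],
}

def resolve_with_fallback(host: str) -> list[str]:
    try:
        infos = socket.getaddrinfo(host, None)
        addrs = sorted({info[4][0] for info in infos})
        if addrs:
            return addrs
    except socket.gaierror:
        pass  # resolution failed outright
    # DNS is broken or empty: use the hardcoded lifeline, if we have one.
    return FALLBACK.get(host, [])
```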
Health Checks Gone Wild
The NLB health check story here is a classic, and I’ve seen this pattern repeat in production too many times. Health checks are supposed to pull sick nodes out of the rotation, but what happens when the health check system itself gets confused about what “healthy” means?
In this case, NLB was checking instances that hadn’t received network configs yet. The instances themselves were fine, the load balancer was fine, but the health checks kept failing. So you get this endless churn—nodes getting yanked and re-added, services flapping. The health check subsystem, trying to do its job, just makes things worse.
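One cheap guard I’ve used, a rough cousin of what load balancers call “fail open”: if an implausible fraction of the fleet looks unhealthy at once, distrust the checker rather than the fleet. The threshold below is an arbitrary example:
```python
# Fail-open health aggregation: a mass failure is more likely a broken
# checker or config than half the fleet dying at the same instant.

def targets_to_remove(results: dict[str, bool], fail_open_ratio: float = 0.5) -> set[str]:
    unhealthy = {t for t, ok in results.items() if not ok}
    if results and len(unhealthy) / len(results) >= fail_open_ratio:
        # Keep everything in rotation and page a human instead of
        # letting the health checker drain the fleet.
        return set()
    return unhealthy
```
Velocity limits (caps on how fast targets can be pulled from rotation) are the other half of the same defense.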
Their fix was refreshingly pragmatic: at 9:36 AM, they turned off automatic failover. Sometimes, the best way to wrangle a misbehaving automated system is to make it less automatic. It takes guts to do that in production—it’s basically flying without a net—but it was the right call.
What Recovery Teaches at the Front Lines
One detail jumped out at me: the wildly different recovery timelines for each service. DynamoDB was mostly back by 2:40 AM, but EC2 didn’t fully recover until 1:50 PM. Lambda cleared its backlog at 6:00 AM, only to get knocked back at 7:04 AM when the NLB issues killed off part of its backend.
This uneven recovery rhythm shows teams working in parallel, each putting out their own fires. That has its perks, but it also means one team’s fix can accidentally step on another’s toes. When the NLB issues undid Lambda’s recovery, I’m guessing the Lambda folks weren’t thrilled.
The report also lifts the curtain on AWS’s internal dependencies—stuff customers rarely see. For example, Redshift calls the us-east-1 IAM API to resolve user groups in every region. These hidden cross-region dependencies are exactly what turns a local problem into a global one. I’d bet good money AWS is now running an audit to sniff out more of these anti-patterns.
Takeaways for the Rest of Us
So what’s in this for those of us running smaller ops? First up, admit that if even AWS can take a 14.5-hour hit, none of us are immune. The question isn’t if you’ll face a big incident, but whether you’re ready for it.
Start by mapping out your service dependencies. Really map them—not just the obvious ones, but the sneaky, hidden links. Does your service in Region A, for any reason, call out to Region B? Do you have “shared fate” services that secretly depend on the same underlying thing? These chains are where cascading failures breed.
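A dependency map you can’t query is a wall decoration, so here’s the query that matters. Given an edge list of who-calls-whom (toy data loosely echoing this incident’s chain, not AWS’s real graph), compute the blast radius of a single failure:
```python
# Transitive blast radius over a service dependency graph.

from collections import defaultdict

calls = [  # (service, dependency) pairs; illustrative only
    ("ec2-dwfm", "dynamodb"),
    ("ec2-launch", "ec2-dwfm"),
    ("network-manager", "ec2-launch"),
    ("nlb-health", "network-manager"),
    ("lambda", "nlb-health"),
]

# Invert the edges: who breaks if X breaks?
dependents = defaultdict(set)
for svc, dep in calls:
    dependents[dep].add(svc)

def blast_radius(failed: str) -> set[str]:
    seen, stack = set(), [failed]
    while stack:
        for nxt in dependents[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(blast_radius("dynamodb"))  # everything downstream of the root failure
```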
Next, think about your automation’s failure modes. Can your automation get stuck in a state that blocks recovery? Do you have manual override hooks? Have you actually tested them? I’m reminded of the old military saying: “No plan survives first contact with the enemy.” Unless you’ve built in escape hatches, your automation won’t survive a novel failure mode either.
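And “tested” can be as mundane as a unit test proving that the cord, when pulled, actually stops the machine. A minimal example, reusing the hypothetical break-glass check from earlier:
```python
# If the override isn't tested, it doesn't exist.

import os
import tempfile

def automation_enabled(switch_path: str) -> bool:
    return not os.path.exists(switch_path)

def test_break_glass_actually_blocks_automation():
    with tempfile.TemporaryDirectory() as d:
        switch = os.path.join(d, "break-glass")
        assert automation_enabled(switch)   # normal operation
        open(switch, "w").close()           # a human pulls the cord
        assert not automation_enabled(switch)
```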
Rethink your health checks and auto-recovery mechanisms. Are you willing to pull the plug when these systems go sideways? Do you have enough observability to spot when the cure is worse than the disease? Sometimes, it really is.
And finally, practice your incident response. AWS obviously has top-tier engineers and robust playbooks, but it still took 14.5 hours to fully recover. Part of that is sheer scale, but part is the weirdness of this specific failure mode. The race condition might have lurked for years, waiting for just the right combination. When it finally popped, the teams were learning as they went.
The Humbling Truth
This whole episode is a humbling reminder for anyone who works on infrastructure. It shows that even with world-class engineering, relentless automation, and nearly infinite resources, complex systems still fail in surprising ways. A race condition that’s been sleeping for years finally wakes up. Two DNS Enactors, running just a hair out of sync, end up deleting the very records they were meant to protect.
If there’s one thing I’ve learned from twenty years in ops, it’s that distributed systems always find new ways to break. Every incident moves another “unknown unknown” into the known column. Every postmortem teaches something nobody knew to worry about. AWS’s transparency with this kind of detail is a gift to the whole industry.
As I write this, thousands of engineers around the world are probably auditing their own DNS systems, health check configs, and service dependencies. They’re asking uncomfortable questions about whether their automation is hiding any sleeping dragons of its own. That’s how we get better as an industry—one painful lesson at a time.
Next time someone tells you that modern cloud infrastructure has “solved” reliability, point them to this incident. We’ve made mind-blowing progress, but complex systems always find new and creative ways to fail. Our job isn’t to prevent every outage, but to recover fast and learn from each one. In that sense, this AWS postmortem is a model of good ops—turning a painful outage into a learning moment for everyone.