
When the Cloud Breaks: Lessons from the AWS Outage

Sandra Murphy

Oct 27, 2025

Written by

Sandra Murphy

Sandra Murphy is a Product Marketing Manager on the Infrastructure Marketing Team who helps promote Akamai’s portfolio of security solutions. Her previous experience includes marketing and product management roles in the telecommunications industry.

Executive summary

  • On October 20, 2025, a major outage in Amazon Web Services (AWS) US-East-1 (Amazon’s largest and most strategically important cloud region) took down digital services for many major brands, including Netflix, Atlassian, Coinbase, Slack, and Expedia.

  • Engineering teams worldwide found themselves in an impossible position. Their monitoring dashboards showed failing services, and customer support channels flooded with complaints, but the AWS management consoles that they needed to diagnose and remediate the issues were either unreachable or displaying stale data. Manual failover procedures documented in runbooks assumed access to AWS APIs that had become unavailable. 

  • The failure exposed a critical architectural vulnerability: When a cloud provider's foundational service fails, even well-designed multi-availability zone (multi-AZ) architectures offer no protection.

  • This blog post explores what went wrong, why traditional resilience strategies failed, and how organizations must architect for cloud survivability, not cloud dependency.

AWS outage: What happened (as far as we know) and why it matters

The outage began late in the evening of October 19, 2025, when AWS's US-East-1 region suffered a failure in DNS management for DynamoDB, one of AWS's core database services. The issue was not triggered by an attack or network overload; it was caused by a latent defect inside AWS's own DNS management system.

AWS uses automated systems to manage DNS records for services like DynamoDB. These records ensure that applications know how to connect to the correct service endpoints. DynamoDB manages hundreds of thousands of DNS records across its infrastructure, and those records are constantly updated for availability and failover.

However, a rare timing issue between two redundant components caused the automation to delete the entries behind the DNS record for the DynamoDB regional endpoint, leaving it empty. Once that record was emptied, anything relying on DynamoDB in US-East-1 could no longer locate or connect to the service.
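
To make the failure mode concrete, here is a minimal Python sketch of what an emptied DNS record looks like from a client's point of view. The probe is illustrative only: it simply asks the operating system's resolver for the regional endpoint's addresses, and is not AWS tooling.

```python
# Minimal sketch: how an emptied or deleted DNS record appears to a client.
# The probe simply asks the system resolver for the endpoint's addresses.
import socket

def resolve_endpoint(hostname: str, port: int = 443):
    """Return the IP addresses a client would receive for a service endpoint."""
    try:
        results = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        return sorted({entry[4][0] for entry in results})
    except socket.gaierror as err:
        # With nothing behind the name, resolution fails before any connection
        # is even attempted, so every dependent call errors out here.
        print(f"Could not resolve {hostname}: {err}")
        return []

# DynamoDB's documented regional endpoint name for US-East-1.
print(resolve_endpoint("dynamodb.us-east-1.amazonaws.com"))
```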

Why DynamoDB matters to the broader AWS ecosystem

DynamoDB isn’t just a database inside AWS; it functions like a central coordination system and is used by dozens of internal platform services to track state, configuration, and operational metadata.

When the DNS failure made DynamoDB unreachable, it set off a cascade of failures.

First, EC2 orchestration systems stalled. Behind the scenes, EC2 stores information about every virtual machine instance — when it launches, where it lives, how it connects — in DynamoDB. With DynamoDB offline, EC2 couldn’t register new instances or update health status.

That failure then spilled into AWS network load balancers. These load balancers rely on EC2’s state to know which servers are healthy and available to receive traffic.

But with EC2 unable to update that state, health monitoring started to misfire and load balancers began flagging healthy servers as unavailable because their metadata couldn’t be refreshed. Capacity was removed from rotation not because it failed but because the control plane lost visibility.

As more services attempted to recover, they ran into the same brick wall: internal authentication systems, orchestration tools, global API endpoints — all of them depended on DynamoDB indirectly.

That’s how the outage spread so widely: a single DNS misconfiguration broke DynamoDB, which in turn impacted EC2 subsystems, which confused network routing and crippled dependent services across AWS.

The real lesson: Foundational service failures cascade unpredictably

Despite implementing multi-AZ architectures and following cloud best practices, many organizations run production workloads in a single AWS region: N. Virginia (US-East-1). When that region experiences failures, the impact radiates across the internet. The fundamental problem isn't the availability zone strategy — it's the assumption that a single cloud provider represents a resilient foundation. In reality, dependence on a single cloud provider is a single point of failure. 

Foundational service dependency: The invisible single point of failure

The AWS outage revealed a vulnerability that availability zone strategies cannot address: dependencies on foundational services whose failures cascade through the entire control plane. While organizations focus on protecting against infrastructure failures (servers crashing, disks failing, network links dropping), the AWS outage demonstrated that foundational service failures are an equally real risk and far more devastating when they occur.

DynamoDB's DNS failure didn't just affect applications using DynamoDB directly. It disabled the control plane systems that manage EC2 instances, authenticate users, process Lambda functions, and propagate network configurations. Traffic continued flowing to unreachable endpoints because the systems responsible for making routing decisions had lost access to their foundational data store.

Organizations discovered that their resilience strategies assumed foundational services would remain accessible. Health checks, auto-scaling, automated failover, even manual remediation through AWS consoles all depend on foundational services operating correctly. When they don't, resilience mechanisms become unavailable exactly when they're needed most.

What this means for your architecture

Traditional cloud resilience strategies assume foundational services like DNS, identity management, and core data stores remain functional during failures. This assumption is fundamentally flawed. Your architecture must include external monitoring and failover capabilities that operate independently of your cloud provider's foundational services.
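
As a simplified illustration, the following sketch shows the shape of an external health probe that uses nothing but the public endpoint and the Python standard library: no cloud provider SDKs, consoles, or control-plane APIs. The URL is a placeholder for your own environment.

```python
# Minimal external health probe: no cloud provider SDKs, consoles, or
# control-plane APIs involved, only the public endpoint itself.
import urllib.error
import urllib.request

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx/3xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError, OSError):
        # DNS failures, connection resets, and timeouts all count as unhealthy.
        return False

# Placeholder URL: point this at your own externally reachable health endpoint.
if not probe("https://app.example.com/healthz"):
    print("Primary endpoint unhealthy; hand off to external failover logic.")
```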

DNS: The underestimated cascading failure vector

DNS didn't cause the AWS outage directly, but it amplified the damage across every layer of infrastructure. When DynamoDB's DNS record was deleted, the cascading failure pattern touched every dependent system: service discovery stopped working, internal routing broke down, health monitoring systems lost visibility, and failover workflows couldn't execute.The lesson isn't that DNS failed in isolation. It's that DNS sits at a critical dependency layer where foundational service failures cascade into application-layer failures. 

Organizations treat DNS as solved infrastructure, assuming their cloud provider's DNS automation will function reliably. The AWS outage proved that even well-designed DNS automation systems can have latent defects that only manifest under specific conditions.

What makes DNS failures particularly dangerous is their silent propagation. AWS's DNS defect existed in production for an extended period but never manifested until specific timing conditions aligned. When it did trigger, the failure cascaded through dependent systems faster than automated recovery mechanisms could detect and respond.

What this means for your architecture

DNS must be treated as both a security layer and a resilience mechanism. You need continuous visibility into DNS infrastructure across all providers, the ability to detect misconfigurations and anomalies before they cascade, and DNS resolution that remains independent of any single provider's automation systems. 

If your DNS depends entirely on your cloud provider's infrastructure, you have a single point of failure that can disable your entire service ecosystem.
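
One way to start checking for that dependency is to look at where your zone's authoritative nameservers actually live. The sketch below is a rough heuristic, assuming the third-party dnspython package and a placeholder zone name; it is a starting point, not a complete audit.

```python
# Minimal sketch: check whether a zone's authoritative nameservers all belong
# to a single provider, which would make DNS itself a single point of failure.
# Assumes the third-party dnspython package (pip install dnspython).
import dns.resolver

def nameserver_providers(zone: str) -> set:
    """Return the distinct parent domains of the zone's NS records."""
    answers = dns.resolver.resolve(zone, "NS")
    # e.g. "ns-123.awsdns-45.com." -> "awsdns-45.com" (crude heuristic)
    return {".".join(str(r.target).rstrip(".").split(".")[-2:]) for r in answers}

providers = nameserver_providers("example.com")  # placeholder zone
print(providers)
if len(providers) < 2:
    print("All nameservers appear to sit with one provider: a single point of failure.")
```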

When failover doesn't fail over

Many resilience plans are designed for ideal scenarios, not failure-mode reality:

  • Organizations discover that their failover mechanisms are tightly coupled to the same AWS services that fail. 

  • DNS records propagate too slowly to enable rapid recovery. 

  • Backup regions are not tested under production load. 

  • Runbooks assume manual intervention would be possible — but teams can’t access the management consoles needed to execute those procedures because authentication systems depend on the failing DynamoDB service.

The AWS outage exposed the uncomfortable truth: Many organizations learned their "resilience strategy" was theoretical rather than operational. When the cloud provider's foundational services failed, the automated systems designed to maintain service continuity failed alongside them.

What this means for your architecture

If your failover depends on your cloud provider's infrastructure remaining accessible, you don't have failover — you have dependency. True resilience requires external control systems that can detect failures and reroute traffic without relying on the failing provider's APIs, consoles, or foundational services.
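
The sketch below shows what such an external control loop can look like. The health probe uses only the Python standard library, and update_dns_at_independent_provider is a hypothetical placeholder for whatever API your independent authoritative DNS or traffic management provider actually exposes.

```python
# Sketch of an external failover loop that never touches the failing provider's
# APIs or consoles. update_dns_at_independent_provider() is a hypothetical
# placeholder: wire it to your independent authoritative DNS or traffic
# management provider.
import time
import urllib.error
import urllib.request

PRIMARY_HEALTH_URL = "https://app.primary.example.com/healthz"  # placeholder
SECONDARY_TARGET = "app.secondary.example.net"                  # placeholder

def healthy(url: str, timeout: float = 5.0) -> bool:
    """Probe the public endpoint directly; any error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError, OSError):
        return False

def update_dns_at_independent_provider(record: str, target: str) -> None:
    """Hypothetical hook: repoint the record via a provider that is not the
    failing cloud (for example, an external authoritative DNS API)."""
    print(f"Would repoint {record} -> {target}")

consecutive_failures = 0
while True:
    consecutive_failures = 0 if healthy(PRIMARY_HEALTH_URL) else consecutive_failures + 1
    if consecutive_failures >= 3:  # require repeated failures to avoid flapping
        update_dns_at_independent_provider("app.example.com", SECONDARY_TARGET)
        break
    time.sleep(30)
```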

The blueprint: How modern teams engineer cloud survivability

Engineering for cloud survivability requires deliberate architectural decisions that prioritize optionality and independence. The organizations that survived the AWS outage with minimal impact share common patterns — patterns that reflect a fundamentally different approach to resilience.

Foundational service independence

The AWS outage demonstrated that resilience strategies must account for core service failures, not just infrastructure failures. Organizations need monitoring and failover logic that operates independently of their cloud provider's foundational services. This means:

  • External health monitoring that doesn't rely on cloud provider APIs or services that might depend on failing foundational infrastructure

  • Traffic management decisions that are made outside the failing cloud environment using independent control systems

  • Failover triggers that function even when cloud provider consoles are unreachable due to authentication failures

  • Observability that maintains visibility when cloud provider metrics depend on services experiencing cascading failure

When AWS’s DNS resolution for DynamoDB failed, teams couldn’t access management consoles. The only organizations that maintained service continuity had external monitoring and traffic management systems capable of detecting failures and rerouting traffic without depending on AWS’s infrastructure.

A critical question to consider: Can your monitoring systems detect a foundational service failure and trigger failover without accessing your cloud provider's APIs or dashboards? If your DNS resolution, authentication, or monitoring depends entirely on your cloud provider’s infrastructure, your resilience strategy has a critical visibility gap.

Resilient architecture by design

Resilience begins with the ability to move workloads when infrastructure fails. Organizations that deployed across multiple regions and multiple cloud providers maintained service continuity during the AWS outage while single-cloud architectures struggled.

Provider-neutral architectures (using container orchestration like Kubernetes) combined with external traffic management enabled these teams to detect the AWS outage and redirect users to healthy infrastructure across different cloud providers.

The critical test: If you can't redirect production traffic away from a failing cloud provider in five minutes or less, your architecture isn't resilient. It's captive.

DNS posture management as a security discipline

DNS must be treated as both a security layer and a resilience mechanism. Organizations need continuous visibility into their DNS infrastructure to identify stale records that create vulnerabilities, monitor configuration changes across the entire DNS estate, eliminate shadow DNS exposure, and ensure that name resolution remains independent of any single provider's automation systems.

The AWS outage demonstrated that DNS failures don't announce themselves with clear warnings. Latent defects in DNS automation can exist in production systems for extended periods, only manifesting under specific timing conditions. When they do trigger, they cascade silently through dependent systems until services stop responding. Proactive DNS posture management detects these risks before they trigger outages.

Organizations need answers to critical questions (a minimal audit sketch follows this list):

  • Which DNS records point to single cloud regions? 

  • Which services depend on DNS resolution paths that travel through a single provider's infrastructure? 

  • Where are the stale CNAME records that could become subdomain takeover risks during failover scenarios?

  • How quickly can you detect when a DNS record has been deleted or corrupted?
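
As a starting point for the first and third questions, here is a minimal audit sketch. It assumes the third-party dnspython package, and the record inventory and region heuristic are placeholders for your own zone data.

```python
# Minimal posture-audit sketch for two of the questions above: which records
# point at a single cloud region, and which CNAMEs are dangling (stale).
# Assumes the third-party dnspython package (pip install dnspython); the
# record list and region heuristic are placeholders for your own zone data.
import dns.resolver

RECORDS = ["api.example.com", "assets.example.com"]  # placeholder inventory

def cname_target(name: str):
    """Return the CNAME target for a name, or None if it has no CNAME."""
    try:
        for rdata in dns.resolver.resolve(name, "CNAME"):
            return str(rdata.target).rstrip(".")
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        return None
    return None

for name in RECORDS:
    target = cname_target(name)
    if target is None:
        continue
    # Crude heuristic: flag targets whose hostname embeds a single region.
    if "us-east-1" in target:
        print(f"{name} -> {target}: pinned to a single region")
    try:
        dns.resolver.resolve(target, "A")
    except dns.resolver.NXDOMAIN:
        # The CNAME exists but its target no longer resolves: a stale record
        # and a potential subdomain takeover risk.
        print(f"{name} -> {target}: dangling CNAME")
```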

Intelligent traffic management eliminates manual failover

Automated failover isn't optional — it's essential. Manual runbooks fail when teams can't access the management consoles needed to execute procedures. 

Effective traffic management requires health-based steering that validates endpoint availability before routing users, global failover capabilities that aren't constrained by regional boundaries, and intelligent routing based on both performance metrics and reachability data.

The AWS outage proved that organizations need traffic management systems that operate independently of the infrastructure they're routing around. When cloud provider APIs become unavailable, your traffic management system must continue functioning — by detecting failures, validating alternative endpoints, and steering users to healthy infrastructure without human intervention.
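
In simplified form, health-based steering looks something like the sketch below: probe every candidate endpoint, discard the unreachable ones, and prefer the fastest healthy responder. The endpoints are placeholders, and a production system would add hysteresis, weighting, and geographic awareness.

```python
# Simplified health-based steering: probe each candidate endpoint, drop the
# unreachable ones, and prefer the fastest healthy responder. The endpoints
# are placeholders; real systems add hysteresis, weighting, and geography.
import time
import urllib.error
import urllib.request

CANDIDATES = [
    "https://us.app.example.com/healthz",  # placeholder endpoints
    "https://eu.app.example.com/healthz",
]

def latency(url: str, timeout: float = 3.0):
    """Return response latency in seconds, or None if the endpoint is down."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            if 200 <= response.status < 400:
                return time.monotonic() - start
    except (urllib.error.URLError, TimeoutError, OSError):
        pass
    return None

healthy = {url: rtt for url in CANDIDATES if (rtt := latency(url)) is not None}
if healthy:
    best = min(healthy, key=healthy.get)
    print(f"Steer traffic to {best} ({healthy[best] * 1000:.0f} ms)")
else:
    print("No healthy endpoints; fail over or serve a degraded response.")
```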

Build infrastructure that survives cloud failures

Akamai's approach addresses the core vulnerabilities exposed by cloud outages. These capabilities provide the external control layer that organizations need when cloud provider infrastructure fails, including:

  • Furnishing multicloud routing 

  • Delivering always-on authoritative DNS 

  • Identifying misconfigurations before they become outages 

  • Enabling real-world failover without manual intervention

Furnishing multicloud routing

Akamai Global Traffic Management provides multicloud routing that intelligently directs traffic based on real-time availability and performance data. When a cloud region fails, Global Traffic Management detects the degradation and reroutes traffic to healthy infrastructure — all without depending on the failing provider's APIs or control systems.

Delivering always-on authoritative DNS

Akamai DNS delivers always-on authoritative DNS with massive anycast scale, ensuring name resolution continues even when cloud provider DNS fails. During the AWS outage, organizations using Akamai DNS maintained the ability to resolve service endpoints and reroute traffic, even as AWS’s internal DNS struggled.

Identifying misconfigurations before they become outages

Akamai DNS Posture Management identifies misconfigurations before they cascade into outages. Continuous monitoring across your entire DNS estate — regardless of provider — ensures stale records, improper delegations, risky configurations, and unexpected changes are detected and remediated before they become failure vectors during an incident.

Enabling real-world failover without manual intervention

Built-in health monitoring and automated routing enable real-world failover without manual intervention. Akamai's health checks validate endpoint availability independently of cloud provider health systems, ensuring that traffic management decisions are based on actual reachability rather than potentially stale provider data.

Akamai doesn't replace cloud providers. It provides the control layer that maintains business continuity when they fail; that is, the external, independent decision-making system that keeps your services reachable even when your cloud provider can’t. 

Adopting a multicloud fallback to an alternate cloud for critical workflows is part of a sound resilience, disaster recovery, and risk mitigation strategy.

Akamai Cloud offers various services that can serve as primary or secondary functions in your application stack. If AWS experiences an outage, your object storage/backup strategy should not be tied to a single cloud, as it was for many businesses that relied on AWS DynamoDB. With Akamai Object Storage, you gain an alternate cloud target with an S3-compatible interface. But you can’t outsource resilience.

Cloud outages aren't anomalies — they're operational realities that will only increase as digital infrastructure becomes more complex. The October 2025 AWS outage proved that core service failures expose architectural vulnerabilities that traditional resilience strategies don't address.

The question facing your security and platform leaders isn't whether your cloud provider will experience failures. It's whether your architecture can detect those failures and maintain business continuity when the cloud provider's own systems cannot.

Resilience isn't a product you purchase or a service you outsource. It's an architectural discipline you engineer and enforce across four critical layers: foundational service independence, cloud independence, DNS resilience, and intelligent traffic management. Organizations that survived the AWS outage with minimal impact had these capabilities before the failure occurred.

You can't prevent cloud providers from failing. But you can architect infrastructure that doesn't fail with them.

Get started

Building cloud-independent resilience doesn't have to be overwhelming. Contact your Akamai representative today to discuss how DNS Posture Management, Global Traffic Management, and Akamai DNS can strengthen your infrastructure against cloud failures and protect business continuity when major providers experience outages.
