We wanted to provide you with some additional information about the service disruption that occurred in the N. Virginia (us-east-1) Region on October 19 and 20, 2025.
Event Timeline:
Three Distinct Periods of Impact:
Impact Period: 11:48 PM PDT Oct 19 - 2:40 AM PDT Oct 20
Customers experienced increased Amazon DynamoDB API error rates in the N. Virginia (us-east-1) Region. During this period, customers and other AWS services with dependencies on DynamoDB were unable to establish new connections to the service.
The incident was triggered by a latent defect within the service's automated DNS management system that caused endpoint resolution failures for DynamoDB.
Many of the largest AWS services rely extensively on DNS to provide seamless scale, fault isolation and recovery, low latency, and locality. Services like DynamoDB maintain hundreds of thousands of DNS records to operate a very large heterogeneous fleet of load balancers in each Region.
Automation is crucial to ensuring that these DNS records are updated frequently to add additional capacity as it becomes available, to correctly handle hardware failures, and to efficiently distribute traffic to optimize customers' experience. This automation has been designed for resilience, allowing the service to recover from a wide variety of operational issues.
In addition to providing a public regional endpoint, this automation maintains additional DNS endpoints for several dynamic DynamoDB variants including:
The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service's regional endpoint (dynamodb.us-east-1.amazonaws.com) that the automation failed to repair.
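A race in which an older update lands after a newer one is a general hazard for any automation that applies changes from several workers. The sketch below illustrates one common guard: a version-checked apply that rejects stale plans and never publishes an empty record set. It is an illustration under stated assumptions, not a description of the DynamoDB DNS management system. `Plan`, `RecordStore`, `apply_plan`, and the endpoint name used in the usage note are hypothetical.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical DNS plan: a version number plus the addresses for one endpoint."""
    version: int
    endpoint: str
    addresses: tuple


class RecordStore:
    """Toy in-memory record store with a version-checked apply (assumption, not AWS's design)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}  # endpoint -> (version, addresses)

    def apply_plan(self, plan: Plan) -> bool:
        """Apply the plan only if it is non-empty and newer than the stored one."""
        if not plan.addresses:
            return False  # refuse to publish an empty record set
        with self._lock:
            current_version, _ = self._records.get(plan.endpoint, (-1, ()))
            if plan.version <= current_version:
                return False  # stale plan: a newer one has already been applied
            self._records[plan.endpoint] = (plan.version, plan.addresses)
            return True

    def resolve(self, endpoint: str) -> tuple:
        """Return the addresses currently stored for the endpoint (empty if none)."""
        with self._lock:
            return self._records.get(endpoint, (-1, ()))[1]
```

In this sketch, applying `Plan(2, "db.example.internal", ("10.0.0.2",))` and then a delayed `Plan(1, ...)` leaves version 2 in place, because the check and the write happen under the same lock.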
DNS Management System Architecture:
The system is split across two independent components for availability reasons:
Component 1: DNS Planner
Component 2: DNS Enactor
Normal Operation:
What Went Wrong:
Right before this event started, one DNS Enactor experienced unusually high delays because it needed to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening:
The timing of these events triggered the latent race condition:
11:48 PM PDT: All systems needing to connect to the DynamoDB service in the N. Virginia (us-east-1) Region via the public endpoint immediately began experiencing DNS failures and failed to connect to DynamoDB. This included:
Customers with DynamoDB global tables were able to successfully connect to and issue requests against their replica tables in other Regions, but experienced prolonged replication lag to and from the replica tables in the N. Virginia (us-east-1) Region.
12:38 AM (Oct 20): Our engineers had identified DynamoDB's DNS state as the source of the outage
1:15 AM: The temporary mitigations that were applied enabled some internal services to connect to DynamoDB and repaired key internal tooling that unblocked further recovery
2:25 AM: All DNS information was restored
2:32 AM: All global table replicas were fully caught up
2:25 AM - 2:40 AM: Customers were able to resolve the DynamoDB endpoint and establish successful connections as cached DNS records expired
This completed recovery from the primary service disruption event.
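The gap between DNS being restored at 2:25 AM and connections succeeding by 2:40 AM is what caching predicts: a client or resolver that cached the earlier answer keeps serving it until the entry expires. The following is a minimal sketch of that behavior. `TtlCache` and its parameters are hypothetical, and the TTL in the usage note is illustrative, not the TTL of any DynamoDB record.

```python
class TtlCache:
    """Toy resolver cache: serves a cached answer until its TTL elapses."""

    def __init__(self, lookup, ttl_seconds, clock):
        self._lookup = lookup      # function: name -> answer from the authoritative source
        self._ttl = ttl_seconds
        self._clock = clock        # function returning the current time in seconds
        self._entries = {}         # name -> (answer, expires_at)

    def resolve(self, name):
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]       # still fresh: the authoritative source is not consulted
        answer = self._lookup(name)
        self._entries[name] = (answer, now + self._ttl)
        return answer
```

With a 60-second TTL, an answer cached at time 0 is still returned at time 30 even if the authoritative data was fixed at time 1. The corrected answer is only seen once the entry expires at time 60.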
Impact Period: 11:48 PM PDT Oct 19 - 1:50 PM PDT Oct 20
Customers experienced increased EC2 API error rates, latencies, and instance launch failures in the N. Virginia (us-east-1) Region.
Important Note: Existing EC2 instances that had been launched prior to the start of the event remained healthy and did not experience any impact for the duration of the event.
After resolving the DynamoDB DNS issue at 2:25 AM PDT, customers continued to see increased errors for launches of new instances. Recovery started at 12:01 PM PDT with full EC2 recovery occurring at 1:50 PM PDT. During this period new instance launches failed with either a "request limit exceeded" or "insufficient capacity" error.
Key Subsystems:
1. DropletWorkflow Manager (DWFM)
2. Network Manager
11:48 PM PDT (Oct 19): DWFM state checks began to fail as the process depends on DynamoDB and was unable to complete.
While this did not affect any running EC2 instance, it did result in the droplet needing to establish a new lease with a DWFM before further instance state changes could happen for the EC2 instances it is hosting.
11:48 PM Oct 19 - 2:24 AM Oct 20: Leases between DWFM and droplets within the EC2 fleet slowly started to time out.
2:25 AM PDT: With the recovery of the DynamoDB APIs, DWFM began to re-establish leases with droplets across the EC2 fleet.
Since any droplet without an active lease is not considered a candidate for new EC2 launches, the EC2 APIs were returning "insufficient capacity" errors for new incoming EC2 launch requests.
However, due to the large number of droplets, each attempt to establish a new droplet lease took long enough that the work could not be completed before the attempt timed out. Additional work was then queued to reattempt establishing the droplet lease.
At this point, DWFM had entered a state of congestive collapse and was unable to make forward progress in recovering droplet leases.
Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues. After attempting multiple mitigation steps, at 4:14 AM engineers throttled incoming work and began selective restarts of DWFM hosts to recover from this situation.
Restarting the DWFM hosts cleared out the DWFM queues, reduced processing times, and allowed droplet leases to be established.
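The dynamic described above, where queued work times out before it is processed and is then queued again, can be reproduced in a small simulation. This is a simplified model under stated assumptions, not the DWFM implementation: a single worker, a fixed `service_time` per request, and a fixed `timeout` measured from when a request is queued. The function `recover_leases` and the `admit_limit` parameter are hypothetical.

```python
from collections import deque


def recover_leases(num_droplets, service_time, timeout, admit_limit=None, max_time=10_000):
    """Return how many leases are established by max_time in this toy model.

    Each queued request expires `timeout` after it was queued. The worker still
    spends `service_time` on an expired request (wasted work) and then retries it.
    `admit_limit` caps how many requests may be queued at once; the rest wait
    outside the queue with no timer running.
    """
    waiting = deque(range(num_droplets))  # droplets not yet admitted to the queue
    queue = deque()                       # (droplet, deadline) pairs
    now, established = 0, 0

    def admit():
        # Move droplets into the queue, starting their timeout, up to the limit.
        while waiting and (admit_limit is None or len(queue) < admit_limit):
            queue.append((waiting.popleft(), now + timeout))

    admit()
    while queue and now < max_time:
        droplet, deadline = queue.popleft()
        now += service_time
        if now <= deadline:
            established += 1              # finished in time: lease established
        else:
            waiting.append(droplet)       # timed out: the work is queued again
        admit()
    return established
```

With 100 droplets, a service time of 1, and a timeout of 10, the unthrottled run establishes only the first 10 leases and then makes no further progress, because every retried request waits behind about 90 others. Setting `admit_limit=5` keeps the wait below the timeout and all 100 leases are established. This mirrors, in a simplified way, why clearing queues and throttling incoming work helps in this kind of collapse.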
5:28 AM: DWFM had established leases with all droplets within the N. Virginia (us-east-1) Region and new launches were once again starting to succeed, although many requests were still seeing "request limit exceeded" errors due to the request throttling that had been introduced to reduce overall request load.
When a new EC2 instance is launched, Network Manager propagates the network configuration that allows the instance to communicate with other instances within the same Virtual Private Cloud (VPC), other VPC network appliances, and the Internet.
5:28 AM PDT: Shortly after the recovery of DWFM, Network Manager began propagating updated network configurations to newly launched instances and instances that had been terminated during the event.
Since these network propagation events had been delayed by the issue with DWFM, a significant backlog of network state propagations needed to be processed by Network Manager within the N. Virginia (us-east-1) Region.
6:21 AM: Network Manager started to experience increased latencies in network propagation times as it worked to process the backlog of network state changes.
While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation.
Engineers worked to reduce the load on Network Manager to address network configuration propagation times and took action to accelerate recovery.
10:36 AM: Network configuration propagation times had returned to normal levels, and new EC2 instance launches were once again operating normally.
The final step towards EC2 recovery was to fully remove the request throttles that had been put in place to reduce the load on the various EC2 subsystems. As API calls and new EC2 instance launch requests stabilized:
11:23 AM PDT: Our engineers began relaxing request throttles as they worked towards full recovery
1:50 PM: All EC2 APIs and new EC2 instance launches were operating normally
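Request throttles of the kind described in this section are often built as token buckets, where a rejected request corresponds to an error such as "request limit exceeded" and relaxing the throttle means raising the refill rate. The sketch below is a generic illustration, not the EC2 throttling implementation. `TokenBucket`, `set_rate`, and `try_acquire` are hypothetical names.

```python
class TokenBucket:
    """Toy request throttle with an adjustable refill rate."""

    def __init__(self, rate_per_second, burst, clock):
        self.rate = rate_per_second
        self.burst = burst
        self._clock = clock            # function returning the current time in seconds
        self._tokens = float(burst)
        self._last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at the burst size.
        now = self._clock()
        self._tokens = min(self.burst, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def set_rate(self, rate_per_second):
        """Relax or tighten the throttle; time already elapsed is credited at the old rate."""
        self._refill()
        self.rate = rate_per_second

    def try_acquire(self):
        """Return True if the request is admitted, False if it should be rejected."""
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

Raising the rate in steps with `set_rate`, rather than removing the limit at once, lets operators watch downstream load between steps.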
Impact Period: 5:30 AM - 2:09 PM PDT Oct 20
Some customers experienced increased connection errors on their NLBs in the N. Virginia (us-east-1) Region.
NLB is built on top of a highly scalable, multi-tenant architecture that:
The delays in network state propagations for newly launched EC2 instances caused impact to the Network Load Balancer (NLB) service and AWS services that use NLB.
During the event the NLB health checking subsystem began to experience increased health check failures. This was caused by the health checking subsystem bringing new EC2 instances into service while the network state for those instances had not yet fully propagated.
This meant that in some cases health checks would fail even though the underlying NLB node and backend targets were healthy. As a result, health checks alternated between failing and healthy, so NLB nodes and backend targets were removed from DNS, only to be returned to service when the next health check succeeded.
6:52 AM: Our monitoring systems detected this, and engineers began working to remediate the issue
The alternating health check results increased the load on the health check subsystem, causing it to degrade. This delayed health checks and triggered automatic AZ DNS failover. For multi-AZ load balancers, this resulted in capacity being taken out of service.
In this case, an application experienced increased connection errors if the remaining healthy capacity was insufficient to carry the application load.
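Alternating health check results are commonly damped by requiring several consecutive results before a state change. The sketch below shows that general technique only. It does not describe how the NLB health checking subsystem works or what was changed after this event. `DampedHealth` and its thresholds are hypothetical.

```python
class DampedHealth:
    """Toy health state that changes only after consecutive disagreeing results."""

    def __init__(self, fail_threshold=3, pass_threshold=2):
        self.fail_threshold = fail_threshold
        self.pass_threshold = pass_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that disagree with the current state

    def observe(self, check_passed: bool) -> bool:
        """Record one health check result and return the current state."""
        if check_passed == self.healthy:
            self._streak = 0  # result agrees with the current state: reset the streak
            return self.healthy
        self._streak += 1
        needed = self.fail_threshold if self.healthy else self.pass_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy
```

With these thresholds, a target whose checks alternate between failing and passing stays in service, while three consecutive failures still take it out.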
9:36 AM: Engineers disabled automatic health check failovers for NLB, allowing all available healthy NLB nodes and backend targets to be brought back into service. This resolved the increased connection errors to affected load balancers.
2:09 PM: Shortly after EC2 recovered, we re-enabled automatic DNS health check failover
Impact Period: Oct 19 at 11:51 PM PDT - Oct 20 at 2:15 PM PDT
Customers experienced API errors and latencies for Lambda functions in the N. Virginia (us-east-1) Region.
Initial Impact:
2:24 AM: Service operations recovered except for SQS queue processing, which remained impacted because an internal subsystem responsible for polling SQS queues failed and did not recover automatically
4:40 AM: We restored this subsystem
6:00 AM: Processed all message backlogs
7:04 AM: NLB health check failures triggered instance terminations, leaving a subset of Lambda internal systems under-scaled. With EC2 launches still impaired, we throttled Lambda Event Source Mappings and asynchronous invocations to prioritize latency-sensitive synchronous invocations
11:27 AM: Sufficient capacity was restored, and errors subsided
2:15 PM: We gradually reduced throttling and processed all backlogs, and normal service operations resumed
Impact Period: Oct 19 at 11:45 PM PDT - Oct 20 at 2:20 PM PDT
Customers experienced container launch failures and cluster scaling delays across:
2:20 PM: These services were recovered
Impact Period: Oct 19 at 11:56 PM PDT - Oct 20 at 1:20 PM PDT
Amazon Connect customers experienced elevated errors handling calls, chats, and cases in the N. Virginia (us-east-1) Region.
Following the restoration of DynamoDB endpoints, most Connect features recovered, but customers continued to experience elevated errors for chats until 5:00 AM.
7:04 AM: Customers again experienced increased errors handling new calls, chats, tasks, emails, and cases, which was caused by:
Customer Impact:
1:20 PM: Service availability was restored as Lambda function invocation errors recovered
Impact Period: Oct 19 at 11:51 PM PDT - Oct 20 at 9:59 AM PDT
Customers experienced AWS Security Token Service (STS) API errors and latency in the N. Virginia (us-east-1) Region.
1:19 AM: STS recovered after the restoration of internal DynamoDB endpoints
8:31 AM - 9:59 AM: STS API error rates and latency increased again as a result of NLB health check failures
9:59 AM: We recovered from the NLB health check failures, and the service began normal operations
Impact Period: Oct 19 at 11:51 PM PDT - Oct 20 at 1:25 AM PDT
AWS customers attempting to sign into the AWS Management Console using an IAM user experienced increased authentication failures due to underlying DynamoDB issues in the N. Virginia (us-east-1) Region.
Additional Impact:
1:25 AM: As DynamoDB endpoints became accessible, the service began normal operations
Impact Period: Oct 19 at 11:47 PM PDT - Oct 20 at 2:21 AM PDT (initial), extended impact until Oct 21 at 4:05 AM PDT
Customers experienced API errors when creating and modifying Redshift clusters or issuing queries against existing clusters in the N. Virginia (us-east-1) Region.
Redshift query processing relies on DynamoDB endpoints to read and write data from clusters. As DynamoDB endpoints recovered, Redshift query operations resumed.
2:21 AM: Redshift customers were successfully querying clusters as well as creating and modifying cluster configurations
Extended Impact: Some Redshift compute clusters remained impaired and unavailable for querying after the DynamoDB endpoints were restored to normal operations.
When credentials for cluster nodes expire without being refreshed, Redshift automation triggers workflows to replace the underlying EC2 hosts with new instances. With EC2 launches impaired, these workflows were blocked, putting clusters in a "modifying" state that prevented query processing and made them unavailable for workloads.
6:45 AM: Our engineers took action to stop the workflow backlog from growing
2:46 PM: When Redshift clusters started to launch replacement instances, the backlog of workflows began draining
4:05 AM PDT Oct 21: AWS operators completed restoring availability for clusters impaired by replacement workflows
Additional Global Impact (Oct 19 at 11:47 PM - Oct 20 at 1:20 AM):
Other AWS services that rely on DynamoDB, new EC2 instance launches, Lambda invocations, and Fargate task launches were also impacted in the N. Virginia (us-east-1) Region, including:
We are making several changes as a result of this operational event:
As we continue to work through the details of this event across all AWS services, we will look for additional ways to:
We apologize for the impact this event caused our customers. While we have a strong track record of operating our services with the highest levels of availability, we know how critical our services are to our customers, their applications and end users, and their businesses.
We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.