Friday, March 18, 2011

How to waste a Friday

An unusually verbose writeup on the network issues, and what they will do about them, at status.aws.amazon.com:

Repeated here for posterity (E-BS is a story for another day, but for now): 

From 7:28pm PDT to 9:56pm PDT, a networking issue affected connectivity to a significant number of instances in the US-EAST-1 region. Affected instances experienced degraded network connectivity to the Internet and to instances in other availability zones.

The root cause of last night's issue was when a core network routing device experienced a partial failure. While the router was causing packet loss, the failure was not detected by surrounding network devices and therefore they did not automatically fail traffic over to redundant network paths as intended.

Additionally, our network monitoring tools failed to help our network operators locate the specific source of the connectivity issues. Once our networking team determined the location of the impact, they were able to identify the failing router and manually failed traffic routes away from it. At this point, all affected instances regained full connectivity.

We will be completely replacing the failed network device and our team will work on the failed device to understand the source of the failure. More importantly, we will be working to understand why our network monitoring did not allow our team to quickly isolate the problem and force the manual failover to redundant network routes. We rely on this monitoring to help us deal with partial failures which defeat the normal redundancy built into high availability network architectures. We understand the impact this event had on some of our users, this just took us too long to figure out, and will be intensely focused on improving our monitoring and addressing the root cause of this failure.
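
An aside that is not part of Amazon's statement: the "partial failure" they describe is easy to sketch. A device that still answers most probes looks alive to a yes/no liveness check, even while it drops enough traffic to hurt. The Python below is only an illustration of that gap. The `LossMonitor` class, the 20-probe window, and the 5% loss threshold are all made up for the example and say nothing about how AWS's routers or monitoring actually work.

```python
from collections import deque


class LossMonitor:
    """Hypothetical helper: tracks probe replies over a sliding window."""

    def __init__(self, window=20, max_loss=0.05):
        self.results = deque(maxlen=window)  # True = probe answered
        self.max_loss = max_loss             # assumed tolerable loss rate

    def record(self, replied):
        self.results.append(bool(replied))

    def loss_rate(self):
        if not self.results:
            return 0.0
        return 1.0 - sum(self.results) / len(self.results)

    def is_alive(self):
        # Binary liveness: any reply in the window counts as "up".
        return any(self.results)

    def should_fail_over(self):
        # Loss-aware check: only decide once the window is full,
        # then fail over when the loss rate exceeds the threshold.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.loss_rate() > self.max_loss


# Simulate a router that silently drops every fourth probe.
monitor = LossMonitor(window=20, max_loss=0.05)
for i in range(20):
    monitor.record(i % 4 != 0)

print(monitor.is_alive())          # liveness check sees nothing wrong
print(monitor.loss_rate())         # fraction of probes lost in the window
print(monitor.should_fail_over())  # loss-aware check flags the device
```

The liveness check keeps reporting the device as up, while the loss-rate check says to route around it. That is roughly the difference between the redundancy that did not kick in and the manual failover the operators eventually forced.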


Posted via email from Michael's posterous