Monday, December 19, 2011
I've got 99 problems
Sunday, November 20, 2011
Errors and Exceptions at OSDC
Sunday, October 30, 2011
Google+ a place for interesting discussions
Sunday, October 16, 2011
Resilient processes (and Erlang)
Monday, May 30, 2011
Elasticness and Clouds
Friday, March 18, 2011
How to waste a Friday
From 7:28pm PDT to 9:56pm PDT, a networking issue affected connectivity to a significant number of instances in the US-EAST-1 region. Affected instances experienced degraded network connectivity to the Internet and to instances in other availability zones.
The root cause of last night's issue was a partial failure of a core network routing device. Although the router was causing packet loss, the surrounding network devices did not detect the failure and therefore did not automatically fail traffic over to redundant network paths as intended.
Additionally, our network monitoring tools failed to help our network operators locate the specific source of the connectivity issues. Once our networking team determined the location of the impact, they were able to identify the failing router and manually fail traffic routes away from it. At this point, all affected instances regained full connectivity.
We will be completely replacing the failed network device, and our team will examine it to understand the source of the failure. More importantly, we will be working to understand why our network monitoring did not allow our team to quickly isolate the problem and force the manual failover to redundant network routes. We rely on this monitoring to help us deal with partial failures, which defeat the normal redundancy built into high availability network architectures. We understand the impact this event had on some of our users. This simply took us too long to figure out, and we will be intensely focused on improving our monitoring and addressing the root cause of this failure.
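
To make the "partial failure" point concrete, here is a minimal Python sketch of why a device that drops only some packets can pass a simple up/down check while still being unhealthy. The `LossWindow` class, the window size of 20 probes, and the 5% loss threshold are all hypothetical choices made for illustration; nothing here is meant to describe how Amazon's monitoring or routers actually work.

```python
from collections import deque


class LossWindow:
    """Tracks the outcome of the most recent probes sent through one device.

    Hypothetical helper for illustration only.
    """

    def __init__(self, size=20):
        # True = probe answered, False = probe lost. Old results fall off the end.
        self.results = deque(maxlen=size)

    def record(self, answered):
        self.results.append(answered)

    def is_alive(self):
        # Binary liveness check: passes if any recent probe got through.
        return any(self.results)

    def loss_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def is_healthy(self, max_loss=0.05):
        # Loss-aware check: fails once the loss rate exceeds the threshold.
        return self.loss_rate() <= max_loss


# Simulate a router that drops every third packet.
window = LossWindow()
for i in range(30):
    window.record(i % 3 != 0)

print(window.is_alive())    # True: the device still answers most probes
print(window.is_healthy())  # False: 30% loss is well over the 5% threshold
```

A neighbour that only asks "is it alive?" would keep sending traffic to this device, whereas one that tracks loss over a window has grounds to fail over. The right window size and threshold would depend on the network in question.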
