The last thing an ecommerce operation wants to hear just before Black Friday is “AWS is down.” But that’s exactly what happened in the early morning hours of Wednesday, November 25, 2020, the day before Thanksgiving and two days before the first major shopping day of the holiday season. The outage impacted thousands of third-party online services and wasn’t fully resolved until 10:23 p.m. PST.
Amazon Web Services (AWS) said the outage occurred in its Northern Virginia (US-East-1) region. It was triggered by a “small addition of capacity” to the front-end fleet of its Amazon Kinesis service, which caused all of the servers in that fleet to exceed the maximum number of threads allowed by the operating system configuration.
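AWS’s public postmortem described each front-end server as maintaining one OS thread per peer in the fleet, so per-server thread count grows with fleet size until it crosses an OS limit. That failure mode can be sketched in a few lines; the numbers below are purely illustrative, since actual fleet sizes and limits are not public.

```python
# Illustrative figures only; the real Kinesis fleet size and OS
# thread ceiling have not been published.
OS_THREAD_LIMIT = 10_000  # hypothetical per-process thread ceiling


def threads_per_server(fleet_size: int) -> int:
    """Each front-end server keeps one thread per peer in the fleet,
    the pattern AWS described in its postmortem."""
    return fleet_size - 1


def capacity_addition_is_safe(current_fleet: int, added_servers: int) -> bool:
    """Check whether adding servers keeps every server under the limit.

    The addition itself is small, but it raises the thread count on
    EVERY server in the fleet at once, which is why a 'small addition
    of capacity' can take the whole front end down.
    """
    return threads_per_server(current_fleet + added_servers) <= OS_THREAD_LIMIT
```

For example, with the hypothetical limit above, growing a 9,000-server fleet by 500 stays safe, while the same addition to a 10,000-server fleet pushes every server past the ceiling simultaneously.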
In this episode of Cloud Talk, three Rackspace Technology experts discuss this outage, why it occurred, its impact on businesses and what companies can do to avoid being impacted by future outages. Host Jeff DeVerter, CTO, is joined by Myles Anderson, Vice President of Professional Services, and Ethan Schumann, Senior Manager of Architecture and Engineering.
“Service outages are pretty rare events, but when they happen they can be devastating,” said Anderson. “But AWS is a world-class organization, and they reacted quickly to fix this outage.”
During this episode, the host and guests cover a lot of ground on the topic, including:
- Understanding how AWS functions as a services company serving thousands
- Steps organizations can take to avoid becoming a victim of the next major service outage
- How the nature of disaster recovery has changed and why data loss is no longer as relevant
- Why organizations no longer only make architecture changes on nights and weekends
- Why planning for redundancy and availability requires a Zero Day approach
- Why you need to determine the ROI of building highly available systems
One of the key questions after the outage, DeVerter noted, was “why did it take out so many companies?”
The answer speaks to “one of the lesser-known facts about how AWS operates,” said Schumann. “AWS runs its own services on its own services. So, it manages its own services the same way it manages its customers’ services. Anytime one of its intrinsic backbone services, like Kinesis, goes down, it has a rippling effect across every customer and a lot of other ancillary services.”
Another factor is that AWS operates within 22 zones across the country. “When you start deploying a solution, you’re beginning with core services,” said Schumann. “They reside within a geographic region. Inside regions, you have Availability Zones. Building redundancy and reliability inside of an Availability Zone is pretty easy and straightforward.
“Where things get trickier is in trying to avoid the failure of a service that supports an entire region. That requires having solutions available in other regions and being able to fail over to them pretty quickly. However, this could involve replicating data and solutions in more than one region — and there’s no easy button for that.”
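The cross-region failover Schumann describes can be sketched as a priority-ordered retry across regional endpoints. This is a hypothetical simulation, not AWS code: the region names are real AWS region identifiers, but `call_region` is a stand-in for a real service call, and the hard part he flags — keeping data replicated so the secondary region can actually serve the request — is assumed away here.

```python
# Priority order: primary region first, then the failover region.
REGIONS = ["us-east-1", "us-west-2"]


class RegionDownError(Exception):
    """Raised when a region's endpoint is unavailable."""


def call_region(region: str, healthy_regions: set[str], payload: str) -> str:
    """Stand-in for a request to a regional service endpoint."""
    if region not in healthy_regions:
        raise RegionDownError(region)
    return f"handled in {region}: {payload}"


def call_with_failover(payload: str, healthy_regions: set[str]) -> str:
    """Try each region in priority order, failing over on error."""
    last_err = None
    for region in REGIONS:
        try:
            return call_region(region, healthy_regions, payload)
        except RegionDownError as err:
            last_err = err
    raise RuntimeError("all regions unavailable") from last_err
```

With US-East-1 down, `call_with_failover("order-123", healthy_regions={"us-west-2"})` is served from us-west-2 — but only because the sketch assumes the request can be handled there, which in practice requires the data replication Schumann says has no easy button.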
Some organizations might question the timing of AWS’ Kinesis upgrade. After all, updates and changes to data center infrastructure used to be scheduled for the middle of the night.
“It was a gutsy move to make a major change in the middle of the day,” said Anderson. “But that move is very much a product of AWS’ culture. They are about modernizing, including adding capacity to improve availability and increasing their velocity whenever they need to. The November update was supposed to be just another of the hundreds of routine capacity changes they successfully make every day without anyone noticing.”