On February 28th, there was a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced: a massive internet outage on the East Coast of the US. It turned out to be a problem with Amazon's hosting services. Amazon doesn't just sell books; it provides internet services for many other companies and websites. They published a post-mortem of the event, and it turns out it was a simple command-line entry that was typed wrong. From Amazon Web Services:
Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
And the oopsie:
S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.
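The post-mortem describes a removal command that accepted a fat-fingered input and took out far more capacity than intended. One common safeguard against exactly this class of mistake is a sanity check that refuses to remove more than a small fraction of a fleet in one command. This is a hypothetical sketch, not AWS's actual tooling; the function name, fleet sizes, and 5% threshold are all illustrative assumptions:

```python
# Hypothetical safeguard: cap how much capacity one command can remove.
# This is NOT AWS's tooling -- just an illustration of the kind of
# check their post-mortem implies was missing.

MAX_REMOVAL_FRACTION = 0.05  # assumed limit: never remove more than 5% at once


def servers_to_remove(requested, fleet_size, max_fraction=MAX_REMOVAL_FRACTION):
    """Validate a removal request against the size of the fleet."""
    limit = int(fleet_size * max_fraction)
    if len(requested) > limit:
        raise ValueError(
            f"refusing to remove {len(requested)} of {fleet_size} servers; "
            f"limit is {limit}"
        )
    return requested


# A typo that requests 500 of 1000 servers is rejected outright:
try:
    servers_to_remove(list(range(500)), fleet_size=1000)
except ValueError as err:
    print(err)
```

AWS's own follow-up to the incident was along these lines: they said they modified the tool to remove capacity more slowly and to block requests that would take capacity below a safe minimum.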
They goofed. It is one thing to design and build a system; it is another to regularly test it against an absolute worst-case scenario. I have run into this several times with clients and at various jobs I have had. They have a backup system that seems to run properly, but they have never run a backup, installed a brand-new hard drive (keeping the old one for after the test), and restored from that backup. Most of the time, the restore fails. This is very eye-opening to the client, and it allows me to charge them a lot of money to fix their system. The three crucial things for computer backup are:
- Test
- Test
- Test
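The restore drill above can be sketched as a script: take a backup, restore it to a completely separate location (the stand-in for that brand-new hard drive), and verify the result byte for byte. This is a minimal illustration using temp directories and a plain archive; a real test would use your actual backup software and a spare drive:

```python
# Minimal restore test: back up, restore to a *fresh* location,
# and verify the restored copy matches the original.
import filecmp
import shutil
import tempfile
from pathlib import Path

src = Path(tempfile.mkdtemp())                    # stands in for live data
backup_base = Path(tempfile.mkdtemp()) / "backup" # backup destination
restore = Path(tempfile.mkdtemp()) / "restored"   # the brand-new drive

(src / "file.txt").write_text("important data")

# 1. Run a backup (a gzipped tar here; your tool may differ).
archive = shutil.make_archive(str(backup_base), "gztar", root_dir=src)

# 2. Restore onto the fresh location -- never over the original.
restore.mkdir(parents=True)
shutil.unpack_archive(archive, restore)

# 3. Verify: an untested backup is not a backup.
match, mismatch, errors = filecmp.cmpfiles(src, restore, ["file.txt"])
assert match == ["file.txt"] and not mismatch and not errors
print("restore verified")
```

The point of restoring to a separate location is the same as keeping the old hard drive: if the restore fails, the original data is still intact.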