A little oopsie - Amazon

| No Comments

On February 28th, there was a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were suddenly silenced a massive internet outage on the East Coast of the US. Turns out it was a problem at Amazon's Hosting services. Amazon doesn't just sell books - they provide internet services for a lot of other companies and websites. They published a post-mortem of the event and it turns out, it was a simple command-line that was entered wrong. From Amazon Web Services:

Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

And the oopsie:

S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.

They goofed. It is one thing to design and build a system. it is another to regularly test from an absolute worst-case scenario. I have run into this several times with clients and at various jobs I have had. They have a backup system that seems to run properly but they have never run a backup, installed a brand new hard drive (keeping the old one for after the test) and restored from their backup. Most of the times, the restore fails. This is very eye-opening to the client and it allows me to charge them a lot of money to fix their system. The three crucial things for computer backup are:

    1. Test
    2. Test
    3. Test

Leave a comment

October 2022

Sun Mon Tue Wed Thu Fri Sat
            1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31          

Environment and Climate
AccuWeather
Cliff Mass Weather Blog
Climate Depot
Ice Age Now
ICECAP
Jennifer Marohasy
Solar Cycle 24
Space Weather
Watts Up With That?


Science and Medicine
Junk Science
Life in the Fast Lane
Luboš Motl
Medgadget
Next Big Future
PhysOrg.com


Geek Stuff
Ars Technica
Boing Boing
Don Lancaster's Guru's Lair
Evil Mad Scientist Laboratories
FAIL Blog
Hack a Day
Kevin Kelly - Cool Tools
Neatorama
Slashdot: News for nerds
The Register
The Daily WTF


Comics
Achewood
The Argyle Sweater
Chip Bok
Broadside Cartoons
Day by Day
Dilbert
Medium Large
Michael Ramirez
Prickly City
Tundra
User Friendly
Vexarr
What The Duck
Wondermark
xkcd


NO WAI! WTF?¿?¿
Awkward Family Photos
Cake Wrecks
Not Always Right
Sober in a Nightclub
You Drive What?


Business and Economics
The Austrian Economists
Carpe Diem
Coyote Blog


Photography and Art
Digital Photography Review
DIYPhotography
James Gurney
Joe McNally's Blog
PetaPixel
photo.net
Shorpy
Strobist
The Online Photographer


Blogrolling
A Western Heart
AMCGLTD.COM
American Digest
The AnarchAngel
Anti-Idiotarian Rottweiler
Babalu Blog
Belmont Club
Bayou Renaissance Man
Classical Values
Cobb
Cold Fury
David Limbaugh
Defense Technology
Doug Ross @ Journal
Grouchy Old Cripple
Instapundit
iowahawk
Irons in the Fire
James Lileks
Lowering the Bar
Maggie's Farm
Marginal Revolution
Michael J. Totten
Mostly Cajun
Neanderpundit
neo-neocon
Power Line
ProfessorBainbridge.com
Questions and Observations
Rachel Lucas
Roger L. Simon
Samizdata.net
Sense of Events
Sound Politics
The Strata-Sphere
The Smallest Minority
The Volokh Conspiracy
Tim Blair
Velociworld
Weasel Zippers
WILLisms.com
Wizbang


Gone but not Forgotten...
A Coyote at the Dog Show
Bad Eagle
Steven DenBeste
democrats give conservatives indigestion
Allah
BigPictureSmallOffice
Cox and Forkum
The Diplomad
Priorities & Frivolities
Gut Rumbles
Mean Mr. Mustard 2.0
MegaPundit
Masamune
Neptunus Lex
Other Side of Kim
Publicola
Ramblings' Journal
Sgt. Stryker
shining full plate and a good broadsword
A Physicist's Perspective
The Daily Demarche
Wayne's Online Newsletter

About this Entry

This page contains a single entry by DaveH published on March 7, 2017 10:16 PM.

We The People in the news again - Target was the previous entry in this blog.

Millennial International is the next entry in this blog.

Find recent content on the main index or look in the archives to find all content.

Monthly Archives

Pages

OpenID accepted here Learn more about OpenID
Powered by Movable Type 5.2.9