Coding bungle caused massive AWS outage

Janie Parker
March 3, 2017

Amazon has finally revealed the cause of the lengthy outage that disrupted dozens of internet services for hours, and it's pretty embarrassing.

The engineer meant to take offline a small subset of servers for debugging, but the command instead took down a much larger group of servers.
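The mechanism described above can be illustrated with a minimal sketch. The function name, fleet layout, and numbers below are hypothetical, not taken from Amazon's actual tooling; the point is simply how a single mistyped argument widens the blast radius of a maintenance command.

```python
# Hypothetical sketch: a mistyped argument to a maintenance command
# selects far more servers than intended. All names and numbers here
# are illustrative assumptions, not Amazon's real tooling.

def select_servers(fleet, count):
    """Pick `count` servers from the fleet to take offline for debugging."""
    return fleet[:count]

fleet = [f"s3-server-{i}" for i in range(1000)]

# Intended: take a small subset offline.
intended = select_servers(fleet, 5)

# Actual: a typo in the count argument grabs a much larger group.
actual = select_servers(fleet, 500)

print(len(intended))  # 5
print(len(actual))    # 500
```

With no upper bound on the request, nothing distinguishes the typo from a legitimate large removal, which is exactly the gap Amazon's post mortem identifies.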

While Amazon's S3, shorthand for Simple Storage Service, can handle losing a few servers, the sheer number switched off led to chaos.

In its post mortem of the incident published today, AWS revealed the bungle occurred during debugging of a problem in its S3 billing system. While S3 was down, a variety of other Amazon web services stopped functioning, including Amazon's Elastic Compute Cloud (EC2), which is also popular with internet companies that need to rapidly scale their computing capacity.

While Amazon's cloud service health dashboard gave no indication of trouble, yesterday morning AWS noted on its Twitter account that S3 was "experiencing high error rates" that the company was working to recover. To make a long story short, the engineer's error took down some crucial underlying subsystems, which removed a significant amount of storage capacity, which in turn forced those systems to restart. The Amazon subsidiary also detailed steps it's taking to prevent similar outages in the future.


As a result, AWS was forced to perform a full restart of the affected systems, during which time S3 was unavailable.

Because many S3 subsystems depend on one another to work properly, the mistake caused a cascade of outages.

"In this instance, the tool used allowed too much capacity to be removed too quickly", Amazon said. The company says it has added "safeguards" to prevent its systems from being taken completely offline and is "reprioritizing" work to improve the recovery time of offline systems. Widespread adoption of AWS also increases the likelihood that problems with one service can have sweeping ramifications online.
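A safeguard of the kind Amazon describes can be sketched as a clamp on each removal request: never let active capacity fall below a safety floor, and cap how much any single operation may remove. The function name, floor, and per-operation cap below are illustrative assumptions, not Amazon's actual implementation.

```python
# Hypothetical sketch of a capacity-removal safeguard: refuse to drop
# below a safety floor and cap removals per operation. Names and
# thresholds are illustrative assumptions only.

MIN_ACTIVE_FRACTION = 0.9   # never drop below 90% of the fleet
MAX_REMOVAL_PER_OP = 10     # cap how many servers one command may remove

def safe_remove(fleet_size, active, requested):
    """Return how many servers may actually be taken offline."""
    # Clamp to the per-operation cap.
    allowed = min(requested, MAX_REMOVAL_PER_OP)
    # Then clamp so active capacity never falls below the floor.
    floor = int(fleet_size * MIN_ACTIVE_FRACTION)
    allowed = min(allowed, max(0, active - floor))
    return allowed

# A request to pull 500 of 1000 servers is clamped to the per-op cap.
print(safe_remove(1000, 1000, 500))  # 10
# Near the floor, even a small request is reduced to zero.
print(safe_remove(1000, 900, 3))     # 0
```

The design choice is that both limits apply independently: the per-operation cap slows down fat-fingered commands, while the floor guarantees the service keeps enough capacity to stay up regardless of how many commands are issued.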

The company will also prioritize a plan to partition the S3 index subsystem, which was originally scheduled for later this year. "We will do everything we can to learn from this event and use it to improve our availability even further". Amazon added that the Service Health Dashboard (SHD) provides important visibility to customers during operational events, and it has changed the SHD administration console to run across multiple AWS regions.
