What can we learn from the latest AWS failure?
By Andy Ormsby
14 Aug 2011
Category: Business Insights
Amazon Web Services, which is the cloud as far as many people are concerned, has suffered another outage. On August 7th, a power failure in Dublin took out the power for one of the Amazon availability zones in the EU-West region. The backup power system failed too, which meant that, rather than making a seamless transition from mains power to the backup, servers went down hard.
Amazon was not the only one affected. Microsoft also runs BPOS (Business Productivity Online Services - hosted Exchange, Office Communications and Live Meeting, amongst others) from Dublin, and it was taken out as well. It's perhaps unfortunate that Microsoft touts the service as offering "all-day every-day reliability".
Both Amazon and Microsoft suffered significant outages with several hours of downtime while the backup power was restored.
In Amazon's case, though, EBS (Elastic Block Store) was hit particularly badly. EBS is great in principle: it provides block-level storage that you can use with virtual machines in the AWS cloud. Unlike "instance" storage local to the virtual machines, it persists, preserving data independently of any individual virtual machine instance. It is the nearest thing that AWS has to the kind of shared block storage that enterprise IT customers are familiar with. EBS also provides some of the features that enterprise storage people love, such as snapshots. Snapshots give you a point-in-time backup from which you can restore easily. As Amazon helpfully explains, "In the unlikely event that your Amazon EBS volume does fail, all snapshots of that volume will remain intact, and will allow you to recreate your volume from the last snapshot point."
As with the other infrastructure in Amazon's data centre, EBS servers were taken out by the power outage, and in some cases the storage volumes they support have taken days rather than hours to recover. Recovery was still ongoing for some customers on August 10th, three days after the initial event. And somewhere in that process, a bug in Amazon's software deleted parts of some snapshots. So some customers have had their backups destroyed and have lost data. Not good.
James Watters @wattersjames rather cruelly tweeted "You tried to use EBS as a cheap SAN didn't you…you know what you got? That's right, a really cheap SAN. Enjoy your Yugo…"
Well, other SANs not commonly compared to Yugos have their problems too. Last year, ZDNet reported [http://www.zdnet.com/blog/btl/virginias-it-outage-doesnt-pass-management-sniff-test/38609] on an outage allegedly caused by a failure of an EMC SAN that took out the State of Virginia's IT for a week. That these kinds of failures are rare is beside the point; they happen.
What can we learn from this?
Making services highly available and disaster resistant is hard, whether those services are in the cloud or not. Relying on a single data centre is a really bad idea, regardless of whether the kit in that data centre is owned by you or someone else.
Amazon customers need to think carefully about what the claim that "Amazon Elastic Block Store provides highly available, highly reliable storage volumes" actually means in practice, particularly as there appears to be no SLA for EBS.
For example, Amazon says that "EBS volume data is replicated across multiple servers in an Availability Zone to prevent loss of data from the failure of any single component". So what happens if multiple components fail?
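To make that question concrete, here is a rough back-of-the-envelope sketch in Python. The function name and every probability in it are made up for illustration; none of the figures come from Amazon. It compares the chance of losing every replica when servers only fail independently with the chance when a single zone-wide event, such as a power failure, can hit all of them at once.

```python
# Illustrative numbers only: none of these figures come from Amazon.

def p_all_replicas_lost(p_server: float, replicas: int,
                        p_zone_event: float = 0.0) -> float:
    """Chance of losing every replica in some fixed period.

    p_server:     chance that any one server fails on its own.
    replicas:     number of copies kept inside the same zone.
    p_zone_event: chance of a zone-wide event (for example a power
                  failure) that affects all replicas at once.
    """
    # All replicas failing independently of each other.
    independent = p_server ** replicas
    # Either the zone-wide event happens, or it does not and all the
    # replicas happen to fail independently.
    return p_zone_event + (1.0 - p_zone_event) * independent


if __name__ == "__main__":
    print(p_all_replicas_lost(0.01, 2))
    print(p_all_replicas_lost(0.01, 2, p_zone_event=0.001))
```

With these invented inputs, even a small chance of a zone-wide event dominates the result, which is why replication inside a single zone guards against single-component failures but not against a shared one.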
Does this mean the cloud is doomed? No, of course not, but a lot of people I speak to are thinking carefully about what they will do next. In the cloud, as with everything else, you get what you pay for. True high availability takes some effort.
If you are serious about availability, take your suppliers' claims of reliability with a pinch of salt. Read your service providers' service level agreements carefully.
Design for failure: assume failures will occur, and test your ability to deal with them.
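As a minimal sketch of what that can look like in code, the Python below tries a list of sites in order and falls back to the next one when a site fails. All the names here (call_with_failover, failed_site, healthy_site, AllSitesFailed) are hypothetical and stand in for real service calls; this is an illustration of the idea, not a description of any AWS API.

```python
from typing import Callable, List, Sequence


class AllSitesFailed(Exception):
    """Raised when no site could serve the request."""


def call_with_failover(sites: Sequence[Callable[[], str]]) -> str:
    """Try each site in order and return the first successful response."""
    errors: List[Exception] = []
    for site in sites:
        try:
            return site()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllSitesFailed(errors)


def failed_site() -> str:
    # Simulates a data centre that has lost power.
    raise ConnectionError("primary site unreachable")


def healthy_site() -> str:
    # Simulates a second, independent site that is still running.
    return "ok from secondary"


if __name__ == "__main__":
    # Deliberately put the broken site first to exercise the failover path.
    print(call_with_failover([failed_site, healthy_site]))
```

The useful part is the stand-in for a dead site: injecting a failure on purpose is how you find out whether the fallback path works before a real outage forces the question.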