Amazon S3 was down once again today (Sunday, July 20, at around 12pm EST). “Elevated error rates” began pouring in from the S3 service for the United States and Europe. Several sites reported problems, including outages and missing images for content stored on S3.
Center Networks began posting updates on the S3 downtime to its website as they appeared on the AWS Service Health Dashboard. Amazon’s Simple Storage Service, a major component of its online computing services, began reporting rising error rates. At 2pm EST, Amazon reported that the problem appeared to be an “issue with the communication between several Amazon S3 internal components.”
Several web businesses began to experience severe problems. SmugMug, a photo sharing website, reported that several of its photos and videos were offline during the downtime. Twitter ran normally, but without any images (as if Twitter didn’t have enough problems on its hands). Allen Stern’s blog also reported that images were missing while S3 was down.
Fortunately, by 4pm EST Amazon had reported that it was beginning to restore communication between additional hosts. By 7:30pm EST, the system was partially restored.
According to ReadWriteWeb, some of the people affected by the downtime were generously forgiving of the technical problem. Allen Stern said, in short, that Amazon had invested a lot of money in this new technology and that it is going to take some time for them to get it right.
“The truth is that we cannot do it better than Amazon. They spent a massive amount of money, talent, and most importantly time, trying to solve this problem. To think that this can be replicated by a startup in a matter of months, assembled, be cost effective, and work properly is just absurd.”
SmugMug expressed similar sentiments and empathized with Amazon’s problems.
“Every component SmugMug has ever used, whether it’s networking providers, datacenter providers, software, servers, storage, or even people, has let us down at one point or another. It’s the name of the game, and our job is to handle these problems and outages as best we can.”
Dealing with internal problems is one thing. However, when a problem has the potential to ruin your business overnight, and it keeps happening over and over again, is it worth it?
One of the greatest things about Amazon S3 is that it offers users strong computing power, storage, and infrastructure, saving time and money for start-up businesses and power users. But for something that’s supposed to be up and running 99% of the time, does it really make sense to build a business worth thousands or millions of dollars on something that’s already been down twice in the same year? What do you think?