Setting Up a Fault-Tolerant Site with Amazon’s Availability Zones


Amazon’s Availability Zones are a fabulous new feature that lets users assign instances to locations that are isolated from one another’s failures yet connected by high-bandwidth links. I wish I could have done something like that as easily when I was responsible for operations at Citrix Online and we had five data centers worldwide. As I’ll explain in this post, what Amazon actually provides is much better than just putting servers into multiple data centers.

The most confusing thing about Availability Zones is the name. In the cloud, what exactly is an “Availability Zone”? The easiest way to think about it is that a zone equals a data center. If power goes out in one data center and the generators fail to start (naah, that never happens), then it doesn’t affect the other data center. Or if there’s a fire, one data center may burn out or be otherwise incapacitated, but the others are unaffected. In reality, zones don’t necessarily correspond to data centers. Given careful engineering, it’s possible to have multiple “rooms” in a data center that are highly failure-isolated while technically still being part of the same data center; imagine rooms the size of football fields here.

The point of Availability Zones is the following: If I launch a server in zone A and a second server in zone B, then the probability that both go down at the same time due to an external event is extremely small. This simple property allows us to construct highly reliable web services by placing servers into multiple zones such that the failure of one zone doesn’t disrupt the service, or at the very least, allows us to rapidly reconstruct the service in the second zone.

The one caveat to consider when using multiple zones is that there is no free lunch. First of all there’s the speed of light. The zones Amazon is exposing are all on the East Coast (indicated by the names, such as “us-east-1a”). I don’t have inside information about the location of their facilities, but I imagine some may be in New York and others may be in Virginia, so the distance between zones may be considerable, thus translating into some network latency. And even if the actual facilities used by EC2 today are not that far apart, they may be someday in the future.

The second gotcha is that bandwidth across zone boundaries is not free. Amazon is charging $0.01/GB for what it calls “regional” traffic. This is less than 1/10th the cost of Internet traffic, which seems perfectly reasonable to me. Back when I was managing multiple data centers, the cost of traffic between them was essentially the same as the cost of random Internet traffic. Actually, it cost twice as much: once to exit one data center and once to enter the other. (Granted, at high volume one can do interesting things to save some money, but it doesn’t become free by a long shot.)

An Example

Let’s see how a simple redundant website looks with Availability Zones and elastic IPs. At the core we’ll have two web servers with Apache and PHP running the web application and accessing the master database. All this occurs in one zone. We’ll allocate two elastic IP addresses that we assign to the two web servers and then create a round-robin DNS entry for our website that maps the domain name to the two IP addresses.

Fault Tolerance with availability zones img1

To ensure the survival of the data in the case of a massive failure, we start a slave database in a second Availability Zone and replicate the data in real time. This is how we’ve set up all our customers to date, except that up until now we haven’t been able to specify the placement of the slave with respect to the master. In the RightScale Dashboard the zone of each server is shown and at server launch time one can select the desired zone.

Now suppose the zone with the web servers and database fails due to a fire. After receiving an alert, we first promote the slave in the second zone to master using the RightScale Manager for MySQL automation. We then launch fresh web/app servers in the same zone as the slave database. Once the promotion completes and the two new servers are up, it is a simple matter of reassigning the elastic IPs to the two new servers to redirect all the users to the new servers, and we’re up and running again.

Fault Tolerance with availability zones img2
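The recovery sequence described above can be sketched as a small in-memory model. This is only an illustration of the ordering of the steps, not RightScale or EC2 code: the `Deployment` class, the `fail_over` function, the zone name "us-east-1b", and the documentation-range IP addresses are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Deployment:
    """Toy model of the example site (hypothetical, not a real API)."""
    master_zone: str
    slave_zone: Optional[str]
    # Elastic IP -> zone of the web server it currently points at.
    elastic_ips: Dict[str, str] = field(default_factory=dict)


def fail_over(dep: Deployment, failed_zone: str) -> List[str]:
    """Return the ordered recovery steps and update the model to match."""
    if dep.master_zone != failed_zone:
        return []  # the primary zone is healthy, nothing to do
    if dep.slave_zone is None:
        raise RuntimeError("no slave database to promote")
    target = dep.slave_zone
    steps = [f"promote slave in {target} to master"]
    dep.master_zone, dep.slave_zone = target, None
    # One fresh web/app server per elastic IP, next to the new master.
    for _ in dep.elastic_ips:
        steps.append(f"launch web server in {target}")
    # Only once the servers are up do the elastic IPs move over.
    for ip in sorted(dep.elastic_ips):
        dep.elastic_ips[ip] = target
        steps.append(f"reassign {ip} to new server in {target}")
    return steps


site = Deployment(
    master_zone="us-east-1a",
    slave_zone="us-east-1b",  # assumed name for the second zone
    elastic_ips={"203.0.113.10": "us-east-1a", "203.0.113.11": "us-east-1a"},
)
for step in fail_over(site, "us-east-1a"):
    print(step)
```

Note that after the failover the model has no slave at all (`slave_zone` is `None`), which is exactly the loss of redundancy that has to be repaired next.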

The next step is to recreate the redundancy. For this the third Availability Zone that each account has access to comes into play. We start a fresh database slave in the third zone, again using the automation in the Manager for MySQL. Once that comes up and starts replicating we are back to having a redundant setup.

Fault Tolerance with availability zones img4

If you have never tried to set something like this up yourself – renting colo space, purchasing bandwidth, buying and installing servers – you really can’t appreciate the amount of capital expense, time, headache, and ongoing expense saved by EC2’s features. And best of all, using RightScale, it’s just a couple of clicks away :-).

Beyond the Simple Redundant Setup

You probably noticed that the site described above goes down if the primary zone fails, and that bringing it back up requires manually launching new servers. Part of this is easily remedied by placing one or more web servers in the secondary zone and having them talk to the master DB across the zone boundary. The performance of these servers may be slightly lower due to the inter-zone latency, and there is some cost to the database access traffic. How these trade-offs play out is somewhat application-dependent.

A more sophisticated setup uses load balancers to reduce the impact of the cross-site traffic. The idea is to place one load balancer instance in each zone and route the requests primarily to a set of redundant web/app servers in the primary zone, as shown in the figure below. A third app server can be running in the secondary zone and perhaps get a trickle of traffic from the load balancers just to keep it “warm.” Keeping it warm makes it easy to monitor and ensure that it’s operating properly.

Fault Tolerance with availability zones img3
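To make the “trickle of traffic” idea concrete, here is a minimal sketch of weighted backend selection as a load balancer might do it. The server names, the 49/49/2 weights, and both helper functions are assumptions chosen for illustration; a real load balancer would be configured rather than coded this way.

```python
import random
from typing import Dict


def pick_backend(weights: Dict[str, int], rng: random.Random) -> str:
    """Choose one backend at random, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def without_zone(weights: Dict[str, int], zones: Dict[str, str],
                 failed_zone: str) -> Dict[str, int]:
    """Drop every backend that lives in the failed zone."""
    return {n: w for n, w in weights.items() if zones[n] != failed_zone}


# Hypothetical layout: two app servers in the primary zone and one
# "warm" server in the secondary zone that gets about 2% of requests.
zones = {"app1": "primary", "app2": "primary", "app3": "secondary"}
weights = {"app1": 49, "app2": 49, "app3": 2}

rng = random.Random(0)
picks = [pick_backend(weights, rng) for _ in range(10_000)]
print("share sent to warm server:", picks.count("app3") / len(picks))

# If the primary zone fails, only the warm server remains in rotation.
print("after primary failure:", without_zone(weights, zones, "primary"))
```

The small but nonzero weight is the point: the warm server sees enough real requests that monitoring can tell whether it works, without moving much traffic across the zone boundary.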

The good thing about this setup is that the only traffic shipped across the zone boundary is the traffic that comes into the second load balancer. With round-robin DNS splitting requests evenly between the two load balancers, that is half the total Internet traffic, so half the traffic carries a $0.01/GB surcharge. Since regional traffic costs less than 1/10th as much as Internet traffic, this results in less than 5% extra bandwidth cost overall. (This is not counting the DB replication traffic.) Also, the extra latency from one zone to the other is negligible when compared to the already incurred Internet latency.
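The arithmetic behind that estimate is simple enough to write down. The sketch below assumes an even split between the two load balancers and treats the Internet price as a parameter; the $0.17/GB figure in the usage lines is a placeholder for illustration, not a quoted Amazon price, and the function name is made up.

```python
def cross_zone_surcharge(internet_price_per_gb: float,
                         regional_price_per_gb: float = 0.01,
                         cross_fraction: float = 0.5) -> float:
    """Extra bandwidth cost as a fraction of the Internet bandwidth bill.

    cross_fraction is the share of Internet traffic that also crosses
    the zone boundary (one half with two evenly loaded balancers).
    """
    return cross_fraction * regional_price_per_gb / internet_price_per_gb


# At exactly ten times the regional price the surcharge is 5%;
# any higher Internet price pushes it below 5%.
print(f"{cross_zone_surcharge(0.10):.1%}")
print(f"{cross_zone_surcharge(0.17):.1%}")  # assumed price, for illustration
```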

In the case of a primary zone failure, browsers will fail over to the load balancer in the remaining zone: when round-robin DNS returns several addresses and one does not respond, web browsers try the next one. The load balancer will direct all traffic to the third web/app server. At that point the secondary database needs to be promoted to master and the third app server repointed to that database; then everything will be back up and running. With automation the DB promotion could be done automatically, but it’s better to be conservative; a promotion due to a false alert could cause a lot of harm.
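The browser behavior this relies on amounts to walking the list of addresses returned by DNS until one answers. The sketch below is a rough approximation only: actual browsers differ in timeouts and retry order, and the `first_reachable` helper, the injected `can_connect` probe, and the documentation-range addresses are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def first_reachable(addresses: Iterable[str],
                    can_connect: Callable[[str], bool]) -> Optional[str]:
    """Return the first address that accepts a connection, or None."""
    for addr in addresses:
        if can_connect(addr):
            return addr
    return None


# Two elastic IPs behind one round-robin DNS name (made-up addresses).
dns_answers = ["203.0.113.10", "203.0.113.11"]

# Simulate the primary zone's load balancer being unreachable.
down = {"203.0.113.10"}
print(first_reachable(dns_answers, lambda addr: addr not in down))
```

Passing the connection probe in as a function keeps the sketch testable without a network; a real client would attempt a TCP connection with a timeout at that point.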

This second setup is a bit more complicated than the previous one, but it requires less machinery and no server launches in the case of a failure. It needs only one extra machine, assuming that each load balancer can run on the same instance as a web/app server (typically not a problem). Many more variants on this basic setup are clearly possible and should be considered on a case-by-case basis.

It’s mind-boggling how much power Amazon is giving us in designing sophisticated distributed redundant Internet services! In combination, the Availability Zones, the elastic IPs, and the overall programmatic control over all the resources make the cloud a superior environment for deploying sophisticated Internet services. At RightScale we’re hard at work incorporating the new features into our standard deployment templates so all of our customers can take advantage of them in their deployments. We’re also automating a number of the failure scenarios so that you don’t need to have an alert wake you up if there is a fire at Amazon in the middle of the night.