HA/DR and Fault Tolerance with AWS and RightScale

In a session at RightScale Compute last month, Amazon Web Services Solutions Architect Miles Ward talked about architecting for high availability and fault tolerance using AWS and RightScale. He noted that faults can originate in facilities, hardware, networking, code, and people, and defined fault tolerance as the ability of a system to continue operating properly, though perhaps at a degraded level, when one or more components fail. But fault tolerance has to be automated, and that automation has to be tested, so you know to what extent and under what conditions you really are fault tolerant.

Ward said fault tolerance is not binary – there are degrees of risk mitigation and different levels you can implement. Every organization has to assess the value of its applications to decide the level of risk reduction it wants to apply.

All organizations, by using the cloud, gain the advantages of no up-front capital expense and a relatively low-cost, self-service infrastructure that’s easy to scale up and down. Users also pay only for what they use. All of this leads to improved agility and faster time to market, and the same benefits apply to fault tolerance in a cloud infrastructure.

Many of AWS’ services, including S3 and Elastic Load Balancing (ELB), are inherently fault tolerant, while others, such as Elastic Compute Cloud (EC2), Virtual Private Cloud (VPC), and Elastic Block Store (EBS), are only as fault tolerant as AWS customers architect them to be. Customers can build in redundancy by geographic area, facility, deployment, resources, and more. Ward also noted that RightScale is especially valuable in managing those types of fault-tolerant configurations.

Ward talked about the recovery time objective (RTO) – the time period within which service must be restored to meet business continuity planning objectives – and the recovery point objective (RPO) – the amount of data loss that is acceptable when recovering from a disaster or catastrophic event. For example, an RPO of five minutes means data must be replicated or backed up at least that often, while an RTO of one hour bounds how long detection, failover, and restore can take. The two may be at odds, so the goal is to find the right RTO/RPO balance, and cost is a huge factor in that decision. Application owners must balance the cost and complexity of HA efforts against the risks they are willing to bear.

Best practices for HA that Ward suggested include avoiding single points of failure, using at least two Availability Zones (AZs), replicating data across AZs, backing up and replicating across regions for failover and disaster recovery, and setting up monitoring and alerts to automate problem resolution and failover operations. He advocated designing for failure: Use DNS to support multiple load balancers that send traffic to multiple app servers, backed by a replicated master/slave database setup with backups in S3, all spread across two AZs (see the sketch below).
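As a rough illustration of that multi-AZ pattern (the talk itself doesn’t include code), here is a minimal sketch using Python and boto3; the load balancer name, zones, and instance IDs are placeholder assumptions:

```python
import boto3

# Classic Elastic Load Balancer spanning two Availability Zones.
elb = boto3.client("elb", region_name="us-east-1")

elb.create_load_balancer(
    LoadBalancerName="web-tier",  # hypothetical name
    Listeners=[{"Protocol": "HTTP",
                "LoadBalancerPort": 80,
                "InstancePort": 80}],
    AvailabilityZones=["us-east-1a", "us-east-1b"],  # at least two AZs
)

# Register app servers from both AZs so either zone can carry the traffic.
elb.register_instances_with_load_balancer(
    LoadBalancerName="web-tier",
    Instances=[{"InstanceId": "i-0123456789abcdef0"},   # placeholder IDs
               {"InstanceId": "i-0fedcba9876543210"}],
)
```

Health checks and autoscaling would normally round this out, but the two-AZ spread is the core of the pattern.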

One tool Ward highlighted was a then-new EC2 VPC feature: Elastic Network Interfaces (ENIs), which can carry as many as 16 IP addresses and let an instance participate in multiple networks. ENIs let you move a virtual NIC – along with its private IP addresses and security groups – from one instance to another, which makes them a handy building block for failover.
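To illustrate that failover move (again a hypothetical boto3 sketch, with placeholder IDs), detaching the interface from a failed instance and reattaching it to a standby might look like this:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Detach the virtual NIC from the failed instance. The call is
# asynchronous; in practice you would poll describe_network_interfaces
# until the ENI reports "available" before reattaching.
ec2.detach_network_interface(AttachmentId="eni-attach-0123456789abcdef0")

# Attach it to the standby. The ENI's private IP addresses and security
# groups travel with it, so clients keep talking to the same endpoint.
ec2.attach_network_interface(
    NetworkInterfaceId="eni-0123456789abcdef0",
    InstanceId="i-0fedcba9876543210",  # the standby instance
    DeviceIndex=1,                     # becomes eth1 on the standby
)
```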

He also introduced a new tool called Storage Gateway that is designed to promote data availability. Storage Gateway runs on-premises, connects to your application servers via iSCSI, and replicates as much as 150TB of local data to S3. The data is stored as EBS snapshots, so you can easily restore it to another on-premises server, or you can create a new EC2 instance and get immediate access to the data in the cloud.
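Because the gateway backups surface as ordinary EBS snapshots, recovering in the cloud is the standard snapshot-to-volume flow. A minimal boto3 sketch, assuming placeholder snapshot and instance IDs:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Turn a Storage Gateway backup (an EBS snapshot) into a fresh volume.
vol = ec2.create_volume(
    SnapshotId="snap-0123456789abcdef0",
    AvailabilityZone="us-east-1a",  # must match the target instance's AZ
)

# Wait until the volume is ready, then hand it to a recovery instance.
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])
ec2.attach_volume(
    VolumeId=vol["VolumeId"],
    InstanceId="i-0fedcba9876543210",  # placeholder recovery instance
    Device="/dev/sdf",
)
```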

Ward said that to mitigate risks, organizations should assess each application and define its target RTO and RPO. Design for failure, starting with the application architecture. When you implement, factor in cost, complexity, and risk. Automating the processes is critical – and again, that’s a place where RightScale shines. You can use dynamic DNS for your database servers with a low time-to-live (TTL) to allow rapid changeover if a database server fails, and set up automatic connection of your app servers to your load balancers so that failover requires no manual intervention and no DNS modifications. You can also automate promotion of a database slave to a master, but Ward recommends testing that automation vigorously (a sketch of the DNS changeover follows below).
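Ward didn’t prescribe a particular DNS provider; as one hedged example, here is how the low-TTL changeover might look with Amazon Route 53 via boto3, with a placeholder hosted zone and hostnames:

```python
import boto3

r53 = boto3.client("route53")

# Repoint the database CNAME at the freshly promoted master. A short TTL
# means clients pick up the change quickly after a failure.
r53.change_resource_record_sets(
    HostedZoneId="Z0PLACEHOLDER",  # placeholder hosted zone ID
    ChangeBatch={
        "Comment": "Promote replica after master failure",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "db-master.example.com.",
                "Type": "CNAME",
                "TTL": 60,  # low TTL enables rapid changeover
                "ResourceRecords": [
                    {"Value": "db-replica-1.example.com."}
                ],
            },
        }],
    },
)
```

With a 60-second TTL, clients re-resolve the name and reach the promoted master within about a minute of the record change.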

In fact, Ward advocated vigorous testing of your entire cloud infrastructure. For instance, he said, you can hold a “game day,” when half of your staff does nasty things to try to break your test instances while the other half works to fix them, and a small group of engineers takes notes on what each side learns.

Once you have your automation set up and ready to use, make sure that the decision to trigger it is manual. Because every incident is different, someone needs to be responsible for determining whether your business situation demands deploying a particular disaster recovery plan.

Ward noted that RightScale’s MultiCloud Images™ (MCI) make it possible to launch instances across regions without modification. Each ServerTemplate™ contains a list of MCIs, and when you create a server, RightScale chooses the appropriate RightImage™ to launch. RightScale also gives you fine-grained configuration options for the individual parts of the servers you deploy, which aids the automation process.

Bottom line: AWS provides powerful tools to promote high availability and fault tolerance, and RightScale delivers an advanced set of cloud management features that enable you to take better advantage of what AWS provides.