See our new AWS Reboot FAQs for the latest breaking information.
Updated September 25 at 7 PM Pacific Time with additional information.
Today Amazon Web Services (AWS) notified its customers that it will be rolling out an urgent patch to hosts, which will require a maintenance reboot of EC2 instances. The reboots will take place over several days, starting on September 26, 2014, at 2:00 UTC/GMT (September 25, 2014, at 7:00 PM PDT) and ending on September 30, 2014, at 23:59 UTC/GMT (September 30, 2014, at 4:59 PM PDT).
Two characteristics make this event different from similar cases in the past, most notably the December 2011 instance reboot:
- A substantial number of instances will be rebooted (see below for a list of instance types that are not affected). AWS has said that not all instances of the impacted instance types will be rebooted. Update as of September 25: AWS has now said that less than 10 percent of the instances in the EC2 fleet will need to be rebooted.
- If you relaunch an instance before the maintenance, you are not guaranteed to get an already-patched host.
The second point is really the critical one. Normally, whenever our Ops team receives a maintenance notice regarding a specific set of instances, we relaunch them at the earliest convenient time so that by the time the maintenance window arrives, our instances are already on hosts that have had the maintenance done. This time, due to the scale of the patching, there is not enough patched capacity available to guarantee this.
As a result, I recommend the following:
- Read the details of the maintenance notice you receive from AWS.
- Check the AWS console “Events” page for affected instances. The AWS console will be your most up-to-date source, so don’t depend on email notifications.
- Relaunch these instances as soon as possible in a controlled manner to “snatch up” patched host capacity before others get it. You can also try to relaunch on instance types that AWS says will not be affected.
- Wait a while, then check the AWS console to verify that you indeed got patched hosts. Note: AWS has a script running to update the AWS console with maintenance notices on newly launched instances. By midday Thursday Pacific time, AWS hopes to have these notices updating every 1-2 hours.
- If you didn’t, try again a bit later.
- Double-check periodically to make sure no instance is left subject to maintenance.
- Make sure that you have the appropriate alerts set up.
- Plan to closely monitor your AWS-based applications through the maintenance window.
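The console check in the steps above can also be scripted. The following is a minimal sketch, not an AWS-provided or RightScale tool: it assumes the boto3 SDK and the EC2 DescribeInstanceStatus API, and the function names `pending_events` and `fetch_pending_events` are made up for illustration. It lists instances that still have a scheduled event, so you can rerun it after each relaunch to see whether an instance landed on a patched host.

```python
def pending_events(status_response):
    """Flatten a DescribeInstanceStatus response into one row per open event."""
    rows = []
    for status in status_response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            description = event.get("Description", "")
            # Finished or withdrawn events remain listed for a while with a
            # "[Completed]" or "[Canceled]" prefix on the description; skip them.
            if description.startswith(("[Completed]", "[Canceled]")):
                continue
            rows.append({
                "instance_id": status["InstanceId"],
                "az": status.get("AvailabilityZone"),
                "code": event.get("Code"),
                "not_before": event.get("NotBefore"),
                "not_after": event.get("NotAfter"),
            })
    return rows


def fetch_pending_events(region_name):
    """Query one region (assumes boto3 is installed and credentials are configured)."""
    import boto3  # imported lazily so pending_events() works without the SDK

    ec2 = boto3.client("ec2", region_name=region_name)
    paginator = ec2.get_paginator("describe_instance_status")
    rows = []
    # IncludeAllInstances also reports instances that are not currently running.
    for page in paginator.paginate(IncludeAllInstances=True):
        rows.extend(pending_events(page))
    return rows
```

Calling `fetch_pending_events("us-east-1")` for each region you use and alerting when the list is not empty is one way to cover the "double-check periodically" step. Because the console notices may lag behind a relaunch, an empty result shortly after relaunching is not conclusive.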
For instances where a short reboot is safe and acceptable, you don’t need to do anything: They will simply reboot during maintenance (and stay on the same host with the same ephemeral disks and the same IP address).
For databases, if you have set up the recommended master-slave configuration across AZs, you have the option to relaunch the database instance in the impacted AZ ahead of the maintenance window in an attempt to get an instance that is already patched. If that is not successful, you can fail over out of impacted AZs ahead of the maintenance window using the following approach:
- Check the AZ of your master and slave.
- Check your AWS console “Events” page for the maintenance timeframe for your master and slave AZs.
- Clone a new slave DB in a new AZ.
- Adjust your master DB and slave DB as appropriate to avoid the maintenance windows and keep a master and slave DB running at all times.
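To make the window comparison in this approach concrete, here is a small sketch under stated assumptions: the maintenance windows are entered by hand from the console "Events" page, and the helper names `windows_overlap` and `safe_azs_for_new_slave` are hypothetical. It suggests which AZs could host a new slave because their window does not overlap the master's.

```python
from datetime import datetime, timezone


def windows_overlap(a, b):
    """Return True if two (start, end) maintenance windows share any time."""
    return a[0] < b[1] and b[0] < a[1]


def safe_azs_for_new_slave(windows_by_az, master_az):
    """List AZs whose maintenance window does not overlap the master's window."""
    master_window = windows_by_az[master_az]
    return sorted(
        az
        for az, window in windows_by_az.items()
        if az != master_az and not windows_overlap(window, master_window)
    )


def utc(day, hour):
    """Helper for the made-up example data below (September 2014, UTC)."""
    return datetime(2014, 9, day, hour, tzinfo=timezone.utc)


# Hypothetical windows; real values come from your own maintenance notices.
example_windows = {
    "us-east-1a": (utc(27, 2), utc(27, 6)),
    "us-east-1b": (utc(27, 4), utc(27, 8)),
    "us-east-1c": (utc(28, 2), utc(28, 6)),
}
candidates = safe_azs_for_new_slave(example_windows, "us-east-1a")
```

With this example data, `candidates` contains only `us-east-1c`, since the `us-east-1b` window overlaps the master's. The check only rules out simultaneous reboots; you still need to plan the promotion order so that a master and a slave are running at all times.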
If you do not have a master-slave configuration across AZs and it is critical that you have no downtime of your database, you may want to consider setting up a slave DB in another AZ ahead of the maintenance.
RightScale customers can contact support for assistance with these strategies.
Some of the pertinent details per AWS:
- Not all instance types are affected: AWS has said T1, T2, M2, R3, and HS1 instance types are not affected.
- Not all instances of the affected instance types will be rebooted. AWS has said that the AWS console will provide accurate information on the specific instances that will be rebooted.
- All regions and AZs are affected.
- The availability zones (AZs) in a region will undergo maintenance on different days, so on a given day, if your service is replicated across two AZs, only instances in one AZ will get the reboot on that day. Instances of any given account in different regions will not undergo maintenance at the same time. That is, if you have cross-region replication, both sides won’t go down at the same time.
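As a quick first-pass filter, the list of unaffected instance types above can be applied to an inventory. This is only a sketch: the function names are illustrative, and the set of families reflects what AWS had stated at the time of writing. A result of `True` means the instance type may be affected, not that a given instance will be rebooted; the AWS console remains the authoritative source.

```python
# Families AWS has said are not affected (per the list above).
UNAFFECTED_FAMILIES = {"t1", "t2", "m2", "r3", "hs1"}


def family(instance_type):
    """Return the family part of an instance type, e.g. 'm2' for 'm2.xlarge'."""
    return instance_type.split(".", 1)[0].lower()


def may_need_reboot(instance_type):
    """True if the type belongs to a family that was not declared unaffected."""
    return family(instance_type) not in UNAFFECTED_FAMILIES


inventory = ["m2.xlarge", "m3.large", "t1.micro", "c3.2xlarge"]
to_check = [t for t in inventory if may_need_reboot(t)]
```

Here `to_check` holds the `m3.large` and `c3.2xlarge` entries, which are the ones worth looking up on the console "Events" page.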
As usual, AWS is totally tight-lipped about the underlying cause. It seems obvious that the company is patching a security vulnerability, but it will not disclose which one until October 1 — that is, after they have patched all hosts.
We’re curious whether this issue affects other cloud providers as well and how they will react. It will be interesting to see whether some cloud providers’ live migration capabilities allow them to handle the event without visible customer impact.
With regard to the RightScale service, we have started to relaunch instances and do not expect the RightScale Multi-Cloud Platform to be impacted by the AWS reboot. Many of our larger instances are M2s, and therefore are not affected. As a courtesy, we are in the process of contacting all RightScale customers to inform them of this event.
We always recommend that all RightScale customers architect for high availability by leveraging database failover strategies that span availability zones (or regions) and by building in redundancy at all layers. For more information on building resiliency into your cloud architecture, please see Architecting Scalable Applications in the Cloud.
We will be closely monitoring this situation and will post additional details on the RightScale Blog as more information becomes available.