Today’s guest post is by RightScale customer Matt Wise, a senior systems architect at Nextdoor, the free private social network for neighborhoods. To join your neighbors on Nextdoor, visit Nextdoor.com and enter your address to get started. If you are interested in joining the Nextdoor team, visit Nextdoor.com/jobs to see open opportunities.
Make It Faster!
Not so long ago, it wasn’t unusual to wait weeks or months to get new hardware into a data center. Fast forward to today and now I’m fielding questions from our Engineering team about why it takes 15 minutes to boot a server.
As an operations team, our goal is to provide the engineers with the best possible tools so that they can launch and manage their own services. We don’t want to get woken up at 5 AM because a PR event on the East Coast causes a surge of traffic to our site and we suddenly need more capacity. Ideally the services should scale up on their own … but even if they can’t, the engineers responsible for a service should be able to scale it up and down quickly and on demand.
It just so happens that our website itself doesn’t usually receive massive traffic spikes, so we’re able to very easily predict load and scale up/down our customer-facing server farms well ahead of demand.
The same can’t be said for our background task processing systems. Throughout the day we send millions of emails to our users as content is generated in neighborhoods by other users, city municipalities, and police and fire departments. Sometimes this content is broadcast to only a few hundred households, but other times it can reach tens of thousands of households all at once.
You can imagine a city sending out an urgent alert to a group of neighborhoods with 80,000 members. It wouldn’t do them much good if it took us 30-60 minutes to spin up new capacity before we were able to send out all of those alerts. On the other hand, constantly running enough systems to support bursts like that isn’t cost effective at all. This is why it’s critical that some systems can boot up and begin handling traffic quickly.
Speed Through Caching
Much the same way that app developers leverage Memcache/Redis to temporarily store data that may have taken quite a bit of computational power to generate, at Nextdoor we decided that we wanted to do the same thing with servers. We didn’t want to fundamentally give up on our internal model of configuration-on-bootup, but we knew that we’d have to make some compromises there in order to dramatically speed up boot times.
Today we have a dozen or so unique server farms that scale up and down based on different rules. Some of them scale dozens of machines a day and others may not scale up or down for weeks at a time. It’s the former farms — the ones that scale up dozens of machines per day — that we really care about improving.
Ultimately the model we worked out (details below) allows us to define a normal RightScale server array of instances that configure themselves on bootup, and then make a conscious choice to cache a boot image that has 99 percent of the setup work already done. A new server can then boot from the cached state, re-configure a handful of bits, install the latest application code, and begin processing requests.
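In outline, the process looks like this: a server boots from the base image and configures itself from scratch, we image it, and later servers in the array boot from that image, re-apply their configuration, and pull the latest code.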
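For readers who think in code, here is a minimal Python sketch of that decision. It is only an illustration: the step names and the `plan_boot` function are hypothetical and are not taken from our Puppet code or from RightScale.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BootPlan:
    """Which image to launch from, and which setup steps remain."""
    image: str
    steps: List[str] = field(default_factory=list)


def plan_boot(base_image: str, cached_image: Optional[str]) -> BootPlan:
    """Pick the launch image and the setup work still needed at boot."""
    if cached_image is None:
        # Cold boot: everything is installed and configured from scratch.
        return BootPlan(base_image, [
            "install_packages",
            "apply_configuration",
            "deploy_latest_code",
            "start_services",
        ])
    # Warm boot: packages are already baked into the image, so the server
    # only re-applies configuration and fetches the current code.
    return BootPlan(cached_image, [
        "apply_configuration",
        "deploy_latest_code",
        "start_services",
    ])
```

The point of the sketch is that the warm path still runs the configuration step. The cached image is a shortcut, not a replacement for configuration-on-bootup.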
Caching Server Images with RightScale
One of our core principles in the Ops team at Nextdoor is that we try to limit the number of technologies we leverage, but become experts in the ones we do use.
We’ve chosen to leverage RightScale as our main cloud management interface and we use Puppet as our configuration management system. We’ve made the choice to be excellent at systems configuration management and leave the cloud management to RightScale (see our previous post for more context).
With that in mind, when it came time to build the system for creating cloud-specific images, we turned to RightScale and worked with them to develop the code. It just made more sense that they write the image caching part of the code, while we drive the system configuration portion of the code.
What we ended up with is a fairly simple RightScale script that can be executed on a server array (or an individual server). It automatically images the instance, registers the image as an AMI in Amazon, and then re-configures RightScale so that the next server that boots up uses this AMI.
Introducing the EC2 Optimize Image for boot (Preview Release) RightScript written by Cary Penniman at RightScale. This script is easily added to your RightScale managed server and can be executed manually when you are ready to image a host. It works with both S3 and EBS-backed hosts (building either instance-store or EBS-store based AMIs) and automatically reconfigures your server definition so that the next time you launch it, it uses your cached image!
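To make the three steps concrete (image the instance, register the AMI, repoint the server definition), here is a rough Python sketch. It is not the RightScript itself. It assumes an EBS-backed instance and an `ec2` object that behaves like a boto3 EC2 client, and `update_launch_image` is a hypothetical callback standing in for whatever tells your cloud management layer which image to launch next.

```python
from typing import Callable


def cache_instance_image(ec2, instance_id: str, image_name: str,
                         update_launch_image: Callable[[str], None]) -> str:
    """Image an EBS-backed instance, wait for the AMI, then repoint launches.

    `ec2` is assumed to behave like a boto3 EC2 client.
    `update_launch_image` is a hypothetical hook, not a RightScale API.
    """
    # 1. Ask EC2 to create an AMI from the running instance.
    response = ec2.create_image(InstanceId=instance_id, Name=image_name)
    image_id = response["ImageId"]

    # 2. Block until the new AMI is reported as available.
    ec2.get_waiter("image_available").wait(ImageIds=[image_id])

    # 3. Make the next server launch use the cached image.
    update_launch_image(image_id)
    return image_id
```

In real use you would pass in `boto3.client("ec2")` and a function that updates your server array. Instance-store (S3-backed) images need a different bundling process that this sketch does not cover.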
In addition to this script, your servers must be able to configure and re-configure themselves cleanly on bootup. As I mentioned, we use Puppet as our automation engine here at Nextdoor. We took this as an opportunity to greatly improve our Puppet bootstrap process, moving it from old bash-style scripts to … gasp … Chef cookbooks!
We’ve published our public cookbooks in a GitHub repo. These cookbooks are built to be as idempotent as possible while also handling the case where prod-webserver1 is imaged and boots up as prod-webserver2.
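The idea behind that idempotency can be shown with a small Python sketch. It is illustrative only and is not how our cookbooks are written: the `hostname` state file and both function names are hypothetical. The step compares the desired state with what is on disk and only writes when they differ, so it is safe to run on every boot and also corrects a hostname inherited from the imaged server.

```python
from pathlib import Path


def ensure_file_content(path: Path, desired: str) -> bool:
    """Write `desired` to `path` only if it differs. Returns True if changed."""
    if path.exists() and path.read_text() == desired:
        return False  # Already correct: running again changes nothing.
    path.write_text(desired)
    return True


def refresh_identity(state_dir: Path, current_hostname: str) -> bool:
    """Replace host-specific state left over from the server that was imaged."""
    return ensure_file_content(state_dir / "hostname", current_hostname + "\n")
```

On a server booted from an image of prod-webserver1 but launched as prod-webserver2, the first call rewrites the file and returns `True`, and every later call returns `False`.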
When Do We Run the Scripts?
For now, we explicitly and manually execute the image optimization script on individual server arrays every couple of months. In practice this means that as we continue to make Puppet code changes, our images become more and more out of date. That’s OK, though: it just increases the boot time by a few seconds here and there. At some point, we decide to re-create the image by launching a fresh node and imaging it.
Going back to our background task worker example above … it usually takes us an average of 15 minutes to boot a task worker server from scratch using a completely public and basic Ubuntu AMI image in Amazon.
Once we’ve optimized the server array with the script above, our boot time drops to 3.5 minutes. That’s a reduction in boot time of about 77 percent. Not only that, but our bootups are significantly more reliable because all of the external dependencies have already been installed!
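The percentage comes from comparing the 15-minute cold boot with the 3.5-minute cached boot. Here is a quick Python check, where the helper name `boot_time_reduction` is just for illustration:

```python
def boot_time_reduction(before_min: float, after_min: float) -> float:
    """Fraction of boot time saved when going from `before_min` to `after_min`."""
    return (before_min - after_min) / before_min


# 15 minutes from a stock image versus 3.5 minutes from the cached image.
reduction = boot_time_reduction(15.0, 3.5)
print(f"{reduction:.1%}")  # prints 76.7%
```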
That means that we could scale up our task workers within five minutes of an urgent alert being sent out and start churning through thousands of tasks extremely quickly. It also means that we can be more aggressive about shutting down unused capacity as soon as we don’t need it!
This article was originally published on the Nextdoor Engineering Blog.