Google Compute Engine Performance Test with RightScale and Apica


With the resources and flexibility cloud computing provides, any organization can run its applications in the world’s best data centers, on the world’s best networks and servers. Organizations looking to leverage cloud infrastructures want to understand how that translates to real-world application performance. We’ve seen growing interest in Google Compute Engine (GCE) among organizations considering a cloud strategy, so we teamed up with Apica, a third-party website testing, optimization, and monitoring company, to test GCE and see what performance consumers of these resources can expect.

In our test, Apica drove traffic to a standard three-tier web application running on GCE. We used RightScale for Google cloud management to configure, monitor, and auto-scale the application deployment.

During the test, we scaled up to 330,000 page views per minute from 200,000 concurrent users, maxing out at 42 servers on GCE at peak load. To put these numbers in perspective, Evernote states that its application receives an average of 150 million requests per day; our testing on the GCE platform generated nearly double the load Evernote typically experiences.

Test Configuration

The deployment configuration we used in this test was a typical three-tier web architecture consisting of a load-balancing tier, an application tier, and a database. We used WordPress as our test application, but the general architecture and process could apply to any web application.

Load-Balancing Tier

We deployed six load-balancing servers based on a RightScale pre-configured Load Balancer ServerTemplate™, all running on GCE’s n1-standard-8-d machine type. We modified the ServerTemplate to bypass Apache on the load balancer, which can be a resource hog. Because our implementation required no SSL termination, HAProxy handled external access directly.

With eight virtual cores at our disposal on these servers, we ran normal system operations on CPU-0 and tied seven HAProxy processes to CPU-1 through CPU-7 using HAProxy’s nbproc option. The load generated by the Apica test infrastructure was distributed across the six load-balancing servers using the standard DNS round-robin mechanism.
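
The corresponding HAProxy configuration would look roughly like the fragment below. This is an illustrative sketch, not our actual config: the cpu-map directive shown here requires HAProxy 1.5 or later, and the same pinning can be achieved with taskset on earlier versions.

```
# Illustrative haproxy.cfg fragment -- not the actual test configuration.
global
    nbproc 7        # fork seven HAProxy processes
    cpu-map 1 1     # pin process 1 to CPU-1 (HAProxy >= 1.5)
    cpu-map 2 2
    cpu-map 3 3
    cpu-map 4 4
    cpu-map 5 5
    cpu-map 6 6
    cpu-map 7 7     # CPU-0 is left free for system tasks and interrupts

defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend www
    bind *:80       # no SSL termination, so HAProxy faces clients directly
    default_backend app_servers

backend app_servers
    balance roundrobin
    # server lines are added and removed as the application array scales
```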

Application Tier

We implemented the application tier on servers based on the RightScale PHP App ServerTemplate with modifications made to install the PHP-based WordPress application and the Alternative PHP Cache (APC) library for caching to improve performance. The application servers were GCE n1-standard-2-d machines configured in an auto-scaling array.

We configured a RightScale server array to start with a minimum of 15 application servers, a “grow by” value of 5, and a “shrink by” value of 1. In other words, each auto-scaling event in the up direction would add five application servers to the array, while a scale-down would remove one server. Our decision threshold was the standard 51 percent, which means that more than half of the servers had to agree on an action – grow or shrink – before that action would be initiated. Our calm time – the length of time in which votes would be ignored after a scale-up or scale-down event – was set to a very aggressive five minutes, which is outside our best practices (and was one of our lessons learned – more on that later).
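
The voting scheme can be sketched in a few lines of Python. This is a minimal illustration of the logic described above, not RightScale’s actual implementation; all names are our own.

```python
# Majority-vote auto-scaling sketch (illustrative, not RightScale's code).
# After any resize, votes are ignored for the calm time (five minutes here).
def scale_decision(votes, current_size, grow_by=5, shrink_by=1,
                   threshold=0.51, min_size=15, max_size=35):
    """votes: one 'grow', 'shrink', or None per running server."""
    n = len(votes)
    if votes.count("grow") >= threshold * n:        # >51% want to grow
        return min(current_size + grow_by, max_size)
    if votes.count("shrink") >= threshold * n:      # >51% want to shrink
        return max(current_size - shrink_by, min_size)
    return current_size                             # no quorum: no change
```

For example, 8 "grow" votes out of 15 servers clears the 51 percent threshold and adds five servers, while 7 out of 15 does not.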

Database Tier

We configured the database tier using RightScale’s standard MySQL 5.5 Database Manager ServerTemplate. We used only a master database in this test – a violation of one of our best practices for database applications – but because this was not a production site and we had not implemented read/write splitting on the application servers, the simpler configuration was sufficient for this variation of the test. The database server ran on an n1-standard-8-d machine type, the same as the load-balancing servers.

Rather than implement a caching tier, for this test we used only the caching provided by APC on each of the individual application servers.

The Test Load

Our performance partner, Apica, helped us design a real-world test with four load scenarios:

  1. Browse the home page, select a random page, then select a random article
  2. Browse the home page, perform a search, load resulting article
  3. Browse the home page, open random article, post comment
  4. Browse the home page, log in to site, post article, log out

We used weighted randomization to reflect realistic use characteristics of a high-traffic application – that is, the number of times users browsed pages (a read-only operation) was significantly higher than the number of times users posted comments, performed searches, or added new articles. We generated the load from 80 test servers located in eight different geographic regions of North America.
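
Weighted scenario selection of this kind can be sketched in a few lines of Python. The weights below are invented for illustration – the actual test mix was simply skewed heavily toward reads.

```python
import random

# Illustrative scenario mix -- the weights are invented, not the actual test's.
SCENARIOS = {
    "browse_pages": 70,   # home page -> random page -> random article (read-only)
    "search":       15,   # home page -> search -> resulting article
    "post_comment": 10,   # home page -> random article -> comment
    "post_article":  5,   # home page -> log in -> post article -> log out
}

def pick_scenario(rng=random):
    """Pick the next scenario for a simulated user, weighted toward reads."""
    names = list(SCENARIOS)
    weights = list(SCENARIOS.values())
    return rng.choices(names, weights=weights, k=1)[0]
```

Each of the 80 test servers would draw scenarios from a distribution like this for every simulated user session.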

The test data consisted of 350 blog posts drawn from our own RightScale Blog, with static assets (JavaScript, CSS, and images) served from Google Cloud Storage, which we used as a content delivery network (CDN).

The Test – and the Results

Apica simulated more than 200,000 concurrent users during a testing period of one hour, with a ramp period of approximately 20 minutes to introduce the 200,000 concurrent users realistically. During the test we served up to 330,000 page views per minute, with network throughput of approximately 2.3 Gbps and 23,000 requests per second. We served more than 20 million page views using a maximum of 42 servers: 35 n1-standard-2-d machines in our application server array, six load balancers, and a database server. Over the course of the test we had four scale-up events, taking our array to its maximum of 35 app servers, and one scale-down event, reducing the application server count to 34 by the end of the test. Let’s take a graphical look at some of the key numbers.
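
As a back-of-the-envelope check, these headline numbers hang together: 330,000 page views per minute is 5,500 per second, and at 23,000 requests per second that works out to roughly four HTTP requests per page view.

```python
# Quick consistency check of the reported peak numbers.
page_views_per_min = 330_000
requests_per_sec = 23_000

pages_per_sec = page_views_per_min / 60                # 5,500 page views/second
reqs_per_page_view = requests_per_sec / pages_per_sec  # ~4 requests per page view
```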

In the figure below, we’ve used RightScale to zoom in on a portion of a single test and annotated it to indicate the timing of specific relevant events. You can see how each part of the process affected CPU utilization.

A single GCE performance test.

The figure below shows 24 hours of CPU-0 utilization on a typical load balancer. The spikes reflect multiple one-hour test runs. The majority of the CPU time was spent handling interrupts, which is to be expected given that thousands of requests hit the load balancer every second, each requiring CPU cycles to service the interrupt.

GCE CPU-0 performance.

The next figure shows the CPU utilization of one of the non-zero CPUs on the load balancer during the same 24-hour period; each of these CPUs handled a single HAProxy process.

GCE non-CPU-0 performance.

The graph below shows the interface traffic on a typical load balancer over 24 hours of testing. The majority of the packets were outbound, representing content being served to the clients, with the inbound packets comprising much smaller content requests.

GCE network traffic test results.

We did not pre-warm the cache (which you would almost certainly do in a production environment), so every page request to a new server resulted in a cache miss. At the beginning of the test, and any time a new server was added, each application server had to request pages from the database, store the query results in its own cache for subsequent requests, and then return the page to the client. This cold-cache period generated a flurry of CPU activity and a short burst of network-related errors, though the error rate was extremely low – less than 0.1 percent. Once a new server had been running for a short while and had served many requests, its cache was populated, CPU utilization dropped, and the errors ceased.
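
The per-server, cache-aside read path can be sketched as follows. This is a Python stand-in for the PHP/APC stack; the names are illustrative.

```python
# Cache-aside read path as run on each app server (Python stand-in for PHP/APC).
class AppServer:
    def __init__(self, db):
        self.db = db        # shared database (here just a dict)
        self.cache = {}     # per-server cache, cold when the server boots
        self.db_reads = 0   # count of trips to the database

    def get_page(self, page_id):
        if page_id in self.cache:       # warm cache: no database trip
            return self.cache[page_id]
        page = self.db[page_id]         # cold cache: hit the database...
        self.db_reads += 1
        self.cache[page_id] = page      # ...and remember the result
        return page
```

A freshly launched server pays one database read per distinct page; once its cache is warm, repeat requests never touch the database.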

Similarly, every time a test user added a new post to the environment, the cache on all the application servers was invalidated, and there was a mad dash to the database by all the application servers to get up-to-date information. These cache invalidations caused CPU spikes on the application servers (as well as on the database server), which again resulted in a brief spurt of network errors.

Those results reaffirmed what we already knew – that a separate, distributed caching tier is a good thing. All of the errors we encountered can be attributed to cache misses on the individual application servers. With a separate caching tier, there would still be cache invalidation, but only one cache would be invalidated – the one in the separate tier. Every application server after the first to request a piece of content would find it already in the shared cache, without having to retrieve and store it itself. In addition, there would be only one request to the database for each new piece of content, instead of one per application server (up to 35 in our test). In future variations of this test, we plan to add this independent caching tier and compare our results.

Possible Refinements

We learned a few things that should help us improve our testing methodology for the next iteration of these tests. For instance, we chose to forgo a slave database, which is fine for a test bed or proof-of-concept, but would never be advisable for a production environment. As we were using the master database for all reads and writes, a slave would not have contributed in any meaningful way to the test we were executing. In future tests, if we made modifications to the load patterns to generate more database writes, and thus stress the database a bit more, we could use one or more slaves and do read/write splitting to improve our database performance. In the current test, the database was not stressed, so these additions were not necessary.
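
If we did add one or more slaves, the read/write split could be sketched like this – a hypothetical router in Python (the actual application tier was PHP, and this test used a lone master):

```python
import random

# Hypothetical read/write splitting router -- not part of the actual test setup.
class SplitRouter:
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def route(self, sql):
        # Plain SELECTs can fan out across the slaves; everything else
        # (INSERT/UPDATE/DELETE, DDL) must go to the master.
        if sql.lstrip().upper().startswith("SELECT"):
            return random.choice(self.slaves)
        return self.master
```

In practice this routing is usually done by a database proxy or the application framework rather than hand-rolled code.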

We also chose not to implement a separate, distributed caching tier. We expected the cache to be a problem (particularly on startup), and wanted to verify our assumptions, which were indeed confirmed. In future tests we plan to use a separate caching tier.

On the configuration side, we used a very low calm time – just five minutes. As it turned out, this value was too low. It typically took about seven minutes for a server to become fully operational, so on occasion a second round of voting occurred before the servers launched as a result of the previous vote had a chance to enter the load-balancing pools.

So why did we choose five minutes? During our preliminary tests we found that a base GCE server booted in less than three minutes, but once we configured that server to install Apache, PHP, WordPress, the required plugins, the connection to the load balancer, and all the other necessary accoutrements, servers took about seven minutes before they began handling some of the workload. Our best practice is to set a calm time of “boot time of the server plus a little extra,” which we did not calculate correctly. As a result we ended up with extra servers in the mix until they were scaled down during the next voting cycle. This process had no negative impact on the application, but we incurred some minimal extra expense from the launch of unnecessary additional servers.
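
The calm-time arithmetic we should have applied is simple; the margin figure below is an illustrative choice, not a prescribed value.

```python
# Calm time should be the server's time-to-serving plus a safety margin.
boot_to_serving_min = 7   # observed: fully configured app server joins the pool
margin_min = 2            # "a little extra" -- illustrative choice
calm_time_min = boot_to_serving_min + margin_min   # 9 minutes, not the 5 we used
```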

During our tests we created a massive load that simulated a real-world application experiencing a planned promotional push or viral-type event using a fairly vanilla configuration with very little tuning or tweaking. Through this process we showed that GCE, managed by RightScale, can help deploy, run, and manage intensely demanding applications on the cloud. Throughout the testing process, GCE exhibited extremely high performance, low complexity, and great flexibility. As a result, I am very excited about what GCE has to offer.

To see for yourself how RightScale can give your organization a powerful solution for Google cloud management, sign up for a RightScale free trial.

We extend a special thanks to Apica for the use of its tools and expertise in building our real-world load test. Apica provides proven technology for optimizing the performance of cloud and mobile applications, and offers cloud-based load testing and web performance monitoring tools to test applications for maximum capacity, daily performance, improved load times, and protection from peak loads.