Capacity planning for Etsy’s web and API clusters

Article Tuesday, November 6 2018

This Article was originally written by Daniel Schauenberg on Code as Craft. Reposted with permission.

Capacity planning for the web and API clusters powering etsy.com has historically been a once per year event for us.The purpose is to gain an understanding of the capacity of the heterogeneous mix of hardware in our datacenters that make up the clusters. This was usually done a couple of weeks before the time we call Slush. Slush (a word play on code freeze) is the time range of approximately 2 months around the holidays where we deliberately decide to slow down our rate of change on the site but not actually stop all development. We do this to recognize the fact that the time leading up to the holidays is the busiest and most important time for a lot of our sellers and any breakage results in higher than usual impact on their revenue. This also means it’s the most important time for us to get capacity estimates right and make sure we are in a good place to serve traffic throughout the busy holiday season.

During this exercise of forecasting and planning capacity someone would collect all relevant core metrics (the most important one being requests per second on our Apache httpd infrastructure) from our Ganglia instances and export them to csv. Those timeseries data would then be imported into something that would give us a forecasting model. Excel, R, and python scripts are examples of tools that have been used in previous years for this exercise.

After a reorg of our systems engineering department in 2017, Slush that year was the first time the newly formed Compute team was tasked with capacity planning for our web and api tiers. And as we set out to do this, we had three goals:

Determine capacity needs
Make it an easily repeatable process
Do it more frequently than once per year

First we started with a spreadsheet to track everything that we would be capacity planning for. Then we got an overview of what we had in terms of hardware serving those tiers. We got this from running a knife search like this:

knife search node "roles:WebBase" -a cpu.0.model_name -a cpu.cores -F json

and turning it into CSV via a ruby script, so we could have it in the spreadsheet as well. Now that we had the hardware distribution of our clusters, we gave each model a score so we could rank them and derive performance differences and loadbalancer weighting scores if needed. These performance scores are a rough heuristic to allow relative comparison of different CPUs. It takes into account core count, clock speed and generational improvements (assuming a 20% improvement between processors for the same clock speed and core count). It’s not an exact science at this point but a good enough measure to get us a useful idea of how to compare different hardware generations against each other. Then we assigned each server a performance score, based on that heuristic.

Next up was the so called “squeeze testing”. The performance scores weren’t particularly helpful without knowing what they mean in actual work a server with that score can do on different cluster types. Request work on our frontend web servers is very different than the work on our component api tier for example. So a performance score of 50 means something very different depending on which cluster we are talking about.

Squeeze testing is the capacity planning exercise of trying to see how much performance you can squeeze out of a service, usually by gradually increasing the amount of traffic it receives and how much it can handle before exhausting its resources. In the scenario of an established cluster this is often hard to do as we can’t arbitrarily add more traffic to the site. That’s why we turned the opposite dial and removed resources (i.e. servers) from a cluster until the cluster (almost) started to not serve in an appropriate manner anymore.

So for our web and api clusters this meant removing nodes from the serving pools until they drop to about 25% idle CPU and noting the number of requests per second they are serving at this point. 20% idle CPU is a threshold on those tiers where we start to see performance decrease due to the rest of the CPU time being used for tasks like context switching and other non application workloads. That means stopping at 25% gives us headroom for some variance in this type of testing and also means we weren’t hurting actual site performance while doing the squeeze testing.

Now that we got the number of requests per second we could process based on the performance score, the only thing we were missing was knowing how much traffic we expect to see in the coming months. This meant in the past – as mentioned above – that we would download timeseries data from Ganglia for requests per second for each cluster for every datacenter we had nodes in. Then that data needed to be combined to get the total sum of requests we have been serving. Then we would take that data and stick it into Excel and try a couple of Excel’s curve fitting algorithms, see which looked best and take the forecasting results based on fit. We have also used R or python for that task in previous years. But it was always a very handcrafted and manual process.

So this time around we wrote a tool to do all this work for us called “Ausblick”. It’s based on Facebook’s prophet and automatically pulls in data from Ganglia based on host and metric regexes, combines the data for each datacenter and then runs forecasting on the timeseries and shows us a nice plot for it. We can also give it a base value and list of hosts with perfscores and ausblick will draw the current capacity of the cluster into the plot as a horizontal red line. Ausblick runs in our Kubernetes cluster and all interactions with the tool are happening through its REST API and an example request looks like this:

% cat conapi.json
{ "title": "conapi cluster",
  "hostregex": "^conapi*",
  "metricsregex": "^apache_requests_per_second",
  "datacenters": ["dc1","dc2"],
  "rpsscore": 9.5,
  "hosts": [
    ["conapi-server01.dc1.etsy.com", 46.4],
    ["conapi-server02.dc1.etsy.com", 46.4],
    ["conapi-server03.dc2.etsy.com", 46.4],
    ["conapi-server04.dc1.etsy.com", 27.6],
    ["conapi-server05.dc2.etsy.com", 46.4],
    ["conapi-server06.dc2.etsy.com", 27.6],
    ["conapi-server06.dc1.etsy.com", 46.4]
  ]
}
% curl -X POST http://ausblick.etsycorp.com/plan -d @conapi.json --header "Content-Type: application/json"
{"plot_url": "/static/conapi_cluster.png"}%

In addition to this API we wrote an integration for our Slack bot to easily generate a new forecast based on current data.

Ausblick Slack integration

And to finish this off with a bunch of graphs, here is what the current forecasting looks like for some of our internal api tiers, that are backing etsy.com:

Ausblick forecast for conapi cluster

Ausblick forecast for compapi cluster

Ausblick has allowed us to democratize the process of capacity forecasting to a large extent and given us the ability to redo forecasting estimates at will. We used this process successfully for last year’s Slush and are in the process of adapting it to our cloud infrastructure after our recent migration of the main etsy.com components to GCP.