The GoDaddy Web Hosting Performance team
The GoDaddy Web Hosting Performance team is the group within GoDaddy responsible for tuning our web hosting environment for peak performance. To do so, we measure the response time of a control page loaded on every one of our web servers, every 5 minutes. An Application Performance Index (APDEX) score is calculated for every server, and the server scores roll up to product location, then to product, then to product group. If a performance issue arises, the Hosting Operations Center team has a prioritized list of servers to investigate. The company has great situational awareness of the performance levels of its products and services.
Every server has a “control site” that we use for performance monitoring. The control site is a WordPress page on Linux and a DotNetNuke page on Windows. A headless browser simulates a full page load of the control site on a regular basis. Time to first byte, full page load time, and many other interesting metrics are pushed into Graphite.
An APDEX score is calculated for every server from data pulled through the Graphite API. The scores are listed in a table sorted by APDEX, which gives the Web Hosting Operations Center team a frequently updated, prioritized list of servers.
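As a rough illustration of the step from Graphite to raw samples, here is a minimal sketch that parses a payload shaped like the JSON returned by Graphite's render API (a list of series, each with a `target` and `[value, timestamp]` datapoints). The metric name and the values in `SAMPLE` are made up for the example and are not GoDaddy's actual naming scheme or data.

```python
import json

# Hypothetical payload in the shape of a Graphite render response (format=json).
# The target name and the numbers are illustrative assumptions only.
SAMPLE = """
[
  {
    "target": "hosting.web-01.control_site.full_page_load_ms",
    "datapoints": [[1840.0, 1700000000], [null, 1700000300], [2210.0, 1700000600]]
  }
]
"""


def extract_samples(render_json):
    """Return {target: [values]} with missing (null) polls dropped."""
    series_list = json.loads(render_json)
    samples = {}
    for series in series_list:
        # Each datapoint is a [value, timestamp] pair; value is None when
        # no measurement was recorded for that interval.
        samples[series["target"]] = [
            value for value, _timestamp in series["datapoints"] if value is not None
        ]
    return samples


samples = extract_samples(SAMPLE)
for target, values in samples.items():
    print(target, len(values), values)
```

Dropping null datapoints is one possible choice; depending on your setup, you might instead want to count a missing poll as red.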
APDEX = (green + yellow/2)/sample_count
APDEX is a number between 0 and 1, where 1 means every poll was green and 0 means every poll was red.
Two thresholds on the full page response time determine whether each poll is green, yellow or red. (If you’re doing this for your organization, you have to determine what the acceptable, green, threshold value should be for you, based on a number of factors, including page size, number of requests, third-party objects, latency between poller and web server, and more.)
For example, if 288 polls were done in the last 24 hours (one every 5 minutes), and 270 were green, 12 yellow, and 6 red, then the APDEX score is (270 + 12/2)/288 = 276/288 ≈ 0.958.
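The calculation above can be sketched in a few lines of Python. The threshold values used here (2 and 8 seconds) are placeholders chosen for illustration, not GoDaddy's actual thresholds, and the function names are hypothetical.

```python
def classify(load_time_s, green_max_s, yellow_max_s):
    """Bucket one poll by its full page load time, in seconds."""
    if load_time_s <= green_max_s:
        return "green"
    if load_time_s <= yellow_max_s:
        return "yellow"
    return "red"


def apdex(green, yellow, sample_count):
    """APDEX = (green + yellow/2) / sample_count."""
    if sample_count <= 0:
        raise ValueError("sample_count must be positive")
    return (green + yellow / 2) / sample_count


# Assumed thresholds for illustration only: green up to 2 s, yellow up to 8 s.
print(classify(1.4, green_max_s=2.0, yellow_max_s=8.0))  # green

# The worked example from the text: 270 green, 12 yellow, 6 red out of 288.
score = apdex(green=270, yellow=12, sample_count=288)
print(round(score, 3))  # 0.958
```

Red polls never appear in the numerator; they only pull the score down by counting toward the sample total.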
Suppose you had these 3 servers and their associated APDEX scores:
In this case, Server C had the lowest APDEX score. The Web Hosting Operations Center performance admins will focus on this server first by drilling into a mountain of other metrics, like CPU utilization, memory consumption, bandwidth, disk i/o, user resource and more. They will diagnose the root cause and apply a resolution.
Rinse and repeat continuously
When the worst server is fixed, then the second worst becomes the worst. Rinse and repeat continuously, and pretty soon, even the worst performing server is doing really well.
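The worst-first ordering is just a sort on the score. Here is a minimal sketch; the host names and scores are invented for the example and do not represent real servers.

```python
def worst_first(scores):
    """Return (host, apdex) pairs sorted from lowest score to highest."""
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical hosts and scores, for illustration only.
scores = {"web-01": 0.991, "web-02": 0.958, "web-03": 0.874}

queue = worst_first(scores)
for host, score in queue:
    print(f"{host}: {score:.3f}")
```

Once the first host in the queue is fixed and its score recovers, re-running the sort puts the next-worst host on top.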
When you’re talking about thousands of servers and millions of customers, this is a full-time job for a whole team of people. When problems are found to be repeating, the details are fed back to the development team. The development team then automates the resolutions, which reduces the hours required to solve each problem, makes the human operators more efficient, and shortens the time to resolution.
This description of the procedure is a simplified example. In reality, the APDEX Scorecard is broken down by product and geographic location. There are also graphical views showing the current state and day-over-day trends. GoDaddy monitors performance closely and works hard to resolve issues quickly. Performance incidents are a big deal at the office and are taken very seriously; the team will work around the clock until they are resolved.
What performance monitoring approaches do you use?