You're on the hook for performance testing an application. Your executives ask, "Is it fast enough?"
The question is impossible to answer without some key information: What does the customer care about? How many users do we need to scale to? And how fast is "fast enough," anyway? To figure this out, we need a conversation.
One of the key points for that discussion is the performance metrics: how fast the system actually is right now. In this article, performance expert Oliver Erlewein and I explore different means for gathering and interpreting those metrics, including common graphs and terminology.
Before you fire up your performance testing tool and gather data…
There are quite a few tests that can be done to improve performance of an application without touching a performance test tool. You will need a browser, some plug-ins, a calculator, server logs and a good old stopwatch.
The Calculator: Take a calculator (or Excel) to the figures we already have from requirements. This is very situation-dependent and is best explained with a simple example:
Take a B2B photo website that caters for professional photographers. They will be up/downloading big 10MP pictures to their accounts. Somewhere in the requirements you read that the website should support 10,000 users. Let's make some assumptions around that. Let's say a 10MP picture is on average 4 MB, and that 400 users are actively up/downloading pictures at peak time. That means we're talking about 1.6 GB of data having to traverse the network within a reasonable amount of time. If your servers are connected with a 10 Mbit/s uplink, that kind of traffic would take more than 20 minutes (1.6 GB is about 12,800 megabits, or roughly 1,280 seconds at 10 Mbit/s). We doubt that would be acceptable. You can raise a risk for bandwidth immediately and in the next step design a performance test to prove your point.
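The back-of-the-envelope calculation above can be written down as a minimal sketch. The function name and parameters are made up for illustration, and it assumes decimal units (1 MB = 8 megabits) and no protocol overhead, so the real figure would be somewhat worse.

```python
def transfer_minutes(active_users: int, mb_per_user: float, uplink_mbit_per_s: float) -> float:
    """Minutes needed to move one picture per active user across the uplink."""
    total_megabits = active_users * mb_per_user * 8  # 1 byte = 8 bits
    seconds = total_megabits / uplink_mbit_per_s
    return seconds / 60


# Figures from the photo website example: 400 users, 4 MB each, 10 Mbit/s uplink.
minutes = transfer_minutes(400, 4, 10)
print(f"{minutes:.1f} minutes")  # prints "21.3 minutes"
```

Changing any one of the three inputs shows quickly which assumption the risk is most sensitive to.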
Server logs: Most Web servers can produce output for every page hit and store it in the server logs. With a little bit of math we can use these logs to find which operations are the most common, and developers can typically also add timers to the logs, showing how long it took the Web server to conduct each operation. With a small Perl script we can aggregate this information and report it (more below).
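As a rough illustration of that kind of aggregation, here is a minimal sketch. It is written in Python rather than Perl, and the log format (`<method> <path> <status> <milliseconds>`) and the sample lines are invented for the example; a real server log would need its own parsing.

```python
from collections import defaultdict

# Hypothetical log lines: "<method> <path> <status> <milliseconds>".
SAMPLE_LOG = """\
GET /login 200 120
GET /search 200 480
GET /login 200 180
POST /checkout 500 950
GET /search 200 520
"""


def aggregate(log_text):
    """Return {path: (hit_count, mean_ms)} for every path in the log."""
    durations = defaultdict(list)
    for line in log_text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not match the assumed format
        _method, path, _status, millis = parts
        durations[path].append(float(millis))
    return {path: (len(ms), sum(ms) / len(ms)) for path, ms in durations.items()}


report = aggregate(SAMPLE_LOG)
# Most frequently hit operations first.
for path, (hits, mean_ms) in sorted(report.items(), key=lambda kv: -kv[1][0]):
    print(f"{path:<12} hits={hits:<3} mean={mean_ms:.0f} ms")
```

Sorting by mean time instead of hit count turns the same data into a list of the slowest operations.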
Browser Plugins: If you'd like to do performance evaluation but want something a little more accurate, you might consider a browser plugin, such as YSlow for Firefox. In addition to reporting accurate page download time, YSlow can report the time for each file and graphic, analyze the way the webpage is displayed and suggest fixes to improve download speed.
The Stopwatch: Oh, don't look at me like that. The stopwatch has some real advantages; it's cheap, it's fast, and it's easy. There's no need for instrumentation, no framework or infrastructure to develop. The stopwatch actually measures the full apparent time to the customer, including the idiosyncrasies of the browser and the operating system. The two tradeoffs of the stopwatch are accuracy and cost. When it comes to accuracy, a difference of up to a half a second may not be noticeable, so accuracy might not be that much of a tradeoff. If your team is delivering weekly builds, the stopwatch isn't all that expensive. For continuous testing or monitoring, however, the stopwatch becomes prohibitively expensive rather quickly.
These kinds of experiments can focus your testing or mitigate risks early on. They often make the step of "proof by metrics" unnecessary.
Identifying scenarios for performance testing
There are probably dozens of scenarios, possibly hundreds. Yet I find that on most modest and medium-sized software projects, there is a small handful of operations that are critical and representative. It may be login, search, creating a new page, or saving an update to your user's profile. Or it might be checkout, or some other "transactional" application. Either way, we'll find value in identifying these core operations, then all of the directions in which they can be stretched. For example, search might be very fast for a small database, but at around ten thousand users or datasets, the machine suddenly runs out of memory and goes to disk; performance then degrades rapidly. When we talk about testing, our typical dimension is the number of simultaneous users, but that implies a static dataset. With more modern "Web 2.0" applications, the data itself can change over time. So yes, we may want to represent the test results for each operation at one, ten, one hundred and one thousand users. We also might want to have those users stress the system for a week and then look at the test numbers, creating another column on our report.
Reporting performance test results
I'm going to speak of two different types of test results: formal test results and informal, "what actually happened" results. We get the first by conducting an evaluation, taking notes, and creating charts and graphs. The second type we get by pulling the log of a staging (or production) server every day and aggregating by type of operation. Here's one illustrative example, based on login. You'll notice that in addition to simultaneous users, I also mention expected supportable customers. Load-generating software often uses continuous clicks, and human beings don't work that way. We read, we think, we type slowly, we get a cup of coffee and go do other things. So we'll look at the logs and consider the typical pause between operations. Likewise, customers don't all live on the application at the same time; a typical Facebook user might only be online for an hour a day. We'll still have some 'peak hours', but for a given number of simultaneous users, there is a higher number of expected supportable customers. How we determine those numbers will vary greatly depending on the application and its users.
|Number of simultaneous users||Expected Supportable Customers||Time in seconds|
If the performance testing is done on a project that is not in production yet and does not have a predecessor application, it is necessary to guess at the average pause between clicks. Normal values are 30 to 60 seconds, longer if there is a lot of readable data involved. If your test shows 1.5 transactions per second, that would equate to 1.5 x 30 s = 45 users, or in the case of 60 seconds, 1.5 x 60 s = 90 users using the system.
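The estimate above is simply throughput multiplied by the pause between clicks. A minimal sketch, with a made-up function name, and assuming response times are small compared with the pause:

```python
def supported_users(transactions_per_second: float, pause_seconds: float) -> float:
    """Users who could be active if each one clicks once per `pause_seconds`."""
    return transactions_per_second * pause_seconds


# The two pause lengths from the text, at 1.5 transactions per second.
for pause in (30, 60):
    print(f"{pause} s pause -> {supported_users(1.5, pause):.0f} users")
# prints "30 s pause -> 45 users" and "60 s pause -> 90 users"
```

Because the pause is a guess, it can be worth reporting the result as a range rather than a single number.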
Performance test tools will give you heaps of data; so much you may feel like you are seeking a needle in a haystack. Yet these metrics are generally based on just three sets of information: What was the request for, when was the request made and how long did it take. From this you can deduce the following key metrics:
n = number of responses
s = the sum of all response times
|Average||s / n (https://en.wikipedia.org/wiki/Average)|
|90th percentile||If you have the sorted list of response times, this is the response time at position (n * 0.90); the position is usually rounded up. (https://en.wikipedia.org/wiki/Percentile)|
|Runtime||The number of seconds/minutes/hours the test ran, should that be variable|
|Min/Max RT||The minimum and maximum response times, although their use is very limited|
|Errors||The number of errors that have occurred (only if the responses are verified)|
|Throughput||Throughput here means average requests/responses per second. This is not the KB/MB throughput|
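To make the definitions in the table concrete, here is a minimal sketch that computes them from a list of response times. The function names and the sample data are invented for illustration; the percentile uses the rounded-up position described in the table, and the code assumes at least one response was recorded.

```python
import math


def percentile(sorted_times, percent):
    """Value at 1-based position ceil(n * percent / 100) of a sorted list."""
    position = math.ceil(len(sorted_times) * percent / 100)
    return sorted_times[max(position, 1) - 1]


def summarize(response_times, runtime_seconds, errors=0):
    """Compute the key metrics; assumes a non-empty list of response times."""
    n = len(response_times)  # number of responses
    s = sum(response_times)  # sum of all response times
    ordered = sorted(response_times)
    return {
        "average": s / n,
        "p90": percentile(ordered, 90),
        "min": ordered[0],
        "max": ordered[-1],
        "errors": errors,
        "throughput": n / runtime_seconds,  # responses per second
    }


# Made-up response times in seconds, including one slow outlier.
sample = [0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.7, 9.0]
for name, value in summarize(sample, runtime_seconds=5).items():
    print(f"{name:<10} {value}")
```

With this sample the single 9-second response pulls the average up to about 1.29 seconds, while the 90th percentile stays at 0.7 seconds.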
These are the simplest metrics you can get from performance testing. You can calculate these for a series of requests, each request and even redirects. This means you will have lots of numbers very quickly. The issue then becomes analyzing what they mean. These figures challenge not only the stakeholders, but the performance testers. It is therefore advisable to break the numbers down into simpler metrics.
The above metrics start to relate back to the requirements. Probably the most common mistakes we find in performance requirements, though, are around numbers. One common ambiguous requirement is that the "system should respond within three seconds." Is that an average? A hard maximum? At what throughput? It is extremely complex to set good performance targets, especially on systems that are not conceived yet. Most of the time it comes down to pure guesswork.
Often requirements refer to averages, but an average can hide extreme results, and a single vast outlier can skew it. So instead we recommend a metric we view as underrated: the 90th percentile. Of course, the 90th is something of a standard, but the 85th, 95th and 99th are common too. Looking at multiple percentiles can also reveal interesting information.
The 90th percentile cuts out the long responses due to errors or other temporary factors. It leaves the tester with a very clear picture of how an application behaves most of the time. Comparing the average with the 90th percentile gives you an impression of deviation of results. So even if you do not calculate the standard deviation (or any comparable figure) this will still give you an impression.
The other important metric is the throughput. I commonly use throughput of functions. The requirement can be something like "with this server we can do three logons per second." This is actually the figure your stakeholders most likely want when they ask, "Is the software fast enough?" This throughput metric can then be used as a high-level benchmark for further tests. Additionally, you could calculate throughput in KB/s, but this is easier to measure server side with monitoring software that gives you the actual network load at the point of origin. As we saw above, the throughput figure can also be put in relation to actual users.
In our experience there are three key graphs needed to evaluate a performance test.
The first and most important one is the results scatter graph.
The scatter graph shows all results (in grey), the running average (thick red line) and the 90th percentile (thin red line). There are many valuable deductions that can be made from this graph that will not be apparent in the number based metrics. You will see things like banding, times with no results, the distribution of results, whether your test is working as expected, whether the average stabilizes and many more. I think that the analysis of this graph warrants an article by itself.
Most commercial tools provide rudimentary graphing capability. Another option is to take the test output and manipulate it with a programming language or import the data into something like Microsoft Excel and generate graphs there.
Next is the throughput graph.
This graph demonstrates throughput (and possible degradation) over time. We recommend throughput in minute slices, but you could just as well create charts on a per-second basis. Any dip or bump on this graph implies some hiccup to investigate. Large changes in throughput should be rare if the application under test is running correctly and the test is proven to be stable. So this graph should always look as above if nothing is wrong, thereby verifying benchmark tests. Above you can also clearly see the performance test's ramp up/down phases common to automated tests.
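The data behind a throughput graph can be produced by counting responses per time slice. A minimal sketch, assuming the tool's output can be reduced to completion times in seconds since the start of the test (the input format and function name are assumptions):

```python
from collections import Counter


def throughput_per_minute(timestamps):
    """Count responses in each one-minute slice; index i holds minute i."""
    if not timestamps:
        return []
    counts = Counter(int(t // 60) for t in timestamps)
    # Include empty minutes as 0 so gaps in the results stay visible.
    return [counts.get(minute, 0) for minute in range(max(counts) + 1)]


# Made-up completion times in seconds since test start.
sample = [5, 20, 61, 62, 75, 119, 190]
print(throughput_per_minute(sample))  # prints [2, 4, 0, 1]
```

The zero in the third slice is exactly the kind of dip that would warrant investigation.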
The last type of graph groups results by response time, showing how many results fall within each response-time range. Again, it is left up to you to choose the best size for those ranges. We recommend 100 ms for most commercial applications.
The shape shown in the above graph is what an ideal distribution looks like. Applications that have performance issues show graphs that have a double hump like a camel's. Or they may have a bunch of results far to the right of the graph. From the graph you also get a better idea of your average and average distribution. You can see how many responses are grouped where.
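The grouping behind a distribution graph like this can be sketched as follows. The function name and sample data are invented for illustration, and the bucket size defaults to 100 ms; the text bars are only a stand-in for a real chart.

```python
from collections import Counter


def histogram(response_times_ms, bucket_ms=100):
    """Map each bucket's lower bound (in ms) to the number of responses in it."""
    counts = Counter(int(t // bucket_ms) * bucket_ms for t in response_times_ms)
    return dict(sorted(counts.items()))


# Made-up response times in milliseconds, with one result far to the right.
sample = [80, 120, 130, 150, 210, 240, 260, 310, 1450]
for start, count in histogram(sample).items():
    print(f"{start:>5} ms  {'#' * count}")
```

A second cluster of bars, or a lone bar far from the rest as in this sample, is the signal to look for.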
Detailed graph analysis is a topic for another time, but the above graphs are your performance testing bread and butter. They tell you when something is not quite right. As with all things in life, there are certainly more options, and other graphs and metrics that show x or y, but the ones above will catch about 90-95% of your performance issues. Don't be fooled into thinking that more is really more. Keep metrics and graphs as simple as possible for as long as you can manage.
One of my favorite exercises is looking at any graph or metric and coming up with multiple reasons it could have occurred. Stakeholders do this too; if they are not told a story about the metrics, they will make up their own. So explaining the test results to the stakeholders is crucial to avoid misinterpretation. For example, Oliver makes a special effort in his test exit reports to explain in detail what every metric means. Every figure will be accompanied by an analysis detailing what this means for the application under test.
Production Monitoring metrics
One area that doesn't get enough attention is monitoring production applications. If every new page or update in a system adds a row to a database, that means the database could become very large. At the same time, it doesn't take many months of 30% growth to move a company from gleefully happy about growth to seriously concerned about systems engineering.
If you still have access to Web server logs, it is straightforward to aggregate performance data and create a report weekly. At Socialtext, where I work, the engineering team wrote a script to create a report; it runs weekly and creates a wiki (Web) page, with data something like this:
The text at left is a sub-URL: a specific operation on the Web server. You'll notice that each column can be sorted, so the team can look at which operations are slowest, look for outliers, or look for which operations are the most commonly called. This allows the team to watch production and find and fix 'hot spots' proactively, without doing additional expensive testing for each two-week iteration.
Sooner or later, someone is going to ask, "Is it fast enough?" I don't particularly like that question, so I try to get out in front of it and have a set of defined performance metrics before the question is even asked. Then, when I get the question, I can show the metrics to the stakeholder and ask, "...what do you think?"
About the author: Currently a technical staff member at Socialtext, Matt Heusser has been developing, testing or managing software projects for over a decade. He teaches information systems courses at Calvin College and is the original lead organizer of the Great Lakes Software Excellence Conference, now in its fourth year. You can read more of Matt's writings on his blog, "Testing at the Edge of Chaos".