Testing for performance, part 2: Build out the test assets

In this second article of our three-part series on testing for performance, Michael Kelly looks at how to build test assets and the work required to support that effort.

Mike Kelly, software tester

In this second article of our three-part series on testing for performance, we look at building our test assets and the work required to support that effort. As a reminder, the Testing for Performance series is broken into the following parts:

  • Assess the problem space: Understand your content and the system, and figure out where to start
  • Build out the test assets: Stage the environments, identify data, build out the scripts and calibrate your tests
  • Provide information: Run your tests, analyze results, make them meaningful to the team and work through the tuning process

As we look at the various environments, tools, data and scripts we use to build our tests, we will try to tie those artifacts back to the work that we did in the first article when we assessed the problem space. Often, as we build out the test assets, our understanding of the problem space changes, becoming more (or sometimes less) detailed, and often we are required to reconsider some of the decisions we made before we got down into the details.

Understand and manage your tools and test environments
The first thing we need to look at, normally before starting to build out any test assets, is the environment and tooling. I've seen very talented performance testers present results that weren't useful because they tested using the wrong version of the application. I've also seen project teams identify a performance problem only to discover that they had no way to isolate and debug it. And, unfortunately, I've been part of a performance testing team where we couldn't execute our tests on time because our performance tool environment couldn't go more than a day without throwing errors or losing results.

In each of those instances, the team let one of three key environment factors get away from them. They didn't correctly identify their needs up front, didn't build or buy the right tools for their problem, misconfigured something, or failed to coordinate their testing. The three key areas where I've seen teams fall apart are managing the system/application being tested, managing the tools to support monitoring and debugging, and managing the tools to support performance test execution. For each of those areas, you must identify what you need, install the required software and hardware, configure it to fit your project and then coordinate how it gets used going forward.

In the following table I present one way of thinking about how to manage your tools and test environments. It's only an illustration of some of the questions you might ask.

Depending on what you are performance-testing and how you're testing it, you may have a much larger list. Building a matrix like this up front may help you and your team members think about everything you're going to need to be successful. Some of these questions will be answered with purchases, others with processes, and others may be left unanswered (and that may be OK).

Identify, create, and manage your test data
In the article Using IBM Rational Functional Tester to test financial services applications, I wrote about five ways to select data for automated testing: based on risk, based on requirements, based on availability, using production data, or using randomly generated data. Below, I've adapted each of those data selection techniques for performance testing. As you think about your performance test project, think about how each of these techniques can support the different types of tests you have planned.

Selecting data based on risk: When you identify risks, you consider what can go wrong. I find Scott Barber's FIBLOTS mnemonic for workload models to be helpful when thinking about performance testing risks. Each letter of the mnemonic helps us think about a different aspect of risk. Here's a summary of FIBLOTS applied to test data:

  • Frequent: What data is most frequently used (examples: login ids, reference data, etc.)?
  • Intensive: What data is most intensive (examples: wildcard searches, large file uploads, values requiring conversion, etc.)?
  • Business critical: What data supports the business processes that absolutely have to work (examples: month-end processing, creation of new accounts, etc.)?
  • Legal: What data supports processes that are required to work by contract?
  • Obvious: What data supports processes that will earn us bad press if they don't work?
  • Technically risky: What data is supported by or interacts with technically risky aspects of the system (examples: new or old technologies, places where it has failed before, etc.)?
  • Stakeholder-mandated: What data have we been asked/told to make sure we test?

Selecting data based on test scenarios: You'll need to select data that allows you to test the specific scenarios identified by your usage models and test scenarios (from part one of this series). If your application has different roles, what data do you need to exercise each role? The scenarios you outline in your modeling phase may need variation across the user population. If they do, make sure you understand what variety is required and where you'll get the different permutations on that data. You'll need to know that when it comes time to create your test scripts.

Selecting data based on availability: You may want to select data that's readily available. This could be production data (discussed below) that's in an easy-to-access format, data from past iterations, spreadsheets used by manual testers for your project, data from other projects or teams in your company, or data from some data generation source (also discussed below). The idea here is that if the data is easily accessible, as well as usable and meaningful, then including this data in your performance testing can save time and money. I emphasize usability and meaningfulness because it's important that you don't select data just because it's there and ready to be used.

Using production data: Depending on the type of performance testing you're doing, another strategy to gather test data is to use production data. Although you may not want to rely solely on this type of data, it can be one of the richest sources of scenarios for performance testing. That's because the data is representative of real scenarios the application will face and because it will most likely provide a high number of scenarios. Running a production data set under different conditions can also be a useful way to tune and debug the system from a performance perspective.

There are some caveats about using production data. Production data will most likely not contain many of the special cases you'll want to test for, and it's not a replacement for well-thought-out test scenarios. Depending on your context, there could also be some legal issues surrounding the use of your production data.

Using data generation: Many performance-testing tools include test data generators. Data generators can be especially helpful in generating large sets of data that all need to match a specific pattern. Many tools support generating data using regular expressions. Even if they don't, you can use a simple scripting language like Ruby to create the data outside of the performance test tool and just read it in at runtime.
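As a rough illustration of generating data outside the tool, here is a minimal sketch in Python (the same idea works in Ruby or any other scripting language). The field names, patterns and search terms are made up for the example; substitute whatever your scripts actually need to read in at runtime.

```python
import csv
import random
import string
import sys

FIELDS = ["login_id", "postal_code", "search_term"]  # hypothetical columns


def generate_members(count, seed=42):
    """Return `count` rows of made-up member data that follow fixed patterns."""
    rng = random.Random(seed)  # seeded so the same data set can be regenerated
    rows = []
    for i in range(count):
        rows.append({
            "login_id": f"member{i:05d}",  # pattern: "member" + 5 digits, unique
            "postal_code": "".join(rng.choices(string.digits, k=5)),  # 5 digits
            # mix exact and wildcard searches (the wildcards are the intensive ones)
            "search_term": rng.choice(["laptop", "lap*", "monitor", "m*"]),
        })
    return rows


def write_csv(rows, handle):
    """Write the rows as CSV to any open text handle (a file or stdout)."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Print a small sample; point this at a file to build the real data set.
    write_csv(generate_members(3), sys.stdout)
```

Seeding the generator is a deliberate choice here: it lets you rebuild exactly the same data set after a destructive test run.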

Once you've identified your data, think about what format you need that data to be in. Will your scripts read the data from XML, text files, spreadsheets, data pools, databases or some other format? Will you have to perform any conversions from one format to another before you can use it? You'll also want to think about how you'll want to manage your data. Will there be multiple versions of it floating around? Will you ever want to merge datasets?

You'll also want to identify any data dependencies. Does data in one area of the application depend upon data in another? Will you have to maintain that dependency in your script? Are certain scenarios destructive to the test data, meaning you're working with a finite set until you repair the data or repopulate it? If you have to clean up your test data, how will you do that?

As you identify your dependencies, make sure you develop the tools and processes you need to support the data going forward. In between your first and second performance test is the wrong time to realize you needed to back up your database before the test if you want to restore it. It may also be the wrong time to contact the database administrator for the first time expecting that type of support.

Code your test scripts
Before you start recording or coding your performance test scripts, take some time to think about what information you want to get from your scripts and what information you want to get from your monitoring tools. For any information you want your scripts to provide, you're probably going to want a timer of some sort.

Many tools put timers on every transaction by default. If your tool does not, then you might consider adding them. You'll also want to think about what transactions you'll want to aggregate across multiple scripts. For example, if you have 10 scripts that each order a book online, not only will you want unique names for the timers in each script, but you may want certain transactions (like when you submit the order) to have the same name in all scripts.

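/* Aggregate timer: uses the same name in every script that submits an order */
lr_start_transaction("Agg_1_0: Submit Order");
/* Scenario timer: uses a name unique to this script */
lr_start_transaction("Scenario_1_0: Submit Order");
/* ... the request(s) that submit the order go here ... */
lr_end_transaction("Scenario_1_0: Submit Order", LR_AUTO);
lr_end_transaction("Agg_1_0: Submit Order", LR_AUTO);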

In the example above, you'll see that we can look at both the individual transaction times as well as the aggregate times across scenarios. If you need aggregate numbers, it's normally better to get the information from aggregate timers than by coming up with the numbers yourself. Many times when we try to aggregate the numbers ourselves we perform operations on the numbers that make the results no longer valid. For example, you can't average ninetieth-percentiles of each scenario to get the ninetieth-percentile for the run. If you want that, you should aggregate your timers. Otherwise you'll need to go back to the raw data and build your aggregate numbers from there.
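To see why averaging percentiles goes wrong, here is a small sketch in Python using made-up response times for two scenarios. It uses the nearest-rank definition of a percentile, which is an assumption; your tool may interpolate differently, but the effect is the same.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical "Submit Order" response times, in seconds
scenario_a = [1.0] * 9 + [2.0]        # 10 samples from a fast, rarely run scenario
scenario_b = [5.0] * 80 + [9.0] * 20  # 100 samples from a slower, busier scenario

p90_a = percentile(scenario_a, 90)                    # 1.0
p90_b = percentile(scenario_b, 90)                    # 9.0
average_of_p90s = (p90_a + p90_b) / 2                 # 5.0
pooled_p90 = percentile(scenario_a + scenario_b, 90)  # 9.0

print(f"average of per-scenario p90s: {average_of_p90s}")
print(f"p90 of the pooled raw data:   {pooled_p90}")
```

The averaged figure (5.0 seconds) understates the true ninetieth percentile of the run (9.0 seconds) because it ignores how many samples each scenario contributed. An aggregate timer avoids the problem by collecting all the samples under one name.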

Once you know what you need to measure in your scripts, you've got one last thing to think about before you start recording. Do you want to document the scripts you create and, if so, how will you do that? Don't get me wrong; I'm not a big fan of documentation for documentation's sake. Most scripts I write don't ever get documented outside of the usage model that references them. However, there are several factors that can get me to document my performance test scripts:

  • If there are several scripts, it can be a lot to keep track of and remember.

  • If the scripts are complex and require a lot of attention to detail during creation, it can be helpful to reference something when maintaining them.

  • If you need to maintain the scripts over a long period of time, it can be nice to have something to jog your memory.

  • And finally, if you have more than one or two people working on the scripts, it can be difficult to coordinate across the team.

When I do decide to document my scripts, I tend to travel light. First, I'll look for existing documentation. Perhaps my performance test script mirrors a functional script that is already documented. If it does, then all I need to do is annotate any changes I made and I'm done. If not, I'll do something simple, like use screenshots of the data I entered and include a paragraph that describes the intent of the scenario. And if I think I'll need to re-record the script often, I might just document the test case in a functional test script using a tool like Watir or Rational Functional Tester. That way, when I need to re-record the script, I can just play back the functional script with the performance test tool recorder turned on.

Once you're ready to start developing the scripts, with most tools you'll begin with a recording. After you record, that's when the real work starts. At a minimum you'll want to go in and add any additional timers and you'll want to update your think times. (Be sure your think times are outside of your timers.) You may also need to add custom correlations for values within the script, or you may need to parameterize variables for data you read in from various data sources.
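The point about keeping think times outside of your timers can be sketched in a tool-neutral way. The Python below is only an illustration: `transaction`, `submit_order` and `run_iteration` are hypothetical names, and the sleeps stand in for a real request and a real user pause.

```python
import time
from contextlib import contextmanager

timings = {}  # transaction name -> list of measured durations, in seconds


@contextmanager
def transaction(name):
    """Time only the work done inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(name, []).append(time.perf_counter() - start)


def submit_order():
    """Hypothetical stand-in for the request that submits the order."""
    time.sleep(0.01)


def run_iteration(think_time=0.05):
    with transaction("Submit Order"):
        submit_order()      # measured: this is the system's response time
    time.sleep(think_time)  # think time: the timer has already stopped


if __name__ == "__main__":
    run_iteration()
    print(timings)
```

If the sleep were moved inside the `with` block, every measurement would include the user's pause and the reported response times would be inflated by exactly the think time.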

For some applications, you may need to go into the code and remove any unneeded or unwanted posts or downloads. Those can include cascading style sheets, JavaScript files, images and other transactions depending on the goals of your tests. For many web-client applications you may also need to add custom function calls or write custom code using a toolkit of some sort.

Regardless of what you have to do, consider creating a scripting checklist: something that reminds you of everything that needs to be done after you record your scripts. It can include some of the scripting tasks mentioned above, or it could include process steps like notifying team members, uploading final files or checking your scripts into source control. Whatever you find you need to do on a regular basis, it can sometimes be helpful to give yourself a reminder. When working in a team, it can also be helpful to track the status of your scripts somewhere that the entire team can access. That way everyone knows what work still needs to be done and where they can jump in to help out.

Calibrate your workload models
Once you get your scripts completed, you'll need to calibrate them to your usage model. Calibration means that you'll want to test and configure them to make sure they are doing what you think they are doing. To do that, you'll want to compare some key metrics from your production environment (or your requirements if you're developing a new product and don't have a production environment yet) to the metrics your performance tests generate, and you'll want to do that often.

For example, if you were testing the application represented in the usage model from part one of this series, you might look at a peak hour of load in production for a given month and discover that in that hour you had 1,000 member logins, 10 vendor logins, 1 administrator login, 15 new member accounts created, 5,000 product searches, 800 purchases, 200 order status checks, etc. Your calibration exercise would then be to get your performance test scripts to target (in a rough approximation) the same number of transactions, for those transactions that your model simulates.
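A calibration check like this can be as simple as a side-by-side comparison. The sketch below uses the peak-hour production counts from the example; the observed counts and the 10% tolerance are invented for illustration, and what counts as "close enough" is a judgment call for your project.

```python
# Peak-hour production counts from the example
production_targets = {
    "member_login": 1000,
    "vendor_login": 10,
    "administrator_login": 1,
    "new_member_account": 15,
    "product_search": 5000,
    "purchase": 800,
    "order_status_check": 200,
}

# Hypothetical counts from one hour of a calibration run
observed_counts = {
    "member_login": 1040,
    "vendor_login": 10,
    "administrator_login": 1,
    "new_member_account": 12,
    "product_search": 3900,
    "purchase": 1150,
    "order_status_check": 195,
}


def calibration_report(targets, observed, tolerance=0.10):
    """Return {transaction: (target, observed, relative_error, within_tolerance)}."""
    report = {}
    for name, target in targets.items():
        actual = observed.get(name, 0)  # a transaction the test never ran counts as 0
        error = (actual - target) / target
        report[name] = (target, actual, error, abs(error) <= tolerance)
    return report


for name, (target, actual, error, ok) in calibration_report(
        production_targets, observed_counts).items():
    status = "ok" if ok else "ADJUST"
    print(f"{name:20s} target={target:5d} observed={actual:5d} error={error:+.0%} {status}")
```

In this made-up run, purchases are about 44% too high and product searches about 22% too low, so those are the scripts to look at first.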

Keep in mind that production users do every type of transaction possible, and many times your performance tests will do only a small subset of all the possible transaction types. That means you'll most likely never be able to get things dialed in perfectly when you calibrate, so don't try.

What you want to do is build a story of what your performance test does. That story might sound like this: "We have a performance test that simulates X users. Those users are executing these Y scenarios. When we baselined those users running those scenarios, we looked at these Z metrics to ensure that our model was configured correctly." X is the number of users you baselined with. Y is the number of scenarios in your usage model. And Z is the number of metrics you use to calibrate your scripts.

Here are some common metrics I use to calibrate my performance test scripts:

  • Transactions over a period of time: Examples might include logins/logouts per minute, form or file submissions rates per minute, reports generated per minute, web service calls per second, searches per second, etc. The idea is that you are looking to roughly approximate load (as determined by transactions) with your test.

  • Concurrent live and active sessions: I happen to do a lot of web testing, so sessions can be a big deal. Looking at the number of concurrent live and active sessions generated by my load test and comparing that to the production environment can give me an idea of whether or not I've got the right number of users in the test at a given period of time or if I've got the right amount of user session abandonment.

  • Percent connection-pool utilization: This is a specific example of a general metric. For any finite resource that might be important to your system, look at how that resource is utilized over your run and compare that to your target numbers. For example, if the production environment never uses more than 60% of its available connections, but your tests get utilization of up to 90%, you might need to adjust your tests. Other things you might look at include CPU utilization, memory utilization and average queue depth.

Once you know how your test compares to your targets, you get to start making changes to your workload model. There are many ways you can make those changes. Calibration is an exploratory process. You'll make a change, rerun your test, see where you're at and make more changes. Over time, you'll get close enough to your targets that you'll feel comfortable running your tests.

There are two outcomes of the calibration process. One is hopefully a more accurate performance workload. The other is a tester who has an in-depth working knowledge of the scripts that are being used in the testing. Calibration forces you to understand the effect that each individual script has when run as part of the overall test. That knowledge is invaluable when it comes time to isolate issues and debug problems.

Some common changes and tweaks you can make to your workload while calibrating include the following:

Changing your think times and delays: The first thing most performance testers look at when calibrating is their virtual user think times. There's good reason for that: they play a large role in determining load. Lengthening and shortening think times and delays is a good way to make and control small changes in your calibration. As long as the corrective action you're trying to make is minor, this might be the right place to look. If you need to double the number of transactions, this might still be the right place to make the change, but you might also look at some of the other possible changes below first.
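To get a feel for how much a think-time change moves the load, a back-of-the-envelope estimate can help. The sketch below assumes a simple closed workload: a fixed number of virtual users, each completing one transaction per cycle of response time plus think time, with no additional pacing. It also assumes response time stays the same as load changes, which real systems rarely honor, so treat the output as a starting point for the next calibration run and not as a prediction.

```python
def transactions_per_hour(users, response_time_s, think_time_s):
    """Throughput of a closed workload: each user completes one cycle at a time."""
    return users * 3600 / (response_time_s + think_time_s)


def think_time_for_target(users, response_time_s, target_per_hour):
    """Think time (seconds) needed for `users` to generate `target_per_hour`."""
    think = users * 3600 / target_per_hour - response_time_s
    if think < 0:
        raise ValueError("target is not reachable with this many users; add users")
    return think


# Hypothetical numbers: 100 users, 2 s response time, 43 s think time
current = transactions_per_hour(100, 2, 43)             # 8000.0 per hour
needed = think_time_for_target(100, 2, 2 * current)     # 20.5 seconds
print(f"current load: {current:.0f}/hour; think time to double it: {needed} s")
```

Note that doubling the load here means cutting think time from 43 to 20.5 seconds, more than half, because the response time portion of each cycle does not shrink. When the required think time becomes unrealistically short or negative, that is a sign to change the number of users instead.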

Changing the number of users in the test or changing usage model percentages: Sometimes we are wrong about how many users are actually in the system when we plan our testing. There have been times when I've been calibrating and the numbers just aren't working out. Then someone on the team says, "Oh, gosh! You know what? I forgot that there's a web service that system X calls that also causes these transactions to occur. We forgot to model that." That means we just picked up a new user type and more virtual users.

The reverse can also be true. Sometimes you're doing too much or doing too much of the wrong thing. You can try to dial the load down or someone might discover that you have too many users representing a specific user type. Shifting the user percentages from one type to another (from our website example in part one, from New User to Administrator) can have a large impact on the load generated.

Changing your user ramp-up time: How your users enter the test can be very important, depending on the application you're testing. Many times, you won't actively monitor performance until all your virtual users are in the system. That's when performance testing "officially" begins. But if your system has a memory of what happened for the 10 to 60 minutes leading up to when all your users are in the system, then ramp-up time can be critically important. Make sure you understand what impact ramp-up has on your system and try a couple of different ramp-up scenarios to test your assumptions.

Changing the number of iterations and time between iterations: In many tools, your virtual users are assigned a script to run and then they run that script for a specified number of iterations before returning to the virtual tester "pool" for their next script assignment. Changing the number of times a virtual user iterates on a script, or the time between those iterations, can be similar to changing the think times and/or the percentages in your usage model. Make sure you understand how those settings (for each script and user group) affect the overall workload generated.

Changing the test scenarios or test data: Sometimes you just can't get your existing scripts to do what you want them to do. Many times that can mean you simplified the problem too much when you modeled it. It happens. There's nothing wrong with going back and asking the question, "Is this really what our users are doing when they do this action?"

For example, imagine you're testing Dell.com where customers can configure their own laptops and place an order. You might have a script (or a number of scripts) that have a customer navigate to the site, select a base model, configure the laptop to their specification, and order the laptop. As you try to calibrate your workload, you find that you're getting more orders than you wanted. The only way you can think to solve this problem is to add 10-minute think times between pages for those scripts, but you know the users aren't doing that.

What you might discover when you go back to the data and take a closer look is that the more likely scenario is that a customer navigates to the site and selects a base model. Then they go back after viewing the base model and select a different model. They configure the laptop to their specification, see the price, and go back and reconfigure the laptop. They repeat that process until the price is right and then order the laptop. By changing the scenario, you find that you now get the number of transactions correct, and you're also representing the additional load that users place on the system while making those changes. The 10-minute think times would not have modeled that additional load accurately.

The more you calibrate, the better you get at it and the easier it gets. The biggest challenge of calibration is knowing when your workload is good enough. You can calibrate for weeks, each time getting closer, but at some point you hit diminishing returns. That point is different for each test, project, application and system. Calibration is also a very collaborative effort. You'll probably have a lot of communication with the programmers, DBAs, the requirements team, etc. Making sure everyone knows you're calibrating and not testing might help things go more smoothly.

At the end of this phase
At the end of this phase you should have an initial set of test assets that should allow you to start your performance testing. You may not have everything you need to complete your performance testing, but many times you won't know what else you're going to need until you get in there and run some tests.

Performance testing is an activity that I've found requires a lot of rework if you try to plan for everything up front. You'll need to be constantly prioritizing the next test to run, creating any new test assets as you move along. If you've done a good job of collecting your initial data and you have everything you need to create your scripts, you should be able to adapt rather quickly.

Here is a possible summary of some of the work products from the build phase:

  • Performance test scenarios (documents or automated scripts)
  • Checklists to support ongoing maintenance and script development
  • Performance test data (spreadsheets, text files, databases or services that can be called dynamically)
  • Test environments scheduled and/or configured for performance testing
  • Test and monitoring tools installed and configured
  • Sets of performance scripts (code) calibrated to specific usage models
  • Updated documents from the assessment phase (strategy, diagrams, usage models and test ideas lists)

If you were in a contract-heavy or highly formalized project environment, you would now be ready for the first iteration of test execution and the first round of results reporting. If you were in a more agile environment, you most likely have already run many of the initial tests and workload models and could already be starting to work through issues with the team.

The important point to remember is that this phase is about resourcing: obtaining tools and information to support your effort. To some extent, as a performance tester you're always resourcing (a fundamental aspect of testing is resourcing); this phase just focuses more on acquiring and building what you need and less on execution and providing information.

Summary of references for the build-out phase:

  • Testing for performance, part 1: Assess the problem space by Michael Kelly: This is the first article in this series, focused on developing an understanding of the problem you're trying to solve.

  • Developing an approach to performance testing by Scott Barber: This article outlines nine heuristics for thinking about the performance testing problem.

  • Using IBM Rational Functional Tester to test financial services applications by Michael Kelly: This article has a section on selecting data to use that applies to performance testing as well as functional automation.

  • Prioritizing software testing on little time and Model Workloads for Performance Testing: FIBLOTS by Scott Barber: These two posts outline the FIBLOTS heuristic for workload modeling.

About the author: Mike Kelly is currently a software development manager for a Fortune 100 company. Mike also writes and speaks about topics in software testing. He is currently the president for the Association for Software Testing and is a co-founder of the Indianapolis Workshops on Software Testing, a series of ongoing meetings on topics in software testing, and a co-host of the Workshop on Open Certification for Software Testers. You can find most of his articles and his blog on his Web site www.MichaelDKelly.com.
