Your organization may already have a good handle on the high-level risks for cloud computing and even identified which risks belong to the business, which to operations, and which to the test group. Today we'll talk about tactics -- how your test approach should change to address those risks, along with guidance for how and what to test specifically.
New deployment risks
Latency, bandwidth, and the process of scaling out. You probably don't know exactly where your application is, but you do know that it's someplace different. You need to understand how far away the system is and what kind of bandwidth is between the system and the users. If you have concerns that your network might "get clogged," you can put policies or software in place to check availability and notify operations if the numbers are bad. You might also want to go home and run those tools at night, not just from within your organization behind the firewall. On a related note, when the system's getting busy and it's time to add additional computing power, that takes time -- a few minutes, usually. Until that new system comes up, performance is likely to be bad. If traffic comes in bursts, that can cause performance problems, as the cloud hosting company needs time to (a) recognize the need and (b) increase capacity with new virtual servers.
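One way to picture the "check availability and notify operations" idea is a small probe that runs a check several times and decides whether the numbers are bad. This is a minimal sketch, not a monitoring product: the names `probe` and `needs_alert` and the default thresholds are assumptions made for illustration, and `check` stands in for whatever request reaches your own application.

```python
import time
from statistics import median

def probe(check, attempts=5):
    """Call `check` several times; return (availability, median latency in seconds).

    `check` is any zero-argument callable that raises an exception on failure.
    """
    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            check()
        except Exception:
            continue  # a failed attempt counts against availability
        latencies.append(time.perf_counter() - start)
    availability = len(latencies) / attempts
    median_latency = median(latencies) if latencies else None
    return availability, median_latency

def needs_alert(availability, median_latency, min_availability=0.8, max_latency=1.0):
    """Return True when operations should be notified (thresholds are hypothetical)."""
    if availability < min_availability:
        return True
    return median_latency is None or median_latency > max_latency
```

In practice `check` might wrap something like `urllib.request.urlopen(url, timeout=5)` against a URL of your choosing. Running the same probe from home at night, as well as from behind the firewall, gives you numbers to compare.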
Build, deploy, rollback. All of the classic problems with deploying code still exist in the cloud, and you still need a rock solid upgrade (and rollback) strategy, which means you need to be testing that upgrade and the rollback. Further, since a move to the cloud often coincides with scaling out the system, you'll need to make sure the strategy covers rolling upgrades, potentially increased upgrade durations, and management of a multi-version configuration. If upgrading one server takes five minutes, how long does it take to upgrade ten servers? How about one hundred servers? And what happens if that nightly database backup starts in the middle of the upgrade?
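The upgrade-duration question can be answered with back-of-the-envelope arithmetic. The sketch below assumes that servers are upgraded in sequential batches and that every server in a batch is upgraded in parallel. The function name and the `batch_size` parameter are made up for this example.

```python
import math

def rolling_upgrade_minutes(servers, minutes_per_server, batch_size=1):
    """Estimate the wall-clock time of a rolling upgrade.

    Assumes batches run one after another and servers within a batch
    upgrade in parallel, so each batch costs `minutes_per_server`.
    """
    batches = math.ceil(servers / batch_size)
    return batches * minutes_per_server

for count in (1, 10, 100):
    print(count, "servers:", rolling_upgrade_minutes(count, 5), "minutes")
```

Upgrading one server at a time, ten servers take 50 minutes and one hundred take 500, which is more than eight hours and long enough to collide with a nightly backup. Upgrading ten at a time brings one hundred servers back down to 50 minutes, at the cost of having more capacity offline at once.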
Patches, server upgrades, and downtime. What kind of downtime should you expect from your cloud compute infrastructure, and how can you handle both planned and unplanned downtime? Downtime is a problem in data centers as well as in the cloud. To estimate overall expected downtime, combine your own expected downtime with the expected downtime from the cloud vendor. Don't forget to include your new estimates for upgrade and deployment!
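Here is a rough sketch of that combination. It assumes your outages and the vendor's outages are independent, and that the application is down whenever either one is down. Under those assumptions the availabilities multiply. The function names and the figures in the note below are hypothetical, not vendor numbers.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year

def combined_availability(*availabilities):
    """Multiply availabilities of components that must all be up at once.

    Assumes failures are independent of one another.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result

def expected_downtime_hours(availability, hours=HOURS_PER_YEAR):
    """Convert an availability fraction into expected hours of downtime."""
    return (1.0 - availability) * hours
```

For example, an application that is up 99.9% of the time on a platform that is up 99.95% of the time has a combined availability of about 99.85%, or roughly 13 hours of unplanned downtime a year. Planned upgrade and deployment windows come on top of that.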
Monitoring tools. Consider the number of servers, number of requests and average response time. You'll likely have some requirements for performance management, regardless of how those requirements are expressed. Don't forget to figure out what monitoring tools your cloud provider or operations team has in place and account for those in your selection and use of monitoring software. Oh, and test with that software -- after all, it might be important to know that your monitoring software requires a certain port to be open on the firewall.
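The firewall question is cheap to check before anyone depends on the answer. The sketch below tries a TCP connection and reports whether it succeeded. The helper name `port_is_open` is an assumption for this example, and the host and port would come from your monitoring tool's documentation.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable or unknown hosts.
        return False
```

Run it from wherever the monitoring agent will live, since a port that is open from your desk may be closed from inside the cloud network.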
Shifting architecture and 'soft' points
With a traditional application we might have one physical box per tier: a database server, an application server, a Web server. With cloud computing, it's more likely that each of these functions is distributed across different servers. Now we have two database servers, four application servers, six web servers and a load balancer. We can handle a lot more traffic, but what problems did we introduce along the way?
Synchronization. If the users are physically distributed, some east-coast, some west-coast, we might have them log in to different servers, each with a local copy of the database -- perhaps only with local data, perhaps with a full copy of the datastore. You may work with one company with a European Web-cache for one customer set that has to synchronize between Australian offices and offices in the United Kingdom. What synchronizes? How frequently? What's kept separate, and how does that affect overall application reporting? Too much synchronization wastes system resources; too little, and all of a sudden you will get phone calls from users because they are seeing conflicts and errors.
Interrupts, race conditions and deadlock. Most multi-user applications are at risk for intermittent problems. Race conditions or deadlocks happen when multiple users (or systems) try to modify the same thing at the same time. These conditions are difficult to test for because they are rare; trying something a few times isn't likely to expose the problem. Your test plan had better accommodate the identification of these kinds of conditions through architecture analysis, code inspection, and the use of rapidly repeated tests to exercise vulnerable areas of the system. As latencies and scale increase, the window of vulnerability for many of these operations increases, and simple luck no longer suffices as a detection technique.
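A "rapidly repeated test" can be as simple as many threads performing the same operation at the same moment, followed by a check on an invariant. The sketch below uses a toy `Account` class and a `hammer` helper, both invented for illustration. In a real system the action would be a call into your application.

```python
import threading

class Account:
    """Toy shared resource; the lock makes each deposit atomic."""

    def __init__(self):
        self.balance = 0
        self._lock = threading.Lock()

    def deposit(self, amount):
        with self._lock:
            current = self.balance
            self.balance = current + amount

def hammer(action, threads=8, repeats=1000):
    """Run `action` `repeats` times from each of `threads` threads, all starting together."""
    barrier = threading.Barrier(threads)

    def worker():
        barrier.wait()  # line the threads up so their calls overlap as much as possible
        for _ in range(repeats):
            action()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()

account = Account()
hammer(lambda: account.deposit(1))
print("final balance:", account.balance, "expected:", 8 * 1000)
```

With the lock in place the invariant holds on every run. Without it, the read-then-write in `deposit` can lose updates, but only intermittently, which is why the test needs many repetitions and may still pass on a given run.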
Hooks and scaffolding
Going from theory to practice might leave you with more questions, such as "How can I automate the test for deadlock?" or "How can I monitor system response in real time?" Those are the exact right questions to ask. Without knowledge of your exact application, we can't tell you how to perform those tests -- but we can give some advice.
Applications need hooks and scaffolding to allow testers to drive them. One fancy term for this is design for testability. Put differently, there are all sorts of hidden requirements your test team may have for the software.
Automated testing for deadlock and monitoring responses as you ramp up load should be easy. You should be able to write a few trivial scripts in a simple scripting language, or go to a Web page or enter a command to see response statistics. The problem is that the feature doesn't exist yet.
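To make the idea of a response-statistics hook concrete, here is one possible shape for it: a small thread-safe collector that the application feeds on every request and that a status page or command could read. The class name `ResponseStats` and its fields are assumptions for illustration, not a description of any particular product.

```python
import threading

class ResponseStats:
    """Collects response times so a status page or command can report them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0
        self._total = 0.0
        self._worst = 0.0

    def record(self, seconds):
        """Called by the application once per handled request."""
        with self._lock:
            self._count += 1
            self._total += seconds
            self._worst = max(self._worst, seconds)

    def summary(self):
        """Return request count, average and worst response time in seconds."""
        with self._lock:
            average = self._total / self._count if self._count else 0.0
            return {"requests": self._count, "average": average, "worst": self._worst}
```

Once a hook like this exists, watching response times while a load test ramps up becomes a matter of polling `summary()` rather than writing code around the application.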
Testability isn't just a nice idea; it's a set of real features for your application that can provide real value for the business. When you find you have to spend more time writing code than testing, it's likely the system itself is missing some of these hooks, and you are 'coding around them.'
Our suggestion is to write these up as formal feature requests, for the person holding the purse strings to trade-off against new features. It's possible the business chooses not to fund those features, but, in that case, when people ask why testing is taking so long, or why testing can't answer certain questions about risk, well, you have an answer: the business decided answering those questions wasn't worth the investment.
From risks to testing
There are many ways to institutionalize tests. You might write automation to check conditions -- for load testing, you likely will have to have some automation running. You might explore the software, or find some tests so valuable that you document and run them before every release. Exactly how you institutionalize the tests is up to your team. Just like any other testing, you'll need to identify how the system behaves functionally, how it performs, and how it continues to work as various nefarious things happen. All that's different about the cloud is how you go about doing those things. How do you, for example, test for synchronization issues? Better make sure your test environment synchronizes and isn't just on one box any more. How do you test scale out issues? Create tests to expose them. For every risk, the test team should be able to identify a way to see if that risk is realized.
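For synchronization, one common test shape is to write to one copy of the data and then poll another copy until the change shows up or a deadline passes. The sketch below stands in for that with two in-memory dictionaries and a timer that copies data after a delay. The `wait_until` helper, the fake replicas, and the timings are all assumptions made for illustration.

```python
import threading
import time

def wait_until(predicate, timeout=5.0, interval=0.05):
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())  # one last check at the deadline

# Stand-ins for two database copies; a real test would talk to real servers.
primary, replica = {}, {}
primary["order-1"] = "shipped"

# Simulate replication lag: the replica catches up 0.1 seconds later.
threading.Timer(0.1, lambda: replica.update(primary)).start()

synced = wait_until(lambda: replica.get("order-1") == "shipped", timeout=2.0)
print("replica caught up:", synced)
```

The timeout is where the requirement lives: if the business expects offices to see each other's changes within a minute, that is the number the test should enforce.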
Putting it all together
In this article, we created a census of technical risks for cloud computing. Most of the risks are similar to the risks for an internal server farm, and that's part of the trick -- internal farms are basically private clouds at a lower level of abstraction and with different architectural pressures. Cloud architectures encourage us to scale out, rather than the more traditional methods of scaling up. Instead of a few very large machines and some redundancy, we have many smaller systems and a lot more redundancy. This introduces a variety of new risks; we introduced them and covered some ways to explore and check for those risks above.
Testing for cloud architectures introduces new challenges to debugging, to reproducing problems, and to predicting performance.
It's going to be hard work; it's going to be challenging. And it's going to be a lot of fun.
About the authors:
Matthew Heusser is a Software Process naturalist and consulting software tester. A contributing editor for Software Test and Quality Assurance Magazine and the senior editor for the How to Reduce the Cost of Software Testing book project, Matt has spent his entire professional life developing, testing and managing software projects. Read more from Matt on his popular blog Creative Chaos or follow him on Twitter as @Mheusser.
Catherine Powell has held numerous roles in startups and mid-size companies over the last ten years. Most recently, she was Director of Engineering and Infrastructure Services for Permabit, an enterprise storage system and deduplication company, and provided consulting and advisory services for Abakas. Catherine is also a noted author and speaker on Agile software practices, software testing and technical leadership.