

Tools and methods for testing big data applications

Data is more valuable than gold for businesses, so QA teams must test big data applications and their output thoroughly and accurately.

As testers, we rarely test data at all, except to ensure data quality. So, when it comes to testing big data, it can be hard to figure out where to start.

Generally, big data refers to data that exceeds the storage and processing capacity of traditional databases. More than that, it typically involves the collection of large amounts of disparate information on customers, transactions, site visits, network performance and more. Organizations must store all that data -- perhaps for a long time.

The term big data also implies that such data is used for a business purpose. IT professionals rarely have to search big data stores for a particular value or field. Rather, big data feeds analytics, which typically expresses results in statistical terms, such as trends, likelihoods or distributions -- not logs. Big data applications are all about analytics, not data queries.

Testers must make sure data collection is a smooth, unencumbered process as big data testing becomes integral to enterprise app quality.

Big data is supported in IT organizations with a combination of inexpensive storage, different types of databases, and powerful -- and readily available -- computing (see "Open the tool chest" below).

Testing big data applications

Testers don't generally test the data itself. But to correctly test a big data application, they need an underlying knowledge of the database type and data architecture, as well as how to access that database. It's unlikely that testers will use live data, so they must maintain their own test environment version of the database and enough data to make tests realistic.
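As a minimal sketch of how a team might populate such a test environment without touching live data, the Python snippet below generates reproducible synthetic order records. The field names, categories and distribution parameters are hypothetical and chosen only for illustration; a real test data set should mirror the schema and value ranges of the production store.

```python
import random

def synthetic_orders(n, seed=0):
    """Return n made-up order records; the same seed always yields the same data."""
    rng = random.Random(seed)  # seeded so every test run sees identical data
    categories = ["books", "garden", "toys"]  # hypothetical product categories
    return [
        {
            # Roughly ten orders per customer on average (an assumption)
            "customer_id": rng.randrange(1, n // 10 + 2),
            "category": rng.choice(categories),
            # Log-normal amounts: many small orders, a few large ones
            "amount": round(rng.lognormvariate(3.0, 0.8), 2),
        }
        for _ in range(n)
    ]

orders = synthetic_orders(1000, seed=7)
print(len(orders), orders[0])
```

Seeding the generator keeps test runs repeatable, which matters when the expected result is a statistic rather than a single field value.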

The applications that rely on analytical output are not all the same. Rather than query the database for a specific result, a user is more likely to run statistical and sensitivity analyses. This means that the correct output -- the answer -- depends on distributions, probabilities or time-series trends; it's impossible to know answers ahead of time, because they're often trends and complicated calculations, not simple fields in a database. And those answers won't be obviously correct or incorrect. All of which creates a lack of certainty for testers designing test cases and analyzing results.

Find ways to test

Despite these challenges, it is possible to test big data applications that rely on analytics, but testers' expectations must change. First, testers should look at a collection of data as a resource to determine statistical results, rather than just something to query. This means testers need to use more than just a sampling of big data. The good news is you don't have to worry about obfuscating data, because you aren't querying it.

Open the tool chest

Many types of tools support big data applications, including options for storage, processing and querying. Here are a few commonly used tools:

  • Apache Hadoop, an open source framework, offers distributed storage and processing. Its Hadoop Distributed File System stores data across multiple machines, while Hadoop MapReduce provides parallel processing for queries.
  • Apache also offers Hive, an open source data warehouse system, against which data scientists and developers can make queries with a SQL-type language. Pig Latin, the scripting language of Apache Pig, helps teams analyze large data sets. It can handle complex data structures and data held in NoSQL stores, which often contain unstructured data.
  • Many departments use Microsoft Excel for big data analysis. They can visualize data with the Power View feature, which supports data imported from Hadoop. Also, Microsoft's Azure HDInsight supports multiple open source frameworks, including Hadoop, to enable queries on big data stored in the Azure cloud.
  • The maturation of NoSQL databases, such as MongoDB and Couchbase, makes it possible to mine big data for analytics more effectively. And specialized databases can accommodate specific uses, such as in-memory databases for high performance and time-series databases for data trends over time.

When testing big data setups, throw out test case conventions. You won't usually look for a specific and known answer. Instead, you'll look for a statistical result, so the test cases have to reflect that. For example, if you test big data collected by a retail website, you must design test cases that enable the team to draw inferences on buying potential from all the information regarding customers, their searches, products added to carts, abandonments and purchase histories.
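One way to express a test case that expects a statistical result, sketched here in Python, is to assert that a computed rate falls inside a tolerance band rather than equals a fixed value. The simulated sessions, the 70% abandonment rate and the four-standard-error band are all assumptions made for illustration, not figures from any real retail site.

```python
import random

def simulate_sessions(n, abandon_rate, seed):
    """Hypothetical stand-in for retail data: True means the cart was abandoned."""
    rng = random.Random(seed)
    return [rng.random() < abandon_rate for _ in range(n)]

def abandonment_rate(sessions):
    """Fraction of sessions that ended with an abandoned cart."""
    return sum(sessions) / len(sessions)

def within_tolerance(observed, expected, n, z=4.0):
    """True if the observed proportion lies within z standard errors of expected."""
    std_err = (expected * (1 - expected) / n) ** 0.5
    return abs(observed - expected) <= z * std_err

sessions = simulate_sessions(100_000, abandon_rate=0.7, seed=42)
observed = abandonment_rate(sessions)
print(within_tolerance(observed, expected=0.7, n=len(sessions)))
```

The width of the band is a design choice: a narrow band catches small regressions but fails more often by chance, while a wide band does the opposite.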

Last, you must not evaluate test results by their correctness alone, as there is no easy way to determine that. You might have to break the problem into smaller pieces and analyze the test results from each piece. Use technical skills and problem-solving creativity to determine how to interpret test results. For example, when you test big data collected from social media, you might need to examine each separate social media channel to make sure displayed advertisements correspond to user buying behavior.
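The idea of breaking the problem into pieces can be sketched as a per-channel check. The Python example below assumes a hypothetical record layout of (channel, category of the displayed ad, category the user buys most) and a made-up 50% threshold; it computes how often ads match buying behavior in each channel and lists the channels that fall short.

```python
from collections import defaultdict

def match_rate_by_channel(records):
    """Share of ads per channel whose category matches the user's buying category."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for channel, ad_category, purchase_category in records:
        totals[channel] += 1
        if ad_category == purchase_category:
            matches[channel] += 1
    return {channel: matches[channel] / totals[channel] for channel in totals}

def channels_below(rates, threshold):
    """Channels whose match rate is under the threshold, sorted by name."""
    return sorted(ch for ch, rate in rates.items() if rate < threshold)

# Hypothetical sample: (channel, ad category shown, user's top purchase category)
records = [
    ("channel_a", "toys", "toys"),
    ("channel_a", "books", "books"),
    ("channel_a", "garden", "garden"),
    ("channel_a", "toys", "books"),
    ("channel_b", "toys", "books"),
    ("channel_b", "garden", "books"),
    ("channel_b", "books", "books"),
    ("channel_b", "toys", "garden"),
]

rates = match_rate_by_channel(records)
print(rates)                       # {'channel_a': 0.75, 'channel_b': 0.25}
print(channels_below(rates, 0.5))  # ['channel_b']
```

Analyzing each channel separately narrows down where a mismatch originates, which a single aggregate figure would hide.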
