
Are you ready for big data testing?

Data is everywhere and that means it's time to sharpen your big data testing skills. Expert Gerie Owen explains what you'll need to know to embrace this brave new world.

As testers, we often have a love-hate relationship with data. Processing data is our applications' main reason for being, and without data, we cannot test. Yet data is often the root cause of testing issues; we don't always have the data we need, which causes blocked test cases, and defects get returned as "data issues."

Data has grown exponentially over the last few years and continues to grow. We began testing with megabytes and gigabytes, and now terabytes and petabytes (PB) have joined the data landscape. Data is now the elephant in the room, and where is it leading us? Welcome to the brave new world of big data testing.

What is big data? 

Big data has lots of definitions; the term is often used to describe both volume and process. Sometimes it refers to the approaches and tools used for processing large amounts of data. Wikipedia defines it as "an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications." Gartner defines big data as "high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." Big data usually refers to at least 5 PB (5,000,000,000 MB).

However, big data is more than just size. Its most significant aspects are the four "V's." Big data obviously has huge volume, the sheer amount of data; it has velocity, the speed at which new data is generated and transported; variety, which refers to the many types of data; and, finally, veracity, which is its accuracy and quality.

Testers, can you see some -- make that many -- test scenarios here? Yes, big data means big testing. In addition to ensuring data quality, we need to make sure that our applications can effectively process this much data. However, before we can plan our big data testing, we need to learn more about the brave new world of big data.

Big data is usually unstructured, which means it does not have a defined data model. It does not fit neatly into organized columns and rows. Although much of this unstructured data comes from social media -- such as Facebook posts and tweets -- it can also take audio and visual forms. These include phone calls, instant messages, voicemails, pictures, videos, PDFs, geospatial data and slide shares. So it seems our big data testing SUT (system under test) is actually a giant jellyfish.

Challenges of big data testing

Big data testing is like testing a jellyfish: Because of the sheer amount of data and its unstructured nature, the test process is difficult to define. Automation is required, and although there are many tools, they are complex and require technical skills for troubleshooting. Performance testing is also exceedingly complex given the velocity at which the data is processed.

Testing the jellyfish

At the highest level, the big data testing approach involves both functional and nonfunctional components. Functional testing includes validating both the quality of the data itself and the processing of it. Test scenarios in data quality include completeness, correctness, lack of duplication and more. Data processing can be done in three ways -- interactive, real-time and batch -- but all of them involve movement of data. Therefore, big data testing strategies are based on the extract, transform and load (ETL) process. Testing begins by validating the quality of the data coming from the source databases, then validating the transformation or process through which the data is structured, and finally validating the load into the data warehouse.
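To make those data quality scenarios concrete, here is a minimal sketch in Python with pandas of the kind of checks a tester might automate against a staged extract. The file names (source_customers.csv, staged_customers.csv) and the column names are hypothetical placeholders, not part of any particular tool or project.

# Minimal data-quality sketch. File and column names below are hypothetical
# stand-ins for whatever extracts your own ETL process produces.
import pandas as pd

source = pd.read_csv("source_customers.csv")
staged = pd.read_csv("staged_customers.csv")

# Completeness: every source record should arrive in the staging area.
assert len(staged) == len(source), "row counts differ between source and staging"

# Lack of duplication: the business key should be unique after staging.
assert not staged["customer_id"].duplicated().any(), "duplicate customer_id values found"

# Correctness: required fields should not be null, and values should be in range.
assert staged["email"].notna().all(), "null emails in staged data"
assert staged["account_balance"].ge(0).all(), "negative balances in staged data"

In practice, checks like these would run inside a test framework and against database queries rather than flat files, but the scenarios -- completeness, uniqueness, correctness -- are the same.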

ETL testing has three phases. The first is data staging, which is validated by comparing the data coming from the source systems to the data in the staged location. The next phase is MapReduce validation, or validation of the transformation of the data. MapReduce is the programming model for processing unstructured data; probably the best-known implementation is in Hadoop. This testing ensures that the business rules used to aggregate and segregate the data are working properly. The final phase is output validation, in which the output files from MapReduce are checked before they are moved to the data warehouse. At this stage, we confirm that data integrity has been preserved and that the transformation is complete and correct. ETL testing, especially at the speed required for big data, demands automation, and luckily there are tools for each phase of the process. Among the best known are MongoDB, Cassandra, Hadoop and Hive.
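To illustrate what the MapReduce validation phase is checking, here is a toy sketch in plain Python. A production job would run on Hadoop or a similar framework; this only shows the shape of the validation: run the business rule through a map step and a reduce step, then compare the output with an independently computed oracle. The record fields and the sum-of-sales-per-region rule are made-up examples.

# Toy MapReduce-style aggregation and validation in plain Python.
# The records and the "sum sales per region" rule are hypothetical examples.
from collections import defaultdict

records = [
    {"region": "east", "sale": 120.0},
    {"region": "west", "sale": 75.5},
    {"region": "east", "sale": 30.0},
]

# Map phase: emit (key, value) pairs for the aggregation rule.
mapped = [(r["region"], r["sale"]) for r in records]

# Reduce phase: aggregate values by key.
reduced = defaultdict(float)
for region, sale in mapped:
    reduced[region] += sale

# Validation: compare the job's output with an independently computed oracle.
expected = {"east": 150.0, "west": 75.5}
assert dict(reduced) == expected, f"aggregation mismatch: {dict(reduced)} != {expected}"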

Do you want to be a big data tester?

Testers, if you have a technical background, especially in Java, big data testing may be for you. You already have strong analytical skills, but you will need to become proficient in Hadoop and other big data tools. Big data is a fast-growing technology, and testers with this skill set are in demand. Why not take the challenge? Be brave and embrace the brave new world of big data testing.

Next Steps

Why big data isn't the only tech testers need to worry about

How to make the right career choices around big data

Hey testers -- are you ready for what’s ahead?

This was last published in March 2016


Join the conversation


Strange. Fault injection isn't even mentioned as part of the strategy. In my experience, divide and conquer has always been the main approach in data testing.

How is your company getting ready for big data testing?

Now that many companies have jumped on the "business intelligence" bandwagon, many purchase ETL tools and start playing. Unfortunately, the focus is mostly on tooling and SQL. "Testing" is biased toward confirming that "it works" rather than finding out how it doesn't.

Albert, I completely agree with your statement. As a consultant, I often find that companies are willing to take the risk and move into big data and BI tools, but they forget the real business objective that drives the company's revenue. It is great to purchase ETL tools and play with data, but data (especially from unstructured sources) can distract from the main purpose of taking this step and from the strategy that was originally researched.
