As testers, we often have a love-hate relationship with data. Processing data is our applications' main reason for being, and without data, we cannot test. Yet data is often the root cause of testing issues: we don't always have the data we need, which blocks test cases and causes defects to be returned as "data issues."
Data has grown exponentially over the last few years and continues to grow. We began testing with megabytes and gigabytes, and now terabytes and petabytes (PB) have joined the data landscape. Data is now the elephant in the room, and where is it leading us? Welcome to the brave new world of big data testing.
What is big data?
Big data has many definitions; the term is often used to describe both volume and process. Sometimes, big data refers to the approaches and tools used for processing large amounts of data. Wikipedia defines it as "an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications." Gartner defines big data as "high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation." Big data usually refers to at least 5 PB (5,000,000,000 MB).
However, big data is more than just size. Its most significant aspects are the four "V's." Big data obviously has huge volume, the sheer amount of data; it has velocity, the speed at which new data is generated and transported; variety, which refers to the many types of data; and, finally, veracity, which is its accuracy and quality.
Testers, can you see some -- make that many -- test scenarios here? Yes, big data means big testing. In addition to ensuring data quality, we need to make sure that our applications can effectively process this much data. However, before we can plan our big data testing, we need to learn more about the brave new world of big data.
Big data is usually unstructured, which means that it does not have a defined data model. It does not fit neatly into organized columns and rows. Although much of the unstructured big data comes from social media -- such as Facebook posts and tweets -- it can also take audio and visual forms. These include phone calls, instant messages, voicemails, pictures, videos, PDFs, geospatial data and slide shares. So it seems our big testing SUT (system under test) is actually a giant jellyfish.
Challenges of big data testing
Big data testing is like testing a jellyfish: Because of the sheer amount of data and its unstructured nature, the test process is difficult to define. Automation is required, and although there are many tools, they are complex and require technical skills for troubleshooting. Performance testing is also exceedingly complex given the velocity at which the data is processed.
Testing the jellyfish
At the highest level, the big data testing approach involves both functional and nonfunctional components. Functional testing includes validating both the quality of the data itself and the processing of it. Test scenarios in data quality include completeness, correctness, lack of duplication, and more. Data processing can be done in three ways: interactive, real-time and batch; however, they all involve movement of data. Therefore, all big data testing strategies are based on the extract, transform and load (ETL) process. It begins by validating data quality coming from the source databases, validating the transformation or process through which the data is structured and validating the load into the data warehouse.
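The data quality scenarios above can be sketched in code. The following is a minimal, illustrative example, assuming records are dictionaries keyed by a hypothetical "id" field; a real big data pipeline would run such checks with distributed tooling rather than in-memory lists, but the test scenarios -- completeness, lack of duplication, correctness -- are the same.

```python
# Sketch of source-to-staging data quality checks. The record layout and
# the "id" key are illustrative assumptions, not a standard schema.

def validate_staging(source_rows, staged_rows, key="id"):
    """Return a list of data quality issues found in the staged data."""
    issues = []

    # Completeness: every source record should arrive in staging.
    if len(staged_rows) != len(source_rows):
        issues.append(
            f"row count mismatch: {len(source_rows)} source vs {len(staged_rows)} staged"
        )

    # Duplication: staged keys must be unique.
    staged_keys = [row[key] for row in staged_rows]
    if len(staged_keys) != len(set(staged_keys)):
        issues.append("duplicate keys found in staging")

    # Correctness: field-by-field comparison on matching keys.
    source_by_key = {row[key]: row for row in source_rows}
    for row in staged_rows:
        expected = source_by_key.get(row[key])
        if expected is not None and expected != row:
            issues.append(f"record {row[key]} differs from source")

    return issues

source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 20.0}]
staged = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.0}]
print(validate_staging(source, staged))  # flags record 2 as differing
```

An empty result list means the staged data passed all three checks; in practice each issue would be logged with enough detail to trace the offending records back to the source system.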
ETL testing has three phases. The first phase is data staging. Data staging is validated by comparing the data coming from the source systems to the data in the staged location. The next phase is MapReduce validation, or validation of the transformation of the data. MapReduce is the programming model for processing unstructured data; probably the best-known implementation is in Hadoop. This testing ensures that the business rules used to aggregate and segregate the data are working properly. The final ETL phase is the output validation phase, where the output files from the MapReduce step are ready to be moved to the data warehouse. In this stage, testers verify that the data integrity and the transformation are complete and correct. ETL testing, especially at the speed required for big data, requires automation, and luckily there are tools for each phase of the ETL process. Among the best known are MongoDB, Cassandra, Hadoop and Hive.
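The MapReduce validation phase can be illustrated with a toy in-process example. The business rule here -- total sales per region -- and the record layout are hypothetical; a real test would run the job on a Hadoop cluster and compare its output files against independently computed expected values, but the principle of checking aggregation against the business rule is the same.

```python
# Toy MapReduce-style aggregation check. The "region"/"sale" fields and
# the totals-per-region rule are illustrative assumptions.

from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Emit (key, value) pairs, mirroring a mapper.
    return [(r["region"], r["sale"]) for r in records]

def reduce_phase(pairs):
    # Group by key and sum the values, mirroring a reducer.
    pairs = sorted(pairs, key=itemgetter(0))
    return {k: sum(v for _, v in grp) for k, grp in groupby(pairs, key=itemgetter(0))}

records = [
    {"region": "east", "sale": 100},
    {"region": "west", "sale": 50},
    {"region": "east", "sale": 25},
]

actual = reduce_phase(map_phase(records))
expected = {"east": 125, "west": 50}  # computed independently from the business rule
assert actual == expected
print(actual)  # {'east': 125, 'west': 50}
```

The key testing idea is that the expected totals are derived independently of the job under test, so a defect in the aggregation logic cannot silently cancel out in the comparison.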
Do you want to be a big data tester?
Testers, if you have a technical background, especially in Java, big data testing may be for you. You already have strong analytical skills, but you will need to become proficient in Hadoop and other big data tools. Big data is a fast-growing technology, and testers with this skill set are in demand. Why not take the challenge? Be brave and embrace the brave new world of big data testing.