On a business intelligence (BI) project, multiple data sources are often used to bring data into a data warehouse. Two core testing questions around the process are: How do we know we loaded all the data, and how do we know we loaded the data accurately?
This article is part two of in a two-part series focused on the testing challenges associated with data loading and data reconciliation as related to BI projects. In the first part (insert link here), the challenges of working with multiple data sources were reviewed. In part two, we look closely at ways to approach testing.
Our imaginary context is based on a BI project where 234 data fields are being brought into a data warehouse from seven different data sources. We began with a review of one data field and four different data sources, just to get a feel for the types of issues this scenario can create. Let's stay focused on one data field as a specific example as we begin to model our test approach.
Data field: Vendor ID
Suppose the vendor ID field has preceding zeros in the Accounts Payable system and that these preceding zeroes will be stripped off before the data is loaded.
These are some of the distinct tests you may identify.
- Check the field length of the vendor IDs that are loaded from Accounts Payable.
- Investigate to find out if there should be any exceptions to this rule meaning, will any vendor IDs maintain a preceding zero. The purpose of this is to look for situations where the data creates an exception to the business rules.
- Investigate to find out if there will there be any duplicate values once the preceding zeros are stripped with vendor IDs that are being loaded from other sources.
- If the original vendor ID will also be stored in the warehouse, check the original values are stored intact in whatever table or column this information is being loaded to. Consider setting a sample size before loading and check a specific data set after loading.
This is only one data entry field but you begin to realize the magnitude of the testing effort. You begin to identify concerns such as the ones outlined for each of the fields being loaded into the warehouse. You determine where to focus testing resources by reviewing risk areas – just as you would with an application – data has risk and each risk establishes a sense of priority or "pecking order" that guides where time, effort and resources will be spent. Determining a sample size might be the next step in moving the testing from vast and unmanageable to something achievable and quantifiable.
We might also come to realize by reviewing some of the existing production data, that although the data is varied – it might not "naturally" include or represent some of the challenging data we want to see in our testing. It might mean that we build a set of data, dummy test data that can present those challenges.
On one project, the project owner thought there would be enough data "variety" and "challenges" just working with the production data. But after a review of some of the data, we discovered this was not the case. Building re-usable test data that pushed to and beyond the boundaries became important and a decision was made to make that investment of time.
In addition to challenge data, you may want to test error handling and assign someone to specifically review the error logs after or during test runs of data loading. You may choose to select or create data such that the data will fail specific criteria and be prevented from being loaded into the warehouse in order to purposely test error handling - then watch and check the information trapped in the error logs. Going back to a specific data example, you might find a vendor record that somehow does not have a vendor ID, the ETL (extract, transform and load) job might intentionally be set to flag such a record and log an error.
You can work closely with the developers to identify these conditions and coordinate testing to check error handling and logging. When you uncover the specific data rules, you'll be able to map out those specific tests needs.
If data is intentionally being left out of the warehouse because it does not "pass" specific criteria, this is another reason why pure SQL counts of the data before (in the original source system) and the data after (in the warehouse) may not match. A logical reaction might be to run the math: if 10,000 rows exist in the source and 9,800 rows are loaded, can you point to the 200 exception cases in the error log and therefore balance or reconcile the data as all being accounted for? Maybe. Maybe isn't a great answer but its true – the error handling is no more tested or verified than the data loading process so at this particular point – if a discrepancy is found – you won't know which is at fault or why. If the tally doesn't total accurately, you won't know whether the issue was the data loading process or the error handling process. After multiple projects, I've seen the same conclusion reached – balancing or reconciling SQL counts does not provide the gauge of accuracy you might suspect.
Planning robust challenge data and error handling are two areas that bring tremendous comfort to project owners and stakeholders. Nearly everyone on the team no matter their level of technical expertise can readily understand these terms and gain faith in the data warehouse with the knowledge that these areas have been tested thoroughly.
During the original loading of data, SQL counts might seem like the rock solid choice but once data is being updated, ETL jobs are running day and night from a variety of data sources, SQL counts can no longer be so readily achieved or provide a sufficient sense that the data is moving and accounted for. By testing data at and beyond specific business rules and having tested error handling is more likely to provide that ongoing sense of assurance.
Data loading and data reconciliation testing is one of the least glamorous types of testing and one of the least understood forms of testing. It's all about the data, right? And deep understanding and testing of data bores most people – unless the data is years' worth of some of the most important company information, the type of data that analysts can use to draw conclusions and help steer the business – true business intelligence – then the data become interesting to many people, especially the executives. But the tasks of test data loading and data reconciliation has no magic silver bullet, it is nit-picking detail oriented task, it requires a deep understanding of the data and the business rules around the data.
Testing is achievable, but it begins one data field at a time and like other types of testing, having a sense of where the risks and priorities lie helps provide direction. By planning the testing needs of the most critical and thorniest data fields, the remaining planning and execution – the true "how are we going to do this" -- gets resolved.