Data loading into a data warehouse and its related task of data reconciliation is a meaty problem that needs to be clearly understood before a test strategy can be built. This tip takes a close look at the problems and possible test approaches to data loading and data reconciliation, testing often needed on Business Intelligence (BI) projects.
Understanding a problem clearly is important; when we don't understand a problem our approaches and solutions may be misguided. In software testing, we might be pressured to have our solutions in the ready. Sometimes we need to explain to our team members and executives that we need more time to research before we can propose a test strategy. We need to answer questions like: What does it mean to load data? What is data reconciliation? And what's so difficult about either of these tasks?
Let's step through the problem carefully. BI projects pull together data scattered throughout the organization to build a warehouse; once the warehouse is assembled, a wide variety of analytical and reporting tools are used to build more comprehensive analysis with the data. BI projects use the term, "ETL" which stands for extract, transform and load; these are the industry terms for pulling data out from a variety of sources, cleaning and/or conforming the data and finally loading the data.
Two core testing questions around the ETL process are often:
- How do we know we loaded all the data?
- How do we know we loaded the data accurately?
A quick solution might seem that we could execute a few SQL queries, make sure the number of records or the count of data before, matches the count after the data has been loaded – then we know all the data has been loaded, right? Maybe add a few SQL queries to spot check the actual data and we're "good to go" straight to production use, right?
A closer look at ETL reveals the potential for difficulties in answering our two core questions let alone answering additional questions like how the data warehouse handles updates over time. A closer look at the problem shows our original solutions may have to be tossed out as we see more clearly what problems are introduced as data is brought in from multiple sources.
Let's scale down from a high level concept into specific example. Imagine a company has used a variety of software applications, for our example let's focus on three applications: Accounts Payable, Supplier Management and Inventory Management. Suppose each of these applications maintain their own vendor information. Your BI project might want to load vendor information from all three of these applications to have one central place to look up and analyze vendor or supply data.
Imagine each vendor is assigned an ID in the Accounts Payable application and that the vendor name field can be as long as 100 characters and that there is no verification of the vendor names to any other software in the company. This might mean the data has no standardization on words such as Inc or Incorporated, Company or Co. or Corp and that the vendor names might not match customer records from elsewhere in the company.
At a more technical level, the data type that stores the vendor names can vary from one application to another – making the data in a warehouse merge together as well as oil and water. There can be several ways data from source systems doesn't align well together. Naming standardization and data type mismatches are two but there's also the handling of data updates, and de-duplicates processes to consider.
As you move onto review the vendor information stored in the Supplier Management software, you might find vendor names are never longer than 80 characters and that vendor IDs do not match with the vendor IDs assigned in Accounts Payable.
Move onto the third data source, the Inventory Management software and you might find the vendor list is maintained by Nancy Sorenson in purchasing and that she updates the list as often as needed and then uploads her spreadsheet to the Inventory Management system to keep the vendor list up to date. You might find the records she maintains don't correlate to either Accounts Payable or Supplier Management. Nancy checks for duplicates on a regular basis and has her own naming conventions for handling words such as The, Incorporated and Company just to highlight a few basic examples. In fact, the office rumor is that Nancy's vendor list is the best source and most accurate source in the company, and that might be true. Is the data source now the Inventory Management System or Nancy's spreadsheet? Questions like this arise as the data sources are identified and where the original true source is found.
Now you have three different data sources for one bit of information, vendor names and vendor IDs and those three sources don't match to each other. Each of these three data sources has its own update process, a different means for preventing or handling duplicate entries and possibly different rules for creating new vendor records.
As the BI project continues, you might find that sales person Stephen Downs maintains yet another vendor list, he maintains the international vendors list and he doesn't usually work with anyone else in the company including Accounts Payable. You've not only gained a forth data source but may have gained international names that accommodate different language including accent marks known as diacriticals.
The theory that you'll be able to load data, run a few SQL counts and maybe spot check a few records becomes a distant memory and seems like a quaint solution. What now?
Don't despair, and don't think right now about the fact that there may be 234 more data fields moving into the data warehouse that will be pulled from seven more data sources. Stay focused on understanding the problems and nuances of each data source before driving towards a solution.
Data field by data field and data source by data source has to be understood before a solution can be proposed. Each data source and possibly each data field may require a different strategy. Start with an inventory of all the data fields that will be moved into the warehouse. What data is being loaded? Where is it being loaded to specific database, table and column? And where is it being loaded from? These questions boil the problem space down to: what are we working with?
If you're a visual person, then the problem can be mentally viewed as a data map. Visually drawing arrows if needed to picture where data is coming from and where data will be loaded. Consider figure 1.
If you're more of a word person, then a data dictionary might be all the visual understanding you need. A data dictionary on a BI project is typically created as a spreadsheet that identifies each data field being brought into the data warehouse and where that data will be loaded inside the warehouse. In addition to the field mapping, the data dictionary identifies field length, data type information and other business rules about the data field.
The field by field review is the first step in understanding the data. And since data is a living source, meaning data is updated on a regular basis, consider the ongoing update process as the next step in planning.
Sometimes the best way to know you understand the problem isn't having an answer but instead lies in having better questions formulated about the problem. Having walked through the example of one data entry field and thinking about your own BI project, consider what you know as you review the following questions:
- What data sources are we including?
- What data fields are we loading into the warehouse?
- What are the business rules around each data field?
- Are there discrepancies in the business rules between the data sources?
With these questions clearly mapped out in your mind, visually or verbally, you can begin to build solutions for testing data loading and data reconciliation for your BI project.