A typical modern enterprise has a plurality of applications used to handle day-to-day operations of the business. These applications are often referred to as On-Line Transaction Processing (OLTP) applications. There is often need to relate the data stored in multiple data stores and report on it. A Data Integration project typically involves collecting data residing in multiple sources (e.g. databases, flat files) and making it available for users in the form of reports or applications. Data Warehouse is one of the most popular Data Integration architecture that involves extracting data from different sources, transforming it and loading (ETL) it into a target data store. Data Warehouses are also typically referred to as On-Line Analytical Processing (OLAP) applications.
Business intelligence projects make use of OLAP tools (e.g. Oracle Business Intelligence Enterprise Edition®, SAP Business Objects®, IBM Cognos®) to build analytical applications with data federated from multiple data sources, typically a data warehouse. The OLAP tools provide a layer of abstraction that allows the user to perform analysis without knowing the underlying details of the database tables. The developer of the analytical application creates and modifies the metadata model of the OLAP tool based on requirements for report data elements. This OLAP tool metadata model specifies the definition of reporting data elements using the data structures available in the physical data stores as well as the relations between them. The query engine of OLAP tools uses the metadata model to automatically generate the queries used to retrieve data from physical data sources and provide user friendly reporting views based on user request.
Implementation of a data integration and business intelligence project is generally complex in nature due to the multiple data sources involved. Data quality and performance are often the most important aspects determining the success of these projects. Testing is critical for the success of data integration projects so that users can trust the quality of the information they access. While most aspects of the design of data integration and business intelligence systems have received considerable attention, existing techniques do not provide testing of data integration and business intelligence systems.
Testing of data integration projects is a difficult task due to the multiple data sources involved as well as the large volumes of data. Testing is typically done by running queries against the source and target data sources and manually comparing counts. Running and comparing of data for large number of test cases is a tedious process when done manually. Limited automation has been achieved in some cases by developing custom scripts to compare the counts of data between different data sources or comparing query output to benchmark (baseline) data stored in flat files. This type of automation often requires coding and is limited to testing of smaller samples of data. Testing is further complicated because ETL process can run in incremental mode or full mode. When ETL runs in full mode, it refreshes or reloads all the data in the data warehouse. During incremental mode, only the data changes occurring since the last ETL run is loaded. The differences in the incremental and full load mode of the ETL process often make it necessary to create separate sets of test cases for these two modes.
Testing of Business Intelligence projects is also very challenging due to the abstraction provided by the OLAP tool metadata model. The primary advantage of a business intelligence system is that it provides capability for the report analyst to create and run new reports using report data elements supported by the metadata model. But from a testing standpoint this is a huge problem because the scenarios for report requests are not fixed and testing all the scenarios is often impractical due to the large permutation and combination of possible reporting queries. As the business intelligence application matures, it becomes very difficult to assess the impact of the changes in the metadata model on existing reports because of the effort involved in testing all the reports. Most of the current forms of testing involve manual spot checking by running a few reports and comparing it to the data queried from the physical data sources (e.g. databases). Often there is very little or no regression testing performed for any changes being made for these applications since there is no easy way to automate this testing. The impact on existing reports can vary from throwing errors, showing wrong data or bad performance which is usually detected once the change goes to production. This leads to lack of trust from the users on the data shown in the reports by the business intelligence applications.
As agreed by persons skilled in the relevant art(s), there are several differences when it comes to testing of OLTP and OLAP applications: the focus of OLTP application testing is on software code while OLAP application testing is directed at the validation of the correctness of data; the volume of data involved in OLAP application testing is typically very large when compared to volume of data involved in the testing of OLTP applications; data integration projects present different set of challenges for testing of full and incremental loads; performance testing of data integration projects presents different set of challenges including the need for large volumes of test data when compared to OLTP applications; and the number of use cases for OLTP applications are finite while the test scenarios for regression and performance testing of OLAP applications can be virtually unlimited.
Plurality of test tools (e.g. the HP® Quality Center) are available for testing OLTP applications that allow record and replay of UI based transactions as well as testing web services (e.g. SoapUI). Limited automation has been achieved by using these test tools for testing of analytical applications by using them to record and play running of a report from the UI but these tools provide limited support if any to compare the report data or the database query generated by the business intelligence tool. Prior art does not provide any simple and automated way to compare the data samples dynamically retrieved from data sources or allow these data samples to be compared with the samples from previous test runs for large volumes of data.
FIG. 1 is a prior art high-level flowchart view of an exemplary Data Warehouse (OLAP) project which is one of the most common types of data integration project. Reporting requirements 103 typically form the input for the design and development of the ETL (or ELT) 101 and OLAP metadata 105 in a data warehousing project. Data is extracted from multiple sources 100 and loaded into a data warehouse 104 using an ETL (or ELT) process 102. An OLAP tool 106 is generally used to report 107 on the data in the data warehouse. Unlike regular reporting tools, OLAP tools have a query engine that generates the query dynamically based on the report request and the metadata model defined.
Testing of the ETL (or ELT) metadata 101 is typically done by querying existing data in source data store 100 or by entering some sample test data 108 using the application interface of the source data store 100 and Verifying Data Loaded 109 after it is extracted, transformed and loaded correctly into the Target Data Store 102. Since the ETL (or ELT) process 102 can run in incremental mode where only the data that changed since the last ETL run gets extracted, transformed and loaded or full mode where the tables in the target data store 104 is truncated and reloaded with all the data from the source data store 100 testing will need to be done for the two modes separately.
Testing of the OLAP tool metadata 105 can involve large number of reports since the tool allows for potentially infinite number of reports to be created dynamically by the user in the production system. Typically, testing is done by Verifying Report Data and SQL Query Generated 110 for a core set of reports 107 is same as expected by querying the target data store 104. The query generated dynamically by the OLAP query engine 106 can also be verified to make sure that the right query is getting generated for these core reports every time there is a change to the OLAP tool metadata model.
Integration testing can be done by verifying that the data displayed or queried from the source data store 100 matches with the data shown in the reports 107.
However there are several challenges in running these tests. The data volumes involved in the testing are typically very high and there can potentially be multiple source data stores 100 of different types from which data gets loaded into the target data stores 104. The potentially large number of reports that can be created using the OLAP tool presents a testing problem by itself. Comparing the data from the reports of an analytical application with that of the source or target data stores can also be a challenge since they are produced in different formats in different software programs. Verifying that the query generated by the OLAP Query Engine 106 has not changed due to an OLAP metadata 105 change is also a very tedious process due to the large number of reports as well as the complex queries being generated by the OLAP query engine 106.
As such, there exists a need in the art for an automated data extraction and validation process overcoming the limitations illustrated in FIG. 1 above.