Globalization of local businesses has resulted in a fierce competition between organizations to capture market share in a constrained global market. Companies today rely on various available data to make analytical and calculated business decisions required to maintain and increase its presence in the market. The data created/employed by a company or its branches is collectively stored at a central data warehouse. The data warehouse is a repository which stores the entire collection of current and historical data related to a company. In order to perform analysis of complex business decisions, the data stored at the data warehouse is tested for any erroneous datum using various data testing techniques. The error free data helps in making right business decisions critical for an organization.
Further, the data stored at the data warehouse is extracted from one or more databases, which are located at various branches of the organization. To efficiently store the extracted data, at the data warehouse, a process called Extract, Transfer, and Load (ETL) is used. The ETL defines a three stage process of extracting data from the one or more sources, transforming the data and subsequently storing it at the data warehouse. The extracted data is transformed using a series of transformation rules and functions to derive the required data based on the functional requirements of a business. Thereafter, the transformed data is loaded in the data warehouse for further usage. To maintain the sanity of data at the data warehouse a data quality-accuracy test is applied on the data stored at the data warehouse.
Presently, the data quality-accuracy test employs at least one of a data quality analysis test to check the attributes of the data stored at a data warehouse and a data comparison test to check the accuracy of the data stored at a data warehouse. The data quality analysis test is performed by comparing various functional attributes, such as definition of columns of respective data, patterns of respective data and so forth, of the data warehouse data (DWH data) and the sample data (data generated to perform the test). The DWH data is the data stored at the data warehouse, which in turn is derived by ETL process (as explained earlier). The sample data is derived from the one or more sources/databases after applying the same transformation logic that was used while storing data in the data warehouse (DWH data).
The data comparison test involves comparison of the DWH data to the sample data for a specific test case. For both the testing techniques i.e. data comparison test and data quality analysis test, a tester creates specific test cases. Further, the tester is required to develop specific scripts to extract and thereafter compare respective values, which is an expensive methodology in terms of both cost and effort. In addition, the entire data set stored at the data warehouse cannot be efficiently verified because of the huge amount of data (stored at the data warehouse).
Moreover, the tester is required to possess the knowledge of various databases, query languages, and scripting languages, to write scripts to design and execute respective test cases for performing testing of data at the data warehouse. Furthermore, due to the lack of any centralized testing system, the tester has to develop individual test cases for every data set from scratch, which results in wastage of time and effort.
Additionally, the tester utilizes various spreadsheets, such as Microsoft (MS) Excel and Perl Scripts, to compare the DWH data with the sample data while performing data quality-accuracy analysis which is quite tedious. In case, there are more than 65,536 rows of data to be compared, the tester has to write complex scripts for implementing the ETL comparison which in turn results in increase in effort and time required from the specialized tester.
In light of the abovementioned disadvantages, there is a need for a framework which provides an interface to develop and perform data quality analysis testing and data comparison testing at a data warehouse with minimal manual intervention and increased efficiency.