Field of the Invention
The present invention relates to data profiling following extraction, transformation, and loading (ETL) processing for a source and target system, more particularly, it applies to data profiling within an enterprise service bus (ESB) coupling the source and target systems.
Description of the Related Art
Extract, transform, and load (ETL) is a data movement process between systems, in many cases between a relational source and target databases. A very common usage is the introduction of a new data warehousing environment where data is extracted from operational source systems, transformed to a common model, cleansed, and then loaded into the relational database of the data warehouse for the initial population of the data warehouse. There are many other use cases beyond the example provided, including use in the introduction of a new application in the IT environment.
The first part of an ETL process involves extracting the data from the source systems into a persistency for further processing. The persistency can be flat files, but are typically relational databases. The persistency has typically three major areas: (1) a staging area (STG) modeled after the sources, (2) an alignment area (ALG), which is typically modeled as closely as possible to the target, but might have some changes to be able to accommodate all records from all sources, and (3) a preload area (PLD), which is modeled exactly after the target.
Data profiling can be applied in the STG area, which has a sub-area per source system. If data profiling is applied to the STG area, the data profiling results would be per source system. For example, frequency distribution statistics in such a case would only count the distribution of values per source system. If the data profiling is applied to ALG, the frequency distribution for values would lead to statistics across all records for all sources. In both cases, data profiling analyzes value distributions within a column (=an attribute) as well as patterns in the data. Furthermore, data profiling can check if the values found comply with certain constraints and business rules often across several columns—this type of data profiling is known as semantic profiling. Note that data profiling is IO-intensive, which is why it is typically not applied on the source systems directly, but rather on the persistency used by the ETL process.
Once the data quality issues are identified through data profiling, the transforming operations are applied, including data cleansing logic and, finally, to transform the data into the data model of the preload area. From the preload area, the load step of the ETL process is performed by loading the data into the target system. Data cleansing logic is a subset of the transformation operations and is used to correct the incorrect data identified during data profiling.
There are two key objectives in the transform step. The first objective is to condition the data in such a way, that the data is accepted by the data model and processing routines on the load interface used in the target system. The second objective is to fix all data quality issues of data coming from the source system before loading it into the target system so that no business process or purpose of the target system is negatively affected by low quality data.
In conventional scenarios, like with new application introduction, the data moved from the source to the target system is no longer monitored after go-live of the application with data profiling. Thus, even if the new application or the data warehouse might initially have had clean data, the data quality might degrade over time without being noticed, because in the near real-time or real-time integration with an enterprise service bus (ESB) connecting the source and target systems after go-live no data profiling is applied. In addition, ETL projects can be extremely labor intensive and time consuming. Performing an ETL process often to fix data quality issues, though, is not realistic, because the ETL process is very costly and causes an unacceptable amount of downtime for both the source and target systems. The net result is that data profiling is not deployed on a regular basis so that changes in source applications after the data is moved to the target systems through the ESB can cause data quality issues that remain undetected.