The present invention relates to data warehousing, and in particular to ETL (Extract, Transform, Load) transformations for loading data into a data warehouse.
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
The data generated by the different organizations (e.g., marketing, manufacturing, sales, etc.) comprising an enterprise are typically stored in a data repository commonly referred to as a data warehouse. The data handling processes that populate the data warehouse include: exporting the data from the operational data sources in each organization (e.g., marketing analyses, manufacturing inventory databases, sales databases, customer relationship management databases, etc.); transforming the exported data into the format of the target tables of the data warehouse; and loading the transformed data into the data warehouse. The category of tools responsible for this task is generally referred to as Extraction Transformation Loading (ETL) tools. The functionality of ETL tools can be coarsely grouped into the following tasks: (a) the identification of relevant information at the source side; (b) the extraction of this information; (c) the customization and integration of the information coming from multiple sources into a common format; (d) the cleaning of the resulting data set on the basis of database and business rules; and (e) the propagation of the data to the data warehouse, a data mart, and/or the like.
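By way of illustration only, tasks (a) through (e) above can be sketched in a few lines of Python. The sketch is not drawn from any particular ETL tool: the source records, the field names, the cleaning rule, and the function names (`extract`, `transform_sales`, `transform_marketing`, `clean`, `load`, `run_etl`) are all hypothetical, and an in-memory list stands in for the data warehouse.

```python
from typing import Dict, Iterable, List

# Hypothetical operational sources; names and fields are illustrative only.
SALES_ROWS = [
    {"cust": "A-1", "amount_usd": "19.99"},
    {"cust": "A-2", "amount_usd": "n/a"},  # rejected by the cleaning rule below
]
MARKETING_ROWS = [
    {"customer_id": "a-1", "spend": 5.0},
]


def extract(source: Iterable[Dict]) -> List[Dict]:
    """Tasks (a) and (b): identify and copy the relevant source records."""
    return [dict(row) for row in source]


def transform_sales(row: Dict) -> Dict:
    """Task (c): map a sales record onto the assumed common format."""
    return {"customer_id": row["cust"].upper(),
            "amount": row["amount_usd"],
            "origin": "sales"}


def transform_marketing(row: Dict) -> Dict:
    """Task (c): map a marketing record onto the assumed common format."""
    return {"customer_id": row["customer_id"].upper(),
            "amount": row["spend"],
            "origin": "marketing"}


def clean(rows: Iterable[Dict]) -> List[Dict]:
    """Task (d): keep only rows whose amount parses as a number
    (an assumed business rule)."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # drop rows that violate the rule
        cleaned.append({**row, "amount": amount})
    return cleaned


def load(rows: List[Dict], warehouse: List[Dict]) -> int:
    """Task (e): append the cleaned rows to the target and report the count."""
    warehouse.extend(rows)
    return len(rows)


def run_etl(warehouse: List[Dict]) -> int:
    """Run extraction, transformation, cleaning, and loading in order."""
    staged = [transform_sales(r) for r in extract(SALES_ROWS)]
    staged += [transform_marketing(r) for r in extract(MARKETING_ROWS)]
    return load(clean(staged), warehouse)
```

Calling `run_etl` with an empty list loads the rows that pass the cleaning rule and returns how many were loaded; a production ETL process would read from and write to actual databases rather than in-memory lists.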
Defining the proper data transformations is an important aspect of populating the data warehouse. An enterprise may require numerous ETL processes to input and process data from myriad data sources and to load the resulting output data. The typical workflow for developing an ETL process includes defining the source data (which may comprise multiple sources of data), specifying one or more data transformations to massage, analyze, or otherwise transform the data, and loading the resulting transformed data into one or more targets (e.g., data warehouse, data mart, and so on). The results of the transformation can then be analyzed. If the results of a particular ETL process are incorrect or otherwise unacceptable, a user may need to modify the extraction process and/or the constituent data transformations, and run the process again. In any significant enterprise, a large volume of data is extracted from the various data sources and transformed, and so an ETL process can take on the order of hours to complete. Accordingly, fine-tuning or debugging the ETL process can be a time-consuming effort. The problem can be exacerbated in a situation where the ETL process must be completed within a certain window of time.
These and other issues are addressed by embodiments of the present invention, individually and collectively.