The present invention relates to a method for identifying reroutable data columns in an ETL process and a method for processing data columns in an ETL process. The present invention further relates to a computer program product and a system for executing the above methods.
ETL processes are processes for the Extraction, Transformation and Loading (ETL) of data. Such ETL processes are commonly used in data integration applications to integrate data from at least two different sources. Such data integration applications are used, e.g., to integrate data of different companies in case of mergers and acquisitions, or when data from different departments inside one company has to be integrated.
ETL processes comprise a set of stages, which operate on data. The stages are connected via links, so that the entire ETL process forms a data flow graph from a source stage, where the data is loaded, to a target stage, where the data is stored. In between, the data can be processed, combined, passed through or transformed in any way. The data itself is represented in records, which contain a set of different columns as individual data elements. Individual steps of the processing of the data from the source to the target can be performed in a parallel implementation, e.g. on multiple processors of a single hardware platform or on multiple hardware platforms, which are connected by a network connection. Such distributed execution can be implemented nowadays by design tools for ETL processes, e.g. IBM® InfoSphere® DataStage®, which can automatically implement parallel processing without a user-provided configuration. (IBM, InfoSphere and DataStage are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide.)
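For illustration only, the stage-and-link structure described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular ETL tool: the names `Stage`, `link_to`, `run` and the example columns `id`, `name` and `comment` are hypothetical, and a record is assumed to be a simple mapping from column names to values.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: one record is a mapping from column name to value.
Record = Dict[str, object]


@dataclass
class Stage:
    """Hypothetical stage: a named operation applied to each record."""
    name: str
    transform: Callable[[Record], Record]
    downstream: List["Stage"] = field(default_factory=list)

    def link_to(self, other: "Stage") -> "Stage":
        """Create a link from this stage to another and return the other stage."""
        self.downstream.append(other)
        return other


def run(stage: Stage, records: List[Record], sink: List[Record]) -> None:
    """Push records through the data flow graph starting at `stage`."""
    out = [stage.transform(dict(r)) for r in records]
    if not stage.downstream:
        # A stage without outgoing links acts as the target stage.
        sink.extend(out)
    for nxt in stage.downstream:
        run(nxt, out, sink)


# Source stage -> one transforming stage -> target stage.
source = Stage("source", lambda r: r)
upper = Stage("uppercase_name", lambda r: {**r, "name": str(r["name"]).upper()})
target = Stage("target", lambda r: r)
source.link_to(upper).link_to(target)

stored: List[Record] = []
run(source, [{"id": 1, "name": "ada", "comment": "n/a"}], stored)
print(stored)
```

In this sketch, the columns `id` and `comment` are passed through every stage unchanged, while only `name` is transformed.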
In ETL processes, storage of sources and targets as well as the processing of data can be distributed. A source database can be distributed over different network locations. The processing of the source data can be separated into a set of stages, where each stage may be performed individually on a single hardware system. A target database may be distributed over different physical systems, which differ from the systems of the source database. A single stage of the processing can also be implemented in parallel to be executed on multiple cores or processors of a single computer or of multiple computers.
ETL processes are generally applied to very large amounts of data and therefore consume large amounts of resources. These resources include processing power, memory or storage capacity and, in the case of distributed implementations of the ETL process, network resources. Data that is unnecessarily transported from one stage to another, although it is not used in the particular stage, is referred to as bulk data having respective bulk columns.
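The notion of bulk columns can be illustrated with a short sketch. It only restates the definition given above and is not the identification method of the invention. The function name `bulk_columns`, the stage names and the column names are hypothetical, and it is assumed that each stage declares the set of columns it actually uses and that every stage receives the full record.

```python
from typing import Dict, List, Set


def bulk_columns(
    record_columns: Set[str],
    stages: List[str],
    used_by: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """For each stage, return the columns transported to it but not used by it."""
    return {s: record_columns - used_by.get(s, set()) for s in stages}


# Assumed example: records with three columns flowing through two stages.
columns = {"id", "name", "comment"}
stages = ["filter_by_id", "uppercase_name"]
used_by = {"filter_by_id": {"id"}, "uppercase_name": {"name"}}

per_stage = bulk_columns(columns, stages, used_by)

# Columns that no stage uses are bulk along the entire path.
bulk_everywhere = set.intersection(*per_stage.values())
```

Here `comment` is a bulk column in both stages, so it is transported through the whole process without being used, whereas `id` and `name` are each bulk in only one of the two stages.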