The present invention relates generally to the field of data warehousing, and more particularly to shipping of data during an extract, transform, and load (ETL) operation, specifically selective shipping of data directly to a stage that requires the data, by bypassing intermediary stages.
Data warehouses are typically populated through a process known as extract, transform, and load (ETL). An ETL job is a sequence of processes called stages. Each of the stages processes records in a database that may contain multiple columns of data. Data records at a stage are typically received from the previous stage, called the upstream stage. After processing the column data, each stage passes the processed column data to the next stage, called the downstream stage. ETL operations thus require high-speed data movement through several stages for completion of the process. ETL operations can include several processes, such as merging data from various sources, cleaning data, copying data, transforming data, validating data quality, optimizing data, managing master data, managing metadata, etc. Each of these processes may further include sub-processes, for example, summary, aggregation, filtering, and splitting from one resource to multiple destinations, or vice versa.
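By way of illustration only, the upstream-to-downstream flow described above can be sketched as a chain of functions, each receiving records from the previous stage and handing its output to the next. The stage names (`clean`, `filter_active`, `uppercase_region`), the column names, and the `run_job` helper are hypothetical and are not taken from any particular ETL product.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[List[Record]], List[Record]]


def clean(records: List[Record]) -> List[Record]:
    # Hypothetical stage: trims the "name" column; other columns pass through.
    return [{**r, "name": str(r["name"]).strip()} for r in records]


def filter_active(records: List[Record]) -> List[Record]:
    # Hypothetical stage: keeps only records whose "active" column is true.
    return [r for r in records if r["active"]]


def uppercase_region(records: List[Record]) -> List[Record]:
    # Hypothetical stage: normalizes the "region" column to upper case.
    return [{**r, "region": str(r["region"]).upper()} for r in records]


def run_job(records: List[Record], stages: List[Stage]) -> List[Record]:
    # Each stage receives the full output of its upstream stage and
    # passes its own full output to the downstream stage.
    for stage in stages:
        records = stage(records)
    return records


source = [
    {"name": " Ada ", "region": "emea", "active": True},
    {"name": "Bob", "region": "apac", "active": False},
]
result = run_job(source, [clean, filter_active, uppercase_region])
print(result)
```

In this sketch the "region" column is carried through `clean` and `filter_active` even though only `uppercase_region` modifies it, which mirrors the stage-by-stage movement of column data described above.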
During an ETL operation, or process, data undergoes various transformations. Broadly, the extract phase is a process for receiving data from various sources. The extracted source data is typically stored as one or more relational database tables. The transform phase in the ETL process is typically made up of several stages and includes converting data formats and merging extracted source data to create data in a format suitable for the target data repository. The load phase of the ETL process includes depositing the transformed data into the target data repository. When the data repository is a relational database, the load process is often accomplished with structured query language (SQL) commands or other SQL tools. Thus, the ETL operation requires manipulation of column data via a sequence of stages/processes/steps. Because data has to be transmitted, or shipped, through each of the intermediate stages, and several processes are involved in the completion of the operation, ETL processes may be very time consuming.
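As a minimal sketch of the three phases, assuming a relational source and target, the example below uses Python's standard `sqlite3` module with an in-memory database. The table names (`source_orders`, `warehouse_orders`), the columns, and the conversion from cents to a decimal amount are hypothetical and chosen only to illustrate the extract, transform, and load phases.

```python
import sqlite3


def extract(conn):
    # Extract phase: read source rows from a relational table.
    return conn.execute(
        "SELECT id, amount_cents, currency FROM source_orders ORDER BY id"
    ).fetchall()


def transform(rows):
    # Transform phase: convert formats to suit the target repository
    # (cents to a decimal amount, currency code to upper case).
    return [(row_id, cents / 100.0, cur.upper()) for row_id, cents, cur in rows]


def load(conn, rows):
    # Load phase: deposit the transformed rows using SQL commands.
    conn.executemany(
        "INSERT INTO warehouse_orders (id, amount, currency) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_orders (id INTEGER, amount_cents INTEGER, currency TEXT)"
)
conn.execute(
    "CREATE TABLE warehouse_orders (id INTEGER, amount REAL, currency TEXT)"
)
conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?, ?)",
    [(1, 1250, "usd"), (2, 990, "eur")],
)

load(conn, transform(extract(conn)))
loaded = conn.execute(
    "SELECT id, amount, currency FROM warehouse_orders ORDER BY id"
).fetchall()
print(loaded)
```

A production ETL job would typically split the transform phase into several stages; a single function is used here only to keep the sketch short.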
Because data must be transmitted through each of several successive ETL stages, analysis of the data to support decisions cannot take place until data processing at each of the successive stages is complete. Therefore, a system that significantly reduces the time required for ETL operations to complete would be advantageous.
There is a need for a solution that can resolve the delay caused by transmitting data through each of the successive ETL stages until the data reaches the stage or process that requires it. It would be desirable to resolve this delay by providing a solution whereby data is transmitted directly to the stage that actually needs to utilize the data.