A data warehouse represents the result of a transformation of raw data from an organization's data sources into a database that is accessible for query and analysis. Typically, data is moved from a source database into a data warehouse in three stages: extraction of data from the source database, transformation of the extracted data, and loading of the transformed data into the data warehouse. These stages are referred to collectively as ETL, which stands for extraction, transformation, and loading. A database may be a relational database, an object-relational database, or a set of flat files. For example, a source database may be an online transaction processing (OLTP) system that is oriented towards the “real time” operation of a business. The data is extracted from the OLTP source database, transformed, and loaded in a data warehouse. The transformed data in the data warehouse may be used by an online analytical processing (OLAP) system that can answer longer range, management-oriented questions about the business.
ETL development tools are software tools that are used to extract, transform, and load data into data warehouses. Conventional ETL development tools consist of a variety of components, including a data movement planner and data movement engine. A data movement planner generates data movement plans. Data movement plans include ETL components that define ETL operations. ETL components include extraction components, which define an extraction operation from a source database, and transformation components, which define operations like join, aggregation, and filter. A data movement engine executes the data movement plan.
An advantage of a data movement plan is that they may be represented graphically as a data movement graph that consists of interconnected icons representing ETL components. Such data movement graphs help a user to visualize a data movement plan, greatly improving the users ability to unravel and understand the complexity of data movement plans. Once the user has entered input defining the data movement plan, the data movement planner generates the data movement plan.
ETL development tools have various drawbacks that stem from the fact that they are based on a paradigm that requires that the data movement engine be a different system than the source database system and/or data warehouse and that imposes a dichotomous treatment of the extraction and loading components. Under this paradigm, the data movement engine must interact with another database system to perform operations for the extraction and/or loading components.
Furthermore, the extraction component is treated as a black box, a simple query (source query) not to be modified but to be simply provided to the source database system for execution. The source query is provided and designed by a user, and conforms to a native database language that is supported by the source database system. When a data movement plan is executed, the data movement engine submits the source query to the source database system. In response, the source database system computes the source query and supplies the results to the data movement engine. The results serve as input to the remainder of the execution of the data movement plan.
Once the engine transforms the data, the data is loaded into a data warehouse. Because the data movement engine and data warehouse are different systems, the transformed data is transferred to the data warehouse, which then loads the transformed data.
A major drawback suffered under this paradigm is that the data movement engines ability to optimize data movement plans is substantially impaired for several reasons. First, data movement engines lack access to information that is needed for optimization. This information includes, for example, statistics on tables and the number of rows currently in the source tables. Typically, a source database system has this sort of information. Even if a data movement engine has access to the information, it does not use the information to optimize data movement plans.
Second, because the extraction component is treated as a black box to be executed by the source database system, the data movement engine cannot push the execution of some operations defined by a transform component to the source system, causing the source database system, data movement engine, and data warehouse to perform wasteful work. For example, a data movement plan defines a transform component in the form of a filter operation. It is more efficient to push the filter operation to the source database system by adding a filter to the source query. However, because the data movement engine treats the extraction component as a black box which cannot modified but must simply be provided to the source database system for execution, the source database system executes the source query and extracts data that does not satisfy the filter's criteria. The data movement engine applies the filter only after the source database system extracts data. Thus, work is wasted extracting and transporting over a computer network data that does satisfy the filter condition and that could never be moved to the target database system.
To compensate for this inability to optimize data movement plans, the user must manually optimize the execution of the data movement plan by defining the extraction component to include operations that would otherwise be defined by certain transform operations. For example, a data movement plan includes a join and aggregation operation, among other operations. So that the join and aggregation operations may be executed more efficiently, the user creates a source query that specifies these operations, causing them to be performed by the source database system. The user intentionally foregoes defining transformation components in the data movement plan that perform the join and aggregate operations in the extraction component, so that the operations are not performed by the data movement engine and are instead left to be performed by the source database system.
While this strategy may work, it has disadvantages. First, the strategy requires the greater skill and effort needed to program more complex source queries. The strategy also deprives the user of the substantial benefits of using the computer aided design functions available for defining ETL components.
Finally, source database systems change. Many of the changes must be propagated to data movement plans. Propagating the changes may entail examining the code previously entered for ETL components—a manual process which can be very cumbersome and expensive.
Based on the foregoing, it is desirable to provide a mechanism that may be used to implement movement plans that avoid the various drawbacks attendant conventional ETL systems.