1. Technical Field
The present invention is directed to a system and method for data extract, transform and load (ETL) in a data warehouse environment. More specifically, the present invention provides a command conversion mechanism for dynamically converting ETL commands that operate as a single transaction into a plurality of ETL commands that operate as a plurality of transactions for enhancement of the performance of ETL execution and a system that supports such a mechanism.
2. Description of Related Art
Data warehousing is the process by which data obtained by an operational system is stored in one or more databases separate from the operational system so that this data may be used for business analysis apart from the operational data in the operational system. There are many reasons why the data stored for business analysis is separated from the operational data. First, data may be integrated into a data warehouse from more than one operational system. In this way, data from multiple sources may be cross-referenced so that information regarding the correlation between data may be obtained from the data warehouse.
Second, the operational system's performance may be degraded because of analysis processes that are performed on the operational data. By separating out the data used for business analysis, the analysis operations may be performed separately from the operational aspects of the operational system and thus, a degradation in performance of the operational system is avoided.
Third, data that is stored in the data warehouse is typically non-volatile. That is, the data is not stored in the data warehouse until the data is most likely in its final state and is not going to be modified. For example, the order status does not change, the inventory snapshot does not change, the marketing promotion details do not change, etc. This allows the user that is performing business analysis to rely on the accuracy of the data stored in the data warehouse since it is unlikely that the data has been changed since it was stored in the data warehouse.
Fourth, data from most operational systems is archived after the data becomes inactive. Since the data warehouse is designed as the archive of the operational data, the data is stored for a very long period of time compared to the operational data in the operational system. The cost for maintaining the data in the data warehouse is minimal. As a result, data archived in a data warehouse may be retrievable at remote times from when the data was first obtained.
The data stored in the data warehouse is stored in a structured manner such that the data may be subjected to query and analysis. Typically, this structure takes the form of one or more database tables having records that are structured in a specific manner. These database tables may be queried and analyzed using database query and analysis commands such as are generally known in the art. For example, the Structured Query Language (SQL) is a language used to interrogate and process data in a relational database. Originally developed by International Business Machines, Inc. (IBM) for its mainframes, all database systems designed for client/server environments support SQL. SQL commands can be used to interactively work with a database or can be embedded within a programming language to interface to a database. Programming extensions to SQL have turned it into a full-blown database programming language, and all major database management systems (DBMSs) support the language.
Building a data warehouse involves taking the data from the operational system and storing it in the data warehouse in a structured manner. The process used for moving data from the operational system to the data warehouse is referred to as “extraction, transformation and loading” or ETL. To populate a data warehouse table, the ETL process typically makes use of combined SQL statements such as select-insert, select-update, and the like. These SQL statements are performed in a single transaction within the operational system.
In data warehouse environment, the size of the data to be transferred is usually very large. When SQL statements, such as select-insert and select-update, are performed as a single transaction, a large amount of log space and execution time is required to maintain data about the transaction in order to allow for rollback of the transaction. The performance of the execution usually will be downgraded. Thus, it would be beneficial to have a system and method for optimizing the execution of SQL statements such that the performance of the SQL execution is increased and the log space and execution time is minimized.