Many businesses and other large organizations today use relational database management systems to execute and keep track of business transactions. These systems are known as on-line transaction processing (OLTP) systems. For example, a company that sells products or services over the Internet may use an OLTP system to record pricing information about each product for sale, billing and shipping information for each purchaser, and sales information for each order made by a purchaser. Other examples of businesses that use OLTP systems include airlines, banks, mail-order companies, supermarkets, and manufacturers. The performance of OLTP systems, e.g. the number of transactions per second, is vitally important for businesses, and it is therefore desirable not to degrade the performance of OLTP systems unless absolutely necessary.
The data generated and recorded in OLTP systems are valuable to most businesses, because the businesses can aggregate and analyze the data to ascertain the product sales for a particular month, forecast changing trends in product popularity and identify profitable or unprofitable product lines, or otherwise evaluate the businesses' affairs. Aggregating and analyzing this data, however, is computationally expensive and, if performed on the OLTP system itself, would decrease the performance of the OLTP system. Accordingly, it has become common for businesses with OLTP systems to set up a separate computer system, generally known as a “data warehouse,” for the purpose of collecting, aggregating, and analyzing the information contained in the OLTP databases. Data warehouses can grow very large, ranging from gigabytes to many terabytes of data (trillions of bytes). The task of moving data from its original source in OLTP systems to the data warehouse is commonly referred to as data extraction, transport, and loading (ETL).
Conventional data extraction, transport, and loading mechanisms are cumbersome. In a typical approach, database administrators generally dump the entire contents of the tables in the OLTP system into flat files, transport the flat files to a staging system, and then load the data in the flat files into the data warehouse. In this approach, the amount of data extracted, transported, and loaded is as immense as the amount of data in the OLTP system, even though only a small percentage of the data on the OLTP system is actually new. Accordingly, there has been much interest in devising ways to reduce the amount of data extracted, transported, and loaded by capturing only the changed data to the database tables of the OLTP system.
A typical approach for capturing the changed data to OLTP system database tables is to add a column to the OLTP system database tables to store a timestamp or a sequence number and conditionally extract the data that has the newest timestamps or sequence numbers. This approach has several disadvantages. First, this approach requires a change to the schema, e.g. adding the extra column to hold the timestamp to track the changes. Not only is this schema change an administrative nightmare, many vendors of OLTP systems forbid their customers from making any changes to the schema for security, maintenance, and liability reasons.
Second, there is a performance hit in storing the timestamp for every row of new or changed data, yet performance is critical for OLTP systems. Third, while timestamps can easily identify which rows have changed due to an insert or update, timestamps cannot distinguish between a newly inserted row or an old row that was updated. Furthermore, timestamps cannot identify deleted rows, because deleted rows are no longer present in the database. The lack of this kind of change information makes it difficult to properly update summaries of the OLTP data, resulting in expensive recalculations of the summary data.
The conventional flat file approach is also disadvantageous for data transport. Flat files are outside the control of the database management system. Thus, flat files cannot take advantage of the sophisticated file management functions provided by modern database management systems, such as crash recovery for robust data protection and data partitioning and indexing for efficient data processing. Consequently, storing the changed data in flat files increases the risk that data will be lost or corrupted and greatly complicates the task of joining or combining the changed data with the data that is stored in the data warehouse for use by the data analysis applications.
In terms of data loading, the conventional flat file approach also suffers from data loss and double counting. In the flat file approach, there is no coordination between the process that updates the flat file with additional changed data and the applications that consume and analyze the changed data. When the flat file grows too large for the data warehouse system, the system administrator is likely to delete the oldest data in the flat file arbitrarily. If a data analysis application has not yet processed the deleted data, that data is permanently lost to the application, resulting in inaccurate summaries of the OLTP data. On the other hand, if the data analysis application needs to make multiple passes on the flat file to process the data, the data analysis application cannot identify the data that has been newly added to the flat file. Typically, the data analysis application is forced to reprocess all the data in the flat file, resulting in double counting of the old data.
Therefore, there is a need for data extraction, transport, and loading techniques that are efficient, do not require schema changes, are robust, and do not suffer from data loss or double counting problems.