A computer database is a structured collection of records or data that is stored in a computer system. A database relies upon software to organize the storage of data. The software models the database structure in a database model, such as a relational model, a hierarchical model or a network model.
A database transaction is a unit of work performed against a database. A database transaction is treated in a coherent and reliable way independent of other transactions. By definition, a database transaction must be atomic, consistent, isolated and durable. These properties are often referred to by the acronym ACID. The concept of a database transaction can be illustrated in connection with a double-entry accounting system. In a double-entry accounting system each debit requires an associated credit to be recorded. For example, if a purchaser writes a $20 check to a retailer, the transactional double-entry accounting system would record two entries: a $20 debit of the account of the purchaser and a $20 credit to the account of the retailer. In a transactional system, both entries would be made or both entries would fail. By treating the recording of multiple entries as an atomic transactional unit of work, one maintains the integrity of the data recorded.
Periodically, a “snapshot” of transactional data in a database is captured and loaded into a data warehouse. A data warehouse is a repository of data that is designed to facilitate reporting and analysis. In contrast, transactional database systems are optimized for the preservation of data integrity and the speed of recording business transactions. Data warehouses are optimized for speed of data retrieval. Frequently, the form of data in a transactional database is altered for storage in a data warehouse.
Extract Transform and Load or ETL is a process that involves extracting data from a source system (e.g., a transactional database), transforming it to fit business needs, and loading it into a target (e.g., a data warehouse). ETL tools are commercially available, such as Data Integrator™, which is sold by Business Objects™ an SAP™ Company located in San Jose, Calif.
Existing ETL tools perform strongly when initially populating a target with source data. Updating the target in response to changes in the source is called a delta load. Delta loads create challenges. Change Data Capture or CDC refers to software implemented methods used to identify and process source data that has changed. CDC is commonly associated with data warehouses because capturing and preserving the state of data is a core function of a data warehouse. CDC commonly relies upon a source system indicator of some kind, such as a timestamp, a database transaction log, and the like. A transaction log is typically a file of database updates stored in a reliable manner. In other words, a transaction log stores a sequence of data modifications associated with a database management system.
Techniques to construct ETL tasks are known. Similarly, techniques to define delta load tasks are known. Unfortunately, these techniques can be labor intensive and error prone. For example, many delta load tasks require the construction of multiple data flows, which can be time consuming and result in errors. Existing delta load techniques fail to leverage the information associated with ETL tasks.
Therefore, it would be desirable to leverage information associated with a defined ETL task to automatically specify a delta load task. To achieve this, it is necessary to automatically develop an appropriate strategy for a delta load task in view of system resources.