Data warehouses are central database systems that store data from many external data sources, such as other databases, servers, workstations, storage disks, or other storage media. This kind of database system provides a central location for operators and users to analyze data that would otherwise be scattered across multiple, discrete storage media. For example, data warehouses enable users to study historical data for the purpose of running business logic or models. The historical data may have, at one time, been recorded on individual computers that could not easily share large amounts of stored information with one another.
Extraction, transformation, and loading (ETL) operations are performed to move data from scattered external sources into a target database system. Specifically, an extraction operation retrieves data from one or more data sources. A transformation operation alters the extracted data, and may involve performing conversions, filtering, aggregations, and/or other operations on the data. A loading operation moves the transformed data into the target database.
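The three ETL stages can be sketched as follows. This is a minimal illustration only; the source records, table name, and schema are hypothetical, and a real ETL pipeline would read from external systems rather than in-memory lists.

```python
import sqlite3

# Hypothetical raw records from two external data sources
# (dates, regions, and amounts are illustrative only).
source_a = [("2023-01-02", "east", "100.0"), ("2023-01-02", "west", "-5.0")]
source_b = [("2023-01-03", "east", "250.0")]

def extract():
    # Extraction: retrieve raw rows from each data source.
    yield from source_a
    yield from source_b

def transform(rows):
    # Transformation: type conversion, filtering, and normalization.
    for day, region, amount in rows:
        amt = float(amount)      # conversion: string -> number
        if amt <= 0:
            continue             # filtering: drop invalid amounts
        yield day, region.upper(), amt

# Loading: move the transformed rows into the target database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", transform(extract()))
```

The negative-amount record from the first source is filtered out during transformation, so only the valid rows reach the target table.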
When loading data into a data warehouse, loading performance can be compromised by the amount of data already in the database. For example, the time it takes to load a given amount of data into a database increases as the amount of data already in the database increases, until the database system's loading performance becomes noticeably poorer after several loading operations have been performed.
Database systems typically store data in tables. One or more indexes may be built on a table to facilitate fast access to data within the table. Operations that retrieve or manipulate data within a table may first use an index for the table to determine which data items (e.g., rows) in the table are involved in the operations. Although beneficial to queries, indexes are usually pure overhead during loading of new data.
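The dual role of an index can be shown with a toy in-memory sketch (the row layout and column names are illustrative, not taken from any particular system): the index maps a column value to the positions of matching rows, so a query touches only those rows, but every load must also maintain the index.

```python
# A "table" of rows and a secondary index on the 'region' column.
rows = [
    {"id": 1, "region": "east", "amount": 40},
    {"id": 2, "region": "west", "amount": 25},
    {"id": 3, "region": "east", "amount": 60},
]

index = {}
for pos, row in enumerate(rows):
    index.setdefault(row["region"], []).append(pos)

# Query path: consult the index first, then read only the listed rows
# instead of scanning the entire table.
east_rows = [rows[p] for p in index.get("east", [])]

# Load path: inserting a row also requires updating the index --
# the overhead during loading that the text refers to.
rows.append({"id": 4, "region": "west", "amount": 10})
index.setdefault("west", []).append(len(rows) - 1)
```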
Over time, the tables in the database system become larger as a result of collecting more and more data. As a table becomes larger, the indexes built on the table also become larger. Thus, the deterioration of loading performance is at least partially the result of the database system trying to manage one or more indexes for the table into which new data is being loaded.
Database systems typically use variations of B*-Tree indexes to manage tables. This type of index must be continuously rebalanced and reorganized to account for new data that is added to the table from external data sources. Metadata that identifies data items in the table forms the nodes of the index, arranged to correspond to columns or other portions of the table. The B*-Tree index arranges the nodes so that there is a top branch node, subsequent branch nodes, and many leaf nodes. One feature of B*-Tree indexes, when used in traditional database systems and databases, is that when new entries are inserted into this type of index, routines are automatically executed to “balance” the index so that the top branch node is centrally positioned, with an equal number of branches to that node's right and left. Leaf nodes under any branch node are similarly distributed, so that each branch node is balanced. Furthermore, the B*-Tree index maintains a uniform depth for all leaf nodes, so that retrieval of any data item from anywhere in the table takes approximately the same amount of time.
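The rebalancing behavior described above can be sketched with a simplified B-Tree (a close relative of the B*-Tree; this sketch omits the B*-Tree's higher node-fill strategy). The key point it demonstrates is that nodes split as they fill, the tree grows only at the root, and every leaf therefore stays at the same depth, which is why lookups take roughly uniform time.

```python
import bisect

T = 2  # minimum degree: every node holds between T-1 and 2T-1 keys

class Node:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf

def split_child(parent, i):
    # Split the full child at position i, promoting its middle key
    # into the parent. This is the "reorganization" step on insert.
    full = parent.children[i]
    right = Node(leaf=full.leaf)
    mid = full.keys[T - 1]
    right.keys = full.keys[T:]
    full.keys = full.keys[:T - 1]
    if not full.leaf:
        right.children = full.children[T:]
        full.children = full.children[:T]
    parent.keys.insert(i, mid)
    parent.children.insert(i + 1, right)

def insert(root, key):
    # The tree grows only by splitting a full root, so all leaves
    # remain at exactly the same depth.
    if len(root.keys) == 2 * T - 1:
        new_root = Node(leaf=False)
        new_root.children.append(root)
        split_child(new_root, 0)
        root = new_root
    node = root
    while not node.leaf:
        i = bisect.bisect_left(node.keys, key)
        child = node.children[i]
        if len(child.keys) == 2 * T - 1:
            split_child(node, i)
            if key > node.keys[i]:
                i += 1
            child = node.children[i]
        node = child
    bisect.insort(node.keys, key)
    return root

def keys_inorder(node):
    # In-order traversal: keys come out sorted if the tree is valid.
    if node.leaf:
        return list(node.keys)
    out = []
    for i, child in enumerate(node.children):
        out.extend(keys_inorder(child))
        if i < len(node.keys):
            out.append(node.keys[i])
    return out

def leaf_depths(node, d=0):
    # Collect the depth of every leaf; a balanced tree yields one value.
    if node.leaf:
        return {d}
    result = set()
    for child in node.children:
        result |= leaf_depths(child, d + 1)
    return result
```

Inserting fifty keys in ascending order, a worst case for many tree structures, still leaves every leaf at the same depth, because splits propagate upward instead of letting one branch grow deeper.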
There is another type of index structure, known as a bitmap index. Maintaining bitmap indexes when new data is added is even more expensive than balancing B*-Tree indexes.
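A toy bitmap index makes the maintenance cost visible (the column and values are illustrative): one bitmap is kept per distinct value of a low-cardinality column, with bit i set when row i holds that value. Queries are fast bit scans, but every loaded row forces an update pass over the bitmap set.

```python
# Bitmap index on a low-cardinality 'region' column, one bitmap per
# distinct value, represented here as a Python integer of bits.
regions = ["east", "west", "east", "east", "west"]

bitmaps = {}
for pos, value in enumerate(regions):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << pos)

# Query path: positions of rows where region = 'east' read straight
# off the bitmap's set bits.
east_positions = [i for i in range(len(regions)) if (bitmaps["east"] >> i) & 1]

# Load path: appending one row updates the matching bitmap, and every
# bitmap logically grows by one position -- the maintenance expense
# noted above.
regions.append("east")
bitmaps["east"] |= 1 << (len(regions) - 1)
```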
An index becomes larger by adding nodes and increasing in width and depth, and a larger index is more difficult for the database system to manage. In typical situations, the amount of data already in a data warehousing system is much larger than the amount of data being inserted into it, and the loading performance of a table is noticeably affected by the size of the table's indexes.
Data in the tables of a database system must usually satisfy certain conditions, known as constraints. Examples of constraints are that values in certain columns must be unique (a unique constraint), that values in certain columns must also appear in another table (a referential constraint), or that values must satisfy more general conditions (a check constraint).
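The three constraint kinds can be demonstrated with standard SQL; the table names and columns below are illustrative only, using SQLite as the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Unique constraint: no two products may share a SKU.
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, sku TEXT UNIQUE)")

# Referential constraint: each sale must point at an existing product.
# Check constraint: quantities must be positive.
conn.execute(
    """CREATE TABLE sales (
           product_id INTEGER REFERENCES products(id),
           qty INTEGER CHECK (qty > 0)
       )"""
)
conn.execute("INSERT INTO products VALUES (1, 'A-100')")
conn.execute("INSERT INTO sales VALUES (1, 5)")

# Each of these inserts violates one of the three constraint kinds.
violations = []
for stmt in (
    "INSERT INTO products VALUES (2, 'A-100')",  # duplicate SKU (unique)
    "INSERT INTO sales VALUES (99, 5)",          # no such product (referential)
    "INSERT INTO sales VALUES (1, 0)",           # qty not positive (check)
):
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError:
        violations.append(stmt)
```

All three offending inserts are rejected by the database, while the valid rows remain in place.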
Tables can be stored in partitioned or monolithic forms. Each partition of a partitioned table is associated with a partition condition which specifies what data can reside in that partition. Each partition of a partitioned table is physically stored like a monolithic table, but logically it is a member of a larger table. The logical membership relationship is recorded by the database system.
In a partitioned table, all constraints are further categorized into global constraints and local constraints. Global constraints are conditions that must be satisfied by data items in the table without regard to partitions. Local constraints are conditions that must be satisfied by all data items in a particular partition of the table. In a data warehousing system, constraints are usually local, or can be easily converted to local constraints.
When new data is loaded into a table, traditional database systems write the new data to one or more partitions of the table. During the load, the entire table is checked for satisfaction of both local and global constraints. Even if the new data is written to only one partition, database systems still check the entire table for satisfaction of all local and global constraints. If the new data causes one of the constraints to fail, the load is undone.
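By contrast, when a constraint is local, verifying a load in principle requires examining only the partition being written. A minimal sketch under assumed names (partitions keyed by month, a local unique constraint on an 'id' column):

```python
# Each partition stores its rows plus a set of ids for its local
# unique constraint. Names and layout are hypothetical.
partitions = {
    "2023-01": {"rows": [{"id": 1}], "ids": {1}},
    "2023-02": {"rows": [{"id": 2}], "ids": {2}},
}

def load(month, new_rows):
    part = partitions[month]
    new_ids = [r["id"] for r in new_rows]
    # Local constraint check: only the target partition is examined,
    # not the entire table.
    if len(set(new_ids)) != len(new_ids) or part["ids"] & set(new_ids):
        return False  # violation: write nothing, i.e. the load is undone
    part["rows"].extend(new_rows)
    part["ids"].update(new_ids)
    return True
```

A first load of a fresh id succeeds; repeating it fails the local uniqueness check and leaves the partition unchanged, and no other partition is ever read.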
A data warehousing system is normally a central database for a business unit, which may range from a mid-sized department to a large enterprise. The central database collects subject-oriented data (such as sales, products, or customers) from distributed locations of the business unit. The scattered information is integrated at the central database for queries by business analysis software. To serve business analysis, a data warehouse normally must retain historical data. In other words, once collected, data normally stays in the data warehouse for a long period of time without being changed or purged. At regular business closings, new data needs to be loaded into the database. As a result, the database size increases over time. Partitioning is a traditional method of organizing a large amount of data into more manageable chunks. It is also traditionally used to help queries prune unwanted data. But the loading of new data still suffers from index and constraint overhead. This application describes a method that enables the partitioning technique to also speed up the data loading process in a data warehousing environment.