A database system stores collections of logically related data. One common type of database system is the relational database management system (RDBMS), which stores data in tables (or relations) that are arranged as rows and columns. To access data, requests expressed in a standard database query language are submitted to the database system to retrieve, update, or delete data, as well as to manipulate various objects in the database system, including tables, definitions of tables, and so forth.
As computer technology has progressed, database systems have become able to store increasing amounts of data. In some cases, parallel database systems are used to store and process massive amounts of data. A parallel database system typically includes multiple processing units that enable concurrent access to and manipulation of data within the database system. One example of such a parallel database system is the TERADATA® database system from NCR Corporation.
One of the issues associated with a parallel database system is the partitioning of data across multiple processing units. There are various types of partitioning schemes, including hash partitioning and range-based partitioning. Hash partitioning refers to storing a particular row in a particular processing unit based on a hash value associated with the row. This hash value is typically generated by applying a hashing algorithm to the value of an attribute (or of multiple attributes) within the row. Each processing unit can be associated with one or more hash values, such that a row associated with a given hash value is stored in the corresponding processing unit.
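The hash partitioning described above can be illustrated with a minimal sketch. The names `NUM_UNITS`, `hash_value`, `unit_for_row`, and `partition` are hypothetical, as are the use of MD5 as the hashing algorithm and a modulo operation to map hash values to processing units; an actual parallel database system may use a different hashing algorithm and a different mapping.

```python
import hashlib
from collections import defaultdict

NUM_UNITS = 4  # hypothetical number of processing units


def hash_value(row, key_attrs):
    """Derive a stable hash value from one or more attribute values of a row."""
    key = "|".join(str(row[attr]) for attr in key_attrs)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first four bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:4], "big")


def unit_for_row(row, key_attrs, num_units=NUM_UNITS):
    """Map the row's hash value to a processing unit (assumed modulo mapping)."""
    return hash_value(row, key_attrs) % num_units


def partition(rows, key_attrs, num_units=NUM_UNITS):
    """Group rows by the processing unit that would store them."""
    units = defaultdict(list)
    for row in rows:
        units[unit_for_row(row, key_attrs, num_units)].append(row)
    return units
```

In this sketch, rows with the same values for the chosen attributes always map to the same processing unit, because the hash value depends only on those attribute values.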
A range-based partitioning scheme involves assigning a range of attribute values to each processing unit, such that a row whose attribute value falls within that range is stored in the corresponding processing unit. The range-based partitioning scheme usually works well if (1) there are sufficiently many unique values in the data, and (2) there is generally an equal number of rows for each data value. If either of these conditions is not met, range partitioning can result in an uneven distribution of rows across the processing units. When rows are unevenly distributed, at least one processing unit must process substantially more data than the other processing units. As a result, the load across the processing units becomes unbalanced, which may reduce the parallelism of the database system.
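The effect of skewed data on range-based partitioning can be illustrated with a small sketch. The boundary values in `BOUNDARIES`, the function names, and the two sample data sets are assumptions chosen only for illustration; they are not taken from any particular database system.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical range boundaries for four processing units:
# unit 0 holds values below 25, unit 1 holds 25..49,
# unit 2 holds 50..74, and unit 3 holds 75 and above.
BOUNDARIES = [25, 50, 75]


def unit_for_value(value, boundaries=BOUNDARIES):
    """Return the index of the processing unit whose range contains value."""
    return bisect_right(boundaries, value)


def rows_per_unit(values, boundaries=BOUNDARIES):
    """Count how many rows each processing unit would store."""
    counts = Counter(unit_for_value(v, boundaries) for v in values)
    return [counts.get(unit, 0) for unit in range(len(boundaries) + 1)]


# Many unique values, one row per value: rows are spread evenly.
uniform = list(range(100))

# Few unique values, one of them dominant: condition (2) is violated.
skewed = [10] * 70 + [30] * 10 + [60] * 10 + [90] * 10

print(rows_per_unit(uniform))  # [25, 25, 25, 25]
print(rows_per_unit(skewed))   # [70, 10, 10, 10]
```

With the skewed data set, the first processing unit stores seven times as many rows as each of the others, which is the kind of load imbalance described above.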