This invention relates generally to processing records from a database.
A database is a body of information that is logically organized so that the information can be retrieved, stored and searched by a computer system in a coherent manner by a "database engine"--a collection of methods for retrieving or manipulating data in the database. Databases generally fall into one of several categories, such as relational databases, object-oriented databases and object-relational databases.
A relational database (RDB) is a collection of fixed-field two-dimensional tables that can be related or "joined" to each other in virtually any manner a database developer chooses. The structure of a relational database can be modified by selectively redefining the relationships between the tables. A database engine may perform complex searches on a relational database quickly and easily with a protocol known as the Structured Query Language (SQL). The relationships between the tables enable results of a search automatically to be cross-referenced to corresponding information in other tables in the database. As shown in FIG. 1, for example, a relational database 1 includes a customer table 2 which is joined by a logical link 3 to an order table 4 which in turn is joined by a logical link 5 to an inventory table 6. A user may query the database 1, for example, for all order numbers higher than a threshold value. The list of order numbers resulting from the query can be retrieved and displayed along with the respective customer names and inventory items that correspond to the identified order and numbers because the order table 4 is joined with the customer table 2 and the inventory table 6. As illustrated by this example, a "join" operation builds a relation from two or more specified tables consisting of all possible combinations of tuples, one from each of the two or more tables, such that the tuples contributing to any given combination satisfy some specified condition. Generally, a join operation is performed in response to a query in which data is to be retrieved from more than one table in a database.
Hash operations can be used to simplify and provide fast access to a specific stored record on the basis of a given value for some field in the database. More specifically, a "hash-join" operation can be used to provide a quicker response to a query requiring data retrieval from multiple tables in the database. Consider, for example, a join operation involving two tables, "build" and "probe". Data entries or records from the "build" table, for example, are used to construct a hash table in a computer memory using a hash-function. A hash-function is similarly applied to the appropriate records in the "probe" table. The hashed values from the "probe" table are then used to search for matches in the hash table constructed from the "build" table. When a match is found, the data from the "probe" table is joined with the corresponding data from the "build" table. The matched data is returned to the user as a result of the "hash-join" operation.
In the context of parallel processing, various operations executed during a hash-join operation can be performed in parallel by executing multiple instances of the various operations simultaneously on one or more processors. For example, the operator, or software module, producing the data for the "build" side, the operator producing the data for the "probe" side, and the hash-join operator can be performed in parallel.
Ideally, each bucket in the hash table of each instance of the hash-join operator receives approximately the same amount of data. However, this ideal distribution may not occur because the distribution of data values in the "build" and "probe" tables is skewed, or unevenly distributed. For example, if all the entries in a particular column of the table "build" are identical, then all the records from the build-table will be sent to the same instance of the hash-join operator, leaving the remaining instances of the hash-join operator idle, and resulting in a waste of available processing power.
Several types of skewing of data can be identified: (1) "build" skew; (2) "probe" skew; and (3) "join" skew. Build skew indicates that the data used to construct the hash table is unevenly distributed, or, in other words, that the data from the build table is unevenly distributed. Probe skew indicates that the data from the probe table is unevenly distributed. Finally, join skew indicates that the output data produced by the hash-join operation is unevenly distributed. Join skew can occur even if the input data from the build and probe tables is evenly distributed.