A database is a collection of stored data that is logically related and that is accessible by one or more users. A popular type of database system is the relational database management system, which includes relational tables made up of rows and columns. Each row represents an occurrence of an entity defined by the table, with an entity being a person, place, or thing about which the table contains information.
To extract data from a relational table, queries written in a standard database query language (e.g., Structured Query Language, or SQL) can be used. Examples of SQL statements include INSERT, SELECT, UPDATE, and DELETE. The SELECT statement is used to retrieve information from the database and to organize that information for presentation to the user or to an application program. The SELECT statement can also specify a join operation to join rows of multiple tables. A common type of join operation is a simple join (or equijoin), which uses an equal (=) comparison operator to join rows from multiple tables. Another type of join is a non-equijoin, which is based on operators other than the equal comparison (e.g., > or <).
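The difference between an equijoin and a non-equijoin can be illustrated with a minimal sketch that uses Python's built-in sqlite3 module. The table names, column names, and sample rows (employee, department, salary_grade) are hypothetical and chosen only for illustration; they are not taken from any particular database system.

```python
import sqlite3

# In-memory database with three small, hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (name TEXT, dept_id INTEGER, salary INTEGER);
    CREATE TABLE department (dept_id INTEGER, dept_name TEXT);
    CREATE TABLE salary_grade (grade TEXT, min_salary INTEGER, max_salary INTEGER);
    INSERT INTO employee VALUES ('Ann', 1, 5000), ('Bob', 2, 3000);
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Support');
    INSERT INTO salary_grade VALUES ('A', 4000, 9999), ('B', 0, 3999);
""")

# Equijoin: rows are matched with the equal (=) comparison operator.
equijoin_rows = conn.execute("""
    SELECT e.name, d.dept_name
    FROM employee e JOIN department d ON e.dept_id = d.dept_id
    ORDER BY e.name
""").fetchall()

# Non-equijoin: rows are matched with range comparisons (>= and <=).
non_equijoin_rows = conn.execute("""
    SELECT e.name, g.grade
    FROM employee e JOIN salary_grade g
      ON e.salary >= g.min_salary AND e.salary <= g.max_salary
    ORDER BY e.name
""").fetchall()

print(equijoin_rows)      # [('Ann', 'Sales'), ('Bob', 'Support')]
print(non_equijoin_rows)  # [('Ann', 'A'), ('Bob', 'B')]
```

In the equijoin, each employee row pairs with the department row that has the same dept_id value. In the non-equijoin, each employee row pairs with the salary_grade row whose range contains the employee's salary.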
Some database systems, such as the TERADATA® system from NCR Corporation, have multiple access modules to provide a massively parallel processing (MPP) database system. An access module manages a predefined storage space of the database and manages access to data stored in that storage space. Typically, in a parallel database system having a plurality of access modules, each table is distributed across the plurality of access modules. Thus, for each table, some rows are stored in storage space associated with one access module, while other rows are stored in storage space associated with one or more other access modules. By distributing the rows of each table among plural access modules, the access modules can process data in a target table concurrently, which improves database speed and performance.
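The benefit of spreading a table's rows over several access modules can be sketched as follows. This is a structural illustration only: the partition layout, the row fields, and the helper names local_sum and parallel_sum are assumptions, and Python threads stand in for independent access modules without reproducing the performance of a real MPP system.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: each inner list holds the rows owned by one access module.
partitions = [
    [{"id": 1, "amount": 10}, {"id": 5, "amount": 70}],
    [{"id": 2, "amount": 25}],
    [{"id": 3, "amount": 40}, {"id": 4, "amount": 5}],
]

def local_sum(rows, threshold):
    """Work done by one access module: scan only the rows it owns."""
    return sum(r["amount"] for r in rows if r["amount"] >= threshold)

def parallel_sum(partitions, threshold):
    """Run the scan on every partition concurrently, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda rows: local_sum(rows, threshold), partitions))
    return sum(partials)

total = parallel_sum(partitions, 20)
print(total)  # 135
```

Each worker touches only its own partition, so no worker waits on rows held by another; only the small partial sums are combined at the end.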
In a TERADATA® database system, a primary index is used to assign a row of a table to a given access module. A primary index is defined at table creation. A primary index can be defined to include a single column or a combination of columns. One of multiple access modules is identified by passing a primary index value through a hashing algorithm. The output of the hashing algorithm contains information that points to a specific one of plural access modules that a row is associated with.
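The mapping from a primary index value to an access module can be sketched as below. The CRC-32 checksum from Python's zlib module is used purely as a stand-in hash; it is not the hashing algorithm of the TERADATA® system, and the module count, function names, and sample rows are likewise hypothetical.

```python
import zlib

NUM_ACCESS_MODULES = 4  # hypothetical number of access modules

def access_module_for(primary_index_value, num_modules=NUM_ACCESS_MODULES):
    """Hash a primary index value (a tuple of column values) to a module number."""
    key = "|".join(str(v) for v in primary_index_value).encode("utf-8")
    # CRC-32 is only a stand-in for the system's real hashing algorithm.
    return zlib.crc32(key) % num_modules

def distribute(rows, primary_index_columns, num_modules=NUM_ACCESS_MODULES):
    """Assign every row to the access module selected by its primary index value."""
    modules = {m: [] for m in range(num_modules)}
    for row in rows:
        pi_value = tuple(row[c] for c in primary_index_columns)
        modules[access_module_for(pi_value, num_modules)].append(row)
    return modules

rows = [
    {"cust_id": 101, "region": "east", "balance": 250},
    {"cust_id": 102, "region": "west", "balance": 900},
    {"cust_id": 103, "region": "east", "balance": 40},
]

# Single-column primary index, then a two-column (composite) primary index.
by_cust = distribute(rows, ["cust_id"])
by_cust_region = distribute(rows, ["cust_id", "region"])
```

Because the hash is deterministic, two rows with the same primary index value always land on the same access module, which is what allows a lookup by primary index to go directly to one module.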
To perform a join operation, it is sometimes necessary to redistribute certain rows of one table from a given access module to another access module. Redistribution takes up database bandwidth, with the cost of redistribution being proportional to the size of the rows being redistributed. In other words, the larger the redistributed rows, the greater the cost of redistribution.
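The relationship between row size and redistribution cost can be made concrete with a small sketch. It assumes integer keys, uses a simple modulo operation as a stand-in for a real hash function, and estimates row size as the byte length of the values rendered as text; all names and sample data are hypothetical.

```python
def module_for(value, num_modules):
    # Stand-in for a real hash function; assumes integer keys.
    return value % num_modules

def row_size(row):
    # Rough size estimate: bytes needed to write each value as text.
    return sum(len(str(v).encode("utf-8")) for v in row.values())

def redistribution_cost(rows, home_column, join_column, num_modules):
    """Return (rows_moved, bytes_moved) when rows placed by home_column
    are re-placed by join_column so that a join can run locally."""
    rows_moved = 0
    bytes_moved = 0
    for row in rows:
        current = module_for(row[home_column], num_modules)
        target = module_for(row[join_column], num_modules)
        if target != current:
            rows_moved += 1
            bytes_moved += row_size(row)
    return rows_moved, bytes_moved

cust_ids = [0, 4, 2, 5, 1, 3]
narrow_rows = [{"id": i, "cust": c} for i, c in enumerate(cust_ids)]
# Same keys, but every row also carries a 100-byte payload column.
wide_rows = [dict(r, note="x" * 100) for r in narrow_rows]

narrow_cost = redistribution_cost(narrow_rows, "id", "cust", 3)
wide_cost = redistribution_cost(wide_rows, "id", "cust", 3)
print(narrow_cost)  # (2, 4)
print(wide_cost)    # (2, 204)
```

The same two rows change access modules in both cases, but the wider rows move 204 bytes instead of 4, so the bytes transferred grow with row size even though the number of rows moved is unchanged.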