A database is a collection of stored data that is logically related and that is accessible by one or more users or applications. A popular type of database system is the relational database management system (RDBMS), which includes relational tables, also referred to as relations, made up of rows and columns (also referred to as tuples and attributes, respectively). Each row represents an occurrence of an entity defined by a table, with an entity being a person, place, thing, or other object about which the table contains information.
One of the goals of a database management system is to optimize the performance of queries for access and manipulation of data stored in the database. Given a target environment, an optimal query plan is selected, with the optimal query plan being the one with the lowest cost (e.g., response time) as determined by an optimizer. The response time is the amount of time it takes to complete the execution of a query on a given system.
In massively parallel processing (MPP) systems, dealing with data skew in parallel joins is critical to the performance of many applications. As is understood, a join comprises a structured query language (SQL) operation that combines records from two or more tables. Contemporary parallel database systems provide for the distribution of data to different parallel processing units, e.g., Access Module Processors (AMPs), by utilizing hash redistribution mechanisms. When joining two or more relations, e.g., relations “R” and “S”, by join conditions such as R.a=S.b, rows in both tables with the same join column values need to be relocated to the same processing unit in order to evaluate the join condition. To achieve this, contemporary systems typically implement one of two options.
Assume R and S are partitioned across various processing units and that neither R.a nor S.b is the primary index, i.e., the column whose values are originally hashed to distribute the base table rows to the processing units. The MPP optimizer may hash redistribute rows of R on R.a and hash redistribute rows of S on S.b. Because the same hash function is used for both relations, rows with the same join column values are guaranteed to be redistributed to the same processing unit. The optimizer will then choose the best join method in the local processing unit, e.g., based on collected statistics or other criteria. Such a parallel join mechanism is referred to herein as redistribution.
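The redistribution mechanism may be illustrated by the following minimal Python sketch, which simulates the processing units as in-memory lists. The names `NUM_UNITS`, `unit_for`, `redistribute`, `local_hash_join`, and `redistribution_join` are hypothetical and chosen for illustration only, rows are assumed to be dictionaries, and the local join is shown as a simple hash join, although an optimizer may select any local join method.

```python
from collections import defaultdict

NUM_UNITS = 4  # hypothetical number of processing units (assumption)


def unit_for(value, num_units=NUM_UNITS):
    """Map a join column value to a processing unit.

    The same function is applied to both relations, so equal join
    values always land on the same unit.
    """
    return hash(value) % num_units


def redistribute(rows, key, num_units=NUM_UNITS):
    """Hash redistribute rows on the given join column."""
    units = [[] for _ in range(num_units)]
    for row in rows:
        units[unit_for(row[key], num_units)].append(row)
    return units


def local_hash_join(r_rows, s_rows, r_key, s_key):
    """Join the rows that are co-located on a single processing unit."""
    index = defaultdict(list)
    for s in s_rows:
        index[s[s_key]].append(s)
    return [(r, s) for r in r_rows for s in index.get(r[r_key], [])]


def redistribution_join(r_rows, s_rows, r_key="a", s_key="b",
                        num_units=NUM_UNITS):
    """Evaluate R.a = S.b by redistributing both relations, then joining locally."""
    r_units = redistribute(r_rows, r_key, num_units)
    s_units = redistribute(s_rows, s_key, num_units)
    result = []
    for r_local, s_local in zip(r_units, s_units):
        result.extend(local_hash_join(r_local, s_local, r_key, s_key))
    return result
```

In this sketch, the per-unit joins are executed one after another for simplicity; in an MPP system each processing unit would perform its local join in parallel with the others.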
Redistribution is typically efficient when the rows are sufficiently evenly distributed among the processing units. However, consider the case where there is highly skewed data in column R.a and/or S.b. In this situation, a processing unit that receives the frequently occurring join column values will have an excessive load with respect to the other processing units involved in the join operation. A processing unit featuring an excessive load in such a situation is referred to herein as a hot processing unit. Consequently, system performance is degraded, and the skew may result in an “out of spool space” error on the hot processing unit, which may cause, for example, queries to abort after hours of operation in large data warehouses.
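The effect of skew on hash redistribution may be illustrated by the following minimal Python sketch, which counts how many rows each processing unit would receive. The function names `unit_loads` and `skew_ratio`, and the use of the ratio of the largest load to the average load as a skew measure, are assumptions made for illustration only.

```python
def unit_loads(join_values, num_units):
    """Count the rows each processing unit receives after hash redistribution."""
    loads = [0] * num_units
    for value in join_values:
        loads[hash(value) % num_units] += 1
    return loads


def skew_ratio(loads):
    """Ratio of the most heavily loaded unit to the average load.

    A value near 1.0 indicates an even distribution; a large value
    indicates a hot processing unit.
    """
    average = sum(loads) / len(loads)
    return max(loads) / average if average else 0.0


# Hypothetical data: 90 rows share the join value 7, 10 rows are distinct.
values = [7] * 90 + list(range(10))
loads = unit_loads(values, num_units=4)
```

With the hypothetical data above and four processing units, every row carrying the value 7 is sent to the same unit, so that unit holds 92 of the 100 rows while the remaining units hold two or three rows each.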
Alternatively, the optimizer may choose to duplicate the rows of one relation among the processing units. For example, assume the relation R is much larger than the relation S. In such a situation, the rows of R may be maintained locally at each processing unit where R resides, and the rows of S are duplicated to each of the processing units. Such a mechanism is referred to as table duplication. By this mechanism, rows with the same join column values will be located at the same processing unit, thereby allowing completion of the parallel join operation. However, efficient performance utilizing a duplication mechanism requires that one relation be sufficiently small to allow for duplication on all the parallel units.
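The table duplication mechanism may be illustrated by the following minimal Python sketch, in which the rows of R remain in their existing partitions and a full copy of S is provided to every processing unit. The names `duplication_join` and `duplicated_row_count` are hypothetical, rows are assumed to be dictionaries, and the local join is shown as a simple hash join for illustration only.

```python
def duplication_join(r_partitions, s_rows, r_key="a", s_key="b"):
    """Evaluate R.a = S.b with R kept local and S duplicated to every unit.

    r_partitions holds one list of R rows per processing unit.
    """
    result = []
    for r_local in r_partitions:
        # Each unit receives its own full copy of the smaller relation S.
        s_copy = list(s_rows)
        index = {}
        for s in s_copy:
            index.setdefault(s[s_key], []).append(s)
        # R rows are joined where they already reside; none are moved.
        for r in r_local:
            for s in index.get(r[r_key], []):
                result.append((r, s))
    return result


def duplicated_row_count(s_rows, num_units):
    """Total number of S rows stored across all units after duplication."""
    return len(s_rows) * num_units
```

Because the number of stored copies of S grows with the number of processing units, as `duplicated_row_count` shows, the sketch also reflects why duplication is practical only when one relation is sufficiently small.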