In Massively Parallel Processing (MPP) systems, Business Intelligence (BI) and Enterprise Data Warehouse (EDW) applications process massive amounts of data. The data (a set of relational tables) resides in very large database systems that rely on a large number of central processing units (CPU) to efficiently execute database operations. MPP systems attempt to evenly distribute the data among the available processors and then perform the intended operation in parallel, instead of performing the operation serially.
One of the basic and most common database operations is the join between two relational tables. The join operator combines the records from both tables based on a matching criterion between columns in the tables. For example, the table LINEITEM can be joined to table PRODUCT by matching product_id columns on both tables to get a set of all line items with their product information. The join operation is often the most computationally expensive operation in the query execution tree, and its performance dictates the overall performance of the query.
To perform the join operation efficiently in parallel, the system partitions the data stream from both tables based on the value of the join column (product_id in the example above). That is, all records that have the same value of the join column from either table, or child, of the join are guaranteed to be sent to the same central processing unit (CPU). Hence, all join matches can be found locally in each CPU and independently of the other CPUs.
This partition-by-value scheme works well when records are distributed uniformly. The use of a good hash function ensures that distinct values are distributed uniformly (or pseudo-randomly) to all processors. However, a good hash function does not guarantee that records are distributed evenly since not all distinct values have the same occurrence frequency in the data set. The problem becomes evident when one value has an occurrence frequency higher than the average number of records per CPU. This is called data skew or skew. In the case of skew, the CPU selected by the frequent value will process a significantly higher number of records than average, which may significantly degrade the query response time.