In Massively Parallel Processing (MPP) systems, Business Intelligence (BI) and Enterprise Data Warehouse (EDW) applications process massive amounts of data. The data (a set of relational tables) resides in very large database systems that rely on a large number of central processing units (CPUs) to efficiently execute database operations. MPP systems apply the divide-and-conquer approach of evenly distributing the data among the available processors and then performing the intended operation in parallel, instead of performing the operation serially.
One of the basic and most common database operations is the join between two relational tables. The join operator combines the records from both tables based on a matching criterion between columns in the tables. For example, the table LINEITEM can be joined to table PRODUCT by matching product_id columns on both tables to get a set of all line items with their product information. The join operation is often the most computationally expensive operation in the query execution tree, and its performance dictates the overall performance of the query.
To perform the join operation efficiently in parallel, the system partitions the data stream from both tables based on the value of the join column (product_id in the example above). That is, all records that have the same value of the join column from either table, or child, of the join are guaranteed to be sent to the same central processing unit (CPU). Hence, all join matches can be found locally in each CPU and independently of the other CPUs.
FIG. 1 is a block diagram of how the join operation is parallelized using a partition-by-value scheme. Data from both sides of the join (Table A and Table B along the bottom of FIG. 1) is read and redistributed to the n join instances (along the top of FIG. 1). Each join instance is running as part of an Executor Server Process (ESP) hosted on a separate CPU in the MPP system. The data is repartitioned based on the value of the join column(s) to ensure the correctness of the parallel join result. Specifically, the hash value of the join column(s) is used to select the join instance.
This partition-by-value scheme works well when records are distributed uniformly. The use of a good hash function ensures that distinct values are distributed uniformly (or pseudo-randomly) to all processors. However, a good hash function does not guarantee that records are distributed evenly since not all distinct values have the same occurrence frequency in the data set. The problem becomes evident when one value has an occurrence frequency higher than the average number of records per CPU. This is called data skew or skew. In the case of such data skew, the CPU selected by the frequent value will process a significantly higher number of records than average, which would significantly degrade the query response time.
The effect of skew on the partition-by-value scheme is demonstrated by FIG. 2. FIG. 2 is a block diagram of the location of the column values in the join instances after a partition-by-value distribution. Values from both sides of the join (Table A and Table B) are hashed to one of 128 parallel join instances (Join 0-Join 127). For demonstration purpose the simple hash function of [hash(v)=v mod 128] is used. All join column values, except for v=1, are uniformly distributed and hence their corresponding rows are uniformly partitioned (i.e. each join instance has approximately the same number of join column values). The value of v=1, on the other hand, has very high frequency. Since all rows with v=1 are sent to the second join instance, the second join instance will process considerably more rows than other join instances in the scheme. This is regardless of the quality of the hash function used. Because the second join instance has considerably more rows than other join instances in the scheme, it will take longer for the second join instance to complete its processing.
Data skew is inherent in certain types of data (e.g. 40% of a sales table records have country_id =‘United States’), or could result from the data collection process (e.g. 20% of the sales table records have customer_id=‘Unknown’). In either case the database system must be capable of performing efficient queries on such data. The skew problem is important for MPP systems because the larger the system is, the more likely it will suffer from data skew problems. For example, a value that occurs in 11% of the records of a large table will slow down its join operations on an 8-way system by 2×, while the same task will slow down join operations on a 256-way MPP system by 30×.