A database is a collection of stored data that is logically related and that is accessible by one or more users or applications. A popular type of database is the relational database management system (RDBMS), which includes relational tables, also referred to as relations, made up of rows and columns (also referred to as tuples and attributes). Each row represents an occurrence of an entity defined by a table, with an entity being a person, place, thing, or other object about which the table contains information.
One of the goals of a database management system is to optimize the performance of queries for access and manipulation of data stored in the database. Given a target environment, an optimal query plan is selected, with the optimal query plan being the one with the lowest cost (e.g., response time) as determined by an optimizer. The response time is the amount of time it takes to complete the execution of a query on a given system.
In massively parallel processing (MPP) systems, dealing with data skew in parallel joins is critical to the performance of many applications. As is understood, a join comprises a structured query language (SQL) operation that combines records from two or more tables. Assume a join is to be performed in parallel by one hundred processing modules, e.g., Access Module Processors (AMPs), with a large relation R having 10 million rows and a smaller relation S having 1 million rows featuring a join condition R.a=S.b (where a and b are columns of the relations R and S). Assume both R and S are evenly partitioned across all processing AMPs and neither R.a nor S.b is the primary index, e.g., the values that are hashed to distribute the base table rows to MPP processing modules, of R or S. As referred to herein, hash redistribution comprises generating a hash value of, for example, column or index values of a table and redistributing the corresponding rows to processing modules based on the hash values. Without data skew in column R.a (e.g., relation R can be roughly evenly hash partitioned on R.a), the optimizer may hash redistribute rows of R on R.a and hash redistribute rows of S on S.b to join R and S. Each AMP will join approximately 100,000 rows of R and 10,000 rows of S.
However, consider the case where there is data skew in column R.a. Assume there is a single data value of column R.a which appears in 1 million rows of R. The above redistribution plan will redistribute 1 million rows of R with the skewed value in R.a to a single AMP. In this instance, the AMP having 1 million rows redistributed thereto will have an excessive load with respect to other AMPs involved in the join operation. An AMP featuring an excessive load in such a situation is referred to herein as a hot AMP. In the present example, the hot AMP will have about 1,090,000 rows while all other 99 AMPs will have only about 90,000 rows of R. Consequently, the system performance is degraded and may result in an “out of spool space” error on the hot AMP, which causes queries to abort often after hours of operation in large data warehouses. Out of spool space may happen because, although disk capacity continues to become larger and cheaper, parallel DBMSs still maintain spool space quotas for users on each AMP for the purpose of workload management and concurrency control.
Alternatively, the optimizer may choose to duplicate the rows of the relation S on every AMP where R resides and not redistribute the rows of R. Such a mechanism is referred to as table duplication. Table duplication may result in poor performance for systems featuring a large number of AMPs. For example, if there are one hundred AMPs involved in the join operation, the system is required to duplicate 100 million rows of the relation S to perform the join operation. Assume the size of the relation S is 200 MB. In this instance, 20 GB of spool space is disadvantageously required for the duplication of the relation S. Further, with a large number of AMPs, the duplication process may take an excessive duration to complete because only a single broadcast may be performed at any point of time.