A database is a collection of stored data that is logically related and that is accessible by one or more users or applications. A popular type of database is the relational database management system (RDBMS), which includes relational tables, also referred to as relations, made up of rows and columns (also referred to as tuples and attributes). Each row represents an occurrence of an entity defined by a table, with an entity being a person, place, thing, or other object about which the table contains information.
One of the goals of a database management system is to optimize the performance of queries for access and manipulation of data stored in the database. Given a target environment, an optimal query plan is selected, with the optimal query plan being the one with the lowest cost (e.g., response time) as determined by an optimizer. The response time is the amount of time it takes to complete the execution of a query on a given system.
In massively parallel processing (MPP) systems, dealing with data skew in parallel joins is critical to the performance of many applications. As is understood, a join comprises a structured query language (SQL) operation that combines records from two or more tables. Partial redistribution, partial duplication (PRPD) mechanisms have been proposed for an optimizer to use when joining large tables where data skew is known and the skewed data values are also known. For example, assume a join is to be performed with a large relation R having 10 million rows and a smaller relation S having 1 million rows featuring a join condition R.a=S.b (where a and b are columns of the relations R and S respectively). Assume both R and S are evenly partitioned across all processing modules, e.g., Access Module Processors (AMPs), and neither R.a nor S.b is the primary index, e.g., the values that are hashed to distribute the base table rows to MPP processing modules, of R or S. As referred to herein, hash redistribution comprises generating a hash value of, for example, column or index values of a table and redistributing the corresponding rows to processing modules based on the hash values. Further assume there are 100 AMPs deployed in the MPP system. Without data skew in column R.a (e.g., relation R can be roughly evenly hash partitioned on R.a), the optimizer will hash redistribute rows of R on R.a and hash redistribute rows of S on S.b to join R and S. Each AMP will join about 100,000 rows of R and 10,000 rows of S.
However, consider the case where there is data skew in column R.a. Assume there is a single data value of column R.a which appears in 1 million rows of R. The above redistribution plan will redistribute 1 million rows of R with the skewed value in R.a to a single AMP. In this instance, the AMP having 1 million rows redistributed thereto will have an excessive load with respect to other AMPs involved in the join operation. An AMP featuring an excessive load in such a situation is referred to herein as a hot AMP. In the present example, the hot AMP will have about 1,090,000 rows while all other 99 AMPs will have only about 90,000 rows of R. Consequently, the system performance is degraded and may result in an “out of spool space” error on the hot AMP, which causes queries to abort often after hours of operation in large data warehouses. Out of spool space may occur because, although disk capacity continues to become larger and cheaper, parallel DBMSs may still maintain spool space quotas for users on each AMP for the purpose of workload management and concurrency control.
Contemporary PRPD plans handle the above described data skew problem by keeping the rows of R with the skewed value in R.a locally and duplicating the rows of S that have the skewed value in S.b. However, such a PRPD mechanism assumes the skewed values in R are known before query execution time, e.g., by statistics, sampling, or a full table scan.