A database is a collection of logically related stored data that is accessible by one or more users or applications. A popular type of database system is the relational database management system (RDBMS), which organizes data into relational tables, also referred to as relations, made up of rows and columns (also referred to as tuples and attributes, respectively). Each row represents an occurrence of an entity defined by a table, where an entity is a person, place, thing, or other object about which the table contains information.
One of the goals of a database management system is to optimize the performance of queries that access and manipulate data stored in the database. Given a target environment, an optimizer selects an optimal query plan, namely the plan with the lowest cost, e.g., response time. The response time is the amount of time it takes to complete the execution of a query on a given system.
In massively parallel processing (MPP) systems, dealing with data skew is critical to the performance of many applications. As is understood, a DISTINCT query comprises a structured query language (SQL) operation that returns results without duplicate values on an attribute. In contemporary MPP systems, a DISTINCT algorithm hash redistributes the rows of a table held on each processing module, such as an access module processor (AMP), based on the value in the column to which the DISTINCT keyword is applied. After the hash redistribution, each processing module removes duplicate values in that column. Because every row carrying a given value hashes to the same processing module, such an algorithm causes a system bottleneck in the presence of data skew, that is, when one or more highly skewed values appear many times in the particular column. Data skew occurs in many types of situations, such as natural demographic data skew, skew resulting from null values, or various other causes.
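The redistribute-then-deduplicate behavior described above, and the bottleneck it produces under skew, can be illustrated with a minimal single-process sketch. This is not the algorithm of any particular MPP product: the names module_for and distinct_by_redistribution are hypothetical, CRC32 stands in for whatever hash function a real system uses, and the processing modules are modeled as plain Python lists.

```python
import zlib


def module_for(value, num_modules):
    """Map a column value to a processing module (hypothetical helper).

    CRC32 of the value's repr is an assumed stand-in for the system hash;
    it is used only because it is deterministic across runs.
    """
    return zlib.crc32(repr(value).encode("utf-8")) % num_modules


def distinct_by_redistribution(values, num_modules):
    """Simulate hash redistribution followed by per-module duplicate removal.

    Returns the set of distinct values and the number of rows each
    simulated module received during redistribution.
    """
    # Step 1: every row is sent to the module chosen by hashing its value.
    received = [[] for _ in range(num_modules)]
    for value in values:
        received[module_for(value, num_modules)].append(value)

    # Rows received per module; this is where skew becomes visible.
    load = [len(rows) for rows in received]

    # Step 2: each module removes duplicates among the rows it received.
    per_module_distinct = [set(rows) for rows in received]

    # Equal values always land on the same module, so the union of the
    # per-module results is the global DISTINCT result.
    result = set().union(*per_module_distinct)
    return result, load


if __name__ == "__main__":
    # Illustrative data only: 9,000 null values plus 1,000 unique values.
    column = [None] * 9000 + list(range(1000))
    distinct_values, load = distinct_by_redistribution(column, num_modules=4)
    print("distinct values:", len(distinct_values))
    print("rows received per module:", load)
```

In this illustrative input, all 9,000 null rows are routed to a single simulated module, so that module receives at least 90% of the rows while the other modules share the remainder. The result is still correct, but the work is concentrated on one module, which is the bottleneck the passage describes.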