A database is a collection of stored data that is logically related and that is accessible by one or more users or applications. A popular type of database is the relational database management system (RDBMS), which includes relational tables, also referred to as relations, made up of rows and columns (also referred to as tuples and attributes). Each row represents an occurrence of an entity defined by a table, with an entity being a person, place, thing, or other object about which the table contains information. Within large corporations or organizations, a database system known as an enterprises data warehouse, may contain close to a petabyte of critical data, organized into hundreds of tables, used by many thousands of persons, performing tasks across all business or organization functions. To perform essential functions, it must operate efficiently and reliably every second of every day.
In-memory processing capabilities have recently been implemented within database systems, where data is stored and processed in CPU memory, offering much faster processing times than systems and applications limited to processing data in non-volatile or persistent storages, e.g., Hard Disk Drives (HDDs), Solid State Disk Drives (SSDs), and Flash memory.
Within relational database systems, a join operation is executed to combine records from two or more tables. A hash join is one form of join well suited to in-memory processing. In a hash join, one or both tables to be joined are fit completely inside CPU memory, with the smaller table being built as a hash table in memory, and potential matching rows from the second table are searched against the hash table.
An uneven distribution of data within a database system, referred to a skew, can result in significant performance impairments to database system operations. For example, in a hash join, a single value with a large number of occurrences will cause a long chain in the hash table. Another value with few occurrences, but with a hash value such that it hashes to the previous large value in the hash table will degrade join performance for probe table rows that match the value with few occurrences.
In massively parallel processing (MPP) systems, dealing with data skew in parallel joins is critical to the performance of many applications. An MPP system is a collection of units of computations working together on a single problem. The units of computations are called units of parallelism (UoPs).
The most effective and the most common mechanism for distributing rows in an MPP system is a hash-distribution technique. In this technique, the unit of parallelism in which the data must be stored, i.e., the target, is chosen based on the hash of the data value. This means all rows with the same data value go to the same target unit of parallelism. Different data values, but with the same hash value, will also end up in the same unit of parallelism. In a large system, with a large number of rows, this technique leads to a very even and uniform distribution of rows to all the units of parallelism.
A query in an MPP system is only as fast as the slowest unit of computation. Data skew creates more data in a single unit of computation. This requires one unit of parallelism to perform more work thereby making it slower than the other units.
There are two problems with skew in an MPP system. One is detecting skew and another is handling it for specific database operators. Described below is an improved technique for detecting and handling skew, particularly in hash join operations.