In a relational database management system (DBMS), a hash-based join operation (referred to herein as a “hash join”) is used to join larger data sets. A hash join is particularly effective when a query requests a join operation that joins a table with a large data set with a table that has a considerably smaller data set. Such queries are frequent because databases generally employ “snowflake” schemas, in which a large fact table is related to (and thus frequently joined with) multiple much smaller dimension tables, which further describe various dimensions of entries in the fact table.
Regardless of whether a snowflake schema is employed by a database, the DBMS may use the smaller of the two tables in a join operation to build a hash table in fast-access memory to perform a hash join. Such a joined table is referred to herein as a “build-side” or “dimension” table, and this phase of a hash join is referred to herein as a “build phase.” For the larger of the joined tables, referred to herein as a “probe-side” or “fact” table, the DBMS scans the table, probing the hash table by comparing hash values newly generated from the scan of the probe-side table with the hash values in the hash table. This phase of a hash join is referred to herein as a “probe” phase.
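The build phase described above can be sketched as follows. This is a minimal, hypothetical illustration in Python: the function name `build_hash_table`, the row representation as dictionaries, and the use of Python's built-in `hash()` in place of a DBMS's internal hash function are all assumptions for the sake of example, not part of any particular DBMS implementation.

```python
from collections import defaultdict

def build_hash_table(build_rows, key_col):
    """Build phase sketch: hash each build-side row on its join key
    and bucket the row under that hash value."""
    table = defaultdict(list)
    for row in build_rows:
        # Bucket the row by the hash of its join key value.
        table[hash(row[key_col])].append(row)
    return table
```

During the probe phase, each bucket can then be consulted by hash value; rows with colliding hashes share a bucket, which is why join keys must still be re-compared after a hash match.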
The probe phase of a hash join operation may be resource heavy. During the probe phase, for each row in a large data set, the DBMS a) computes a hash value on the join key value(s) of the row, b) matches the hash value against the hash table from the build phase, c) retrieves the rows of the build-side table corresponding to the matched hash values, and d) generates the result set.
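The per-row steps a) through d) above can be sketched end to end as follows. This is a hypothetical Python sketch, not a DBMS's actual implementation: rows are dictionaries, `hash()` stands in for the engine's hash function, and the explicit key re-comparison in step c) handles hash collisions.

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    # Build phase: hash the smaller (build-side) table once.
    buckets = {}
    for b in build_rows:
        buckets.setdefault(hash(b[build_key]), []).append(b)

    # Probe phase: performed once per probe-side row.
    result = []
    for p in probe_rows:
        h = hash(p[probe_key])                    # a) hash the join key
        for b in buckets.get(h, ()):              # b) match against hash table
            if b[build_key] == p[probe_key]:      # c) re-compare keys, fetch row
                result.append({**b, **p})         # d) emit joined result row
    return result
```

Note that steps a) through d) run for every probe-side row, which is the per-row cost the following paragraph quantifies.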
Computing a hash value for millions of rows in a probe-side table can be very resource intensive. Other resource-heavy operations, such as additional comparisons of join key values and retrieval of the matched build-side row information, are also performed for each row of the probe-side table. Performing all of these operations on millions of values becomes extremely taxing even for the most well-provisioned computing systems.
Other query operations, such as group-by and expression evaluations, have a similar performance bottleneck when used on large data sets. For example, for the group-by operation, a similar hashing of every value may be necessary to arrange the rows according to the group-by key values. As an alternative to hashing, sorting according to the group-by key values may be as taxing on computing resources as arranging based on hashing. The problem is further exacerbated for the group-by operation because the group-by operation is a blocking operation in query processing. Other operations necessarily have to wait for the group-by operation to complete before they can execute, further delaying the execution time of the query.
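The hash-based group-by and its blocking character can be illustrated with a short sketch. This is a hypothetical Python example (names and row layout are assumptions); a Python dictionary stands in for the engine's hash table, and the key point is that no group can be emitted until the entire input has been scanned.

```python
def hash_group_by_sum(rows, key_col, agg_col):
    """Group-by sketch: every input row must be hashed and bucketed
    before any aggregated group can be emitted (a blocking operation)."""
    groups = {}
    for row in rows:
        k = row[key_col]  # each group-by key value is hashed into the dict
        groups[k] = groups.get(k, 0) + row[agg_col]
    # Only after the full scan completes can downstream operations consume
    # the result -- operations waiting on this output are stalled until here.
    return groups
```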
One approach to expediting these operations is to partition the probe-side table and perform the operations in parallel on the respective partitions. Although such an approach may save query processing time, the total amount of resources committed to hash computation does not change. Additionally, such parallel processing may starve other queries of the resources needed to execute in parallel.
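The partitioned approach can be sketched as follows. This is a hypothetical Python illustration using threads (a real DBMS would use its own parallel execution machinery); the partitioning scheme, `n_parts` parameter, and the shape of `build_index` (a dictionary keyed directly by join key value) are all assumptions. Note that the same per-row probing work is done in total; it is merely spread across workers.

```python
from concurrent.futures import ThreadPoolExecutor

def partitioned_probe(build_index, probe_rows, probe_key, n_parts=4):
    """Partition the probe side and probe each partition in parallel.
    Total hash/probe work is unchanged; only wall-clock time may improve."""
    # Round-robin partitioning of the probe-side rows.
    parts = [probe_rows[i::n_parts] for i in range(n_parts)]

    def probe(part):
        out = []
        for p in part:
            for b in build_index.get(p[probe_key], ()):
                out.append({**b, **p})  # emit joined row
        return out

    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        chunks = pool.map(probe, parts)
    return [row for chunk in chunks for row in chunk]
```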
A similar approach is to allocate much greater fast-access main memory to query operations on tables with large data sets and to fit more data into the volatile memory. Main memory typically comprises volatile memory, which is becoming cheaper and larger, thereby allowing more data to be cached from disk storage into volatile memory. Such caching allows fast access to the data and, for a DBMS that uses the data, speedier performance of the work. However, the amount of data typically contained in larger tables has also significantly increased. In particular, in order to completely cache larger (probe-side) tables in volatile memory, the DBMS would require an exorbitant amount of volatile memory. Thus, regardless of the size of volatile memory, there may still exist data (and in some cases a significant portion of data) that cannot be concurrently cached in the volatile memory. Such data would necessarily be accessed from disk storage and loaded into the cache on an as-needed basis (replacing other data in the cache).
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.