Database management systems (DBMS's) are often designed to maintain large amounts of data. The amount of data stored is so large that the databases storing the information may colloquially be referred to as “data warehouses.” Although the data stored in a data warehouse is large and describes a variety of entities, events, etc., the data items of the data may be very related.
To efficiently store large amounts of related data in a data warehouse, the DBMS may employ star/snowflake schema for the databases. The star/snowflake schema relates a set of factual data (sale transactions, quote requests) to analytic dimensions (customers, time, locations, merchandise types). Based on the schema the factual data is stored in “fact tables” that have one row per item, and the analytic dimensions are stored in “dimension tables,” which describe multiple characteristics of items. Such a schema saves storage space in the data warehouse by eliminating repetition of the characteristics information common to multiple items for each of those items.
With time, as new facts are collected, factual data in a data warehouse tends to grow much faster than analytic dimension data, making fact tables generally much larger than dimension tables (having less rows than fact tables). The size and other data statistics as well as the database schema definition itself may be used by the DBMS to determine which tables are fact tables and which tables are dimension tables.
Dimension and fact tables are commonly joined by queries for business analysis. The combined data of the dimension and fact tables is aggregated to give the user a high level view of business performance. For example, sales might be aggregated by month and by the customer's home state.
Aggregation queries that are run against data that is organized in a star/snowflake schema typically specify a join key, an aggregate measure and a grouping key. The join key determines which row in a dimension table should be combined with a given row in the fact table, while the aggregate measure and grouping key determine which rows and columns are aggregated. For example, in    “select D.state, SUM (F.amount)    from F, D    where F.city_id=D.city_id    group by D.state” query, the “city_id” columns are join keys, the “amount” column is an aggregate measure and the “state” column is a grouping key.
The overall execution time is dominated by the processing of joins and aggregation. The query specifies that each row in fact table F should be joined with the row in dimension table D that has the same value for “city_id” as the fact table row. Further, once the rows from the fact table are joined to the appropriate rows in the dimension table, the rows should be divided into groups based on state, and the “amount” column values for each unique state are aggregated by summation.
One approach for executing such queries is to perform a hash join using hash tables. A “hash join” operation, generally speaking, comprises two phases. In the first phase, known as the “build phase,” the DBMS generates a hash table by hashing each row of the first table according to a hash function on the join key column(s). In the second phase, known as the “probe phase,” the DBMS then iteratively scans through each row of the second table. For each row in the second table, the DBMS uses the hash function and the hash table to identify rows in the first table with equivalent join key values. When matching rows are identified, the rows are merged and added to the result set for the join operation, assuming the rows also match any applicable join predicates. One technique for performing hash join operations is described in U.S. Pat. No. 8,825,678, issued Sep. 2, 2014, the contents of which are incorporated herein by reference.
For a large table, using hash tables to perform join and aggregation operations in this manner is inefficient as it requires a significant amount of memory for hashing numerous rows of large tables. Therefore, database systems that contain large amounts of information suffer significant performance degradation when performing join and aggregation operations using the hash join.