Aggregation and duplicate removal are common operations performed on data stored in databases. Data in a relational database management system (RDBMS) is aggregated in response to a query, such as a SQL query, that includes an aggregation function (e.g., SUM, COUNT, AVG, etc.) with a GROUP BY clause. Duplicate records are removed from data in a RDBMS in response to a SELECT DISTINCT clause. Data is either aggregated on the fly or is pre-computed in materialized views or tables for later use. Data aggregation is a critical operation in data warehousing and OLAP environments, as a large volume (e.g., terabytes) of the data needs to be aggregated and, typically, in a fixed time period. Because of its widespread and frequent use, performing data aggregation and duplicate removal operations as efficiently as possible is critical.
Aggregation and duplicate removal operations require bringing together records having the same key values. In some database systems, these operations are performed using a sort-based algorithm. Sort-based algorithms typically employ a sort based on an in-memory (i.e., in system memory) insertion tree, to find the matching records, i.e., records with the same values for key columns. These matching records are either aggregated or eliminated based on whether the operation is aggregation or duplicate removal, respectively. Each input record is looked up in the insertion tree to determine whether or not there is already a matching record in the tree. If there is already a matching record in the in-memory tree, then, when the operation being performed is aggregation, the input record is aggregated with the existing record. For duplicate removal, the input record is ignored because the input record is a duplicate of the matching record.
Performing a lookup operation in the insertion tree involves traversing the tree, and performing as many comparisons as there are levels in the tree. The complexity of such an algorithm is O(N log M), where N is the number of input records and M is the number of output records. If there is not enough system memory available to complete the entire operation, then the contents of the in-memory insertion tree get flushed to persistent storage (e.g., disk) and contribute to what is referred to as a sorted run. These sorted runs are then merged in stages to finish the sorting and aggregation process. The number of merge stages (also referred to as merge passes) required is logarithmic to the number of sorted runs, with each merge pass requiring a read and write of data to disk. Thus, the sort-based aggregation algorithm has logarithmic complexity and is computationally expensive.
Another disadvantage with sort-based algorithms is that results cannot be returned to the user until the sort is finished. Consequently, the query response time is higher than with alternative algorithms.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.