Aggregation functions such as sums, counts, minimums, maximums, and averages are often used as foundational analysis tools for data sets. For example, aggregation functions may be applied to sales data to extract statistics such as total sales grouped by month, order counts grouped by day, or average order values for each customer. A company can use this information to track sales, evaluate policy, develop marketing strategy, project future growth, and perform various other tasks.
In a database context, aggregation operators such as sum, count, minimum, maximum, and average can be applied to a given set of records, which can be further grouped according to one or more key values such as order date or customer number. The desired aggregation operators and groupings can be specified in a database query, such as a SQL query.
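As an illustration of such a query, the following minimal sketch applies several aggregation operators to records grouped by a single key. It assumes a hypothetical orders table with customer_id, order_date, and amount columns; the table name, column names, and sample values are invented for this example, and an in-memory SQLite database merely stands in for the database software.

```python
import sqlite3

# Hypothetical sales data; table name, columns, and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2024-01-05", 20.0),
        (1, "2024-01-07", 40.0),
        (2, "2024-01-05", 10.0),
        (2, "2024-02-01", 30.0),
        (2, "2024-02-03", 50.0),
    ],
)

# The aggregation operators and the grouping key are both named in the query.
rows = conn.execute(
    """
    SELECT customer_id,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_amount,
           MIN(amount) AS smallest_order,
           MAX(amount) AS largest_order,
           AVG(amount) AS average_order
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

for row in rows:
    print(row)
```

Grouping by order_date instead of customer_id would require changing only the GROUP BY key and the selected key column.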
Given the importance of aggregation functions for data analysis, providing a quick result for aggregation queries is often a key database performance metric. To answer aggregation queries in an accelerated fashion, the aggregation query can be formulated as a parallel operation when creating a query execution plan for execution by the database software.
In a typical parallelization approach, data to be aggregated may be processed by multiple record source processes, which distribute the records by key value to multiple aggregation operator processes. However, distributing the records to the aggregation operators introduces a significant bottleneck, because the distribution may occur over a bandwidth-restricted, relatively high-latency network connection. Bottlenecks may also arise from transfer overhead when shared memory or disk structures are used. These bottlenecks become especially acute when processing large numbers of records. Additionally, the above approach breaks down when key values are skewed: records sharing a frequently occurring key value are all distributed to a single aggregation operator process, which precludes effective parallelization of the workload.
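The effect of skew on key-based distribution can be seen in a small simulation. The sketch below is an assumption-laden illustration, not an actual database implementation: it runs in a single process with no network transfer, the function names distribute and aggregate are hypothetical, integer keys are used, and a simple modulo stands in for whatever hash function a real system would apply.

```python
from collections import Counter, defaultdict

def distribute(records, num_aggregators):
    """Route each (key, value) record to the aggregator selected by its key."""
    partitions = [[] for _ in range(num_aggregators)]
    for key, value in records:
        # Modulo is a stand-in for a real hash-based distribution function.
        partitions[key % num_aggregators].append((key, value))
    return partitions

def aggregate(partition):
    """Compute (sum, count) per key for the records one aggregator received."""
    sums = defaultdict(float)
    counts = Counter()
    for key, value in partition:
        sums[key] += value
        counts[key] += 1
    return {key: (sums[key], counts[key]) for key in sums}

# 800 records spread evenly over 8 key values.
uniform = [(i % 8, 1.0) for i in range(800)]

# 800 records where 720 share key value 3 and the rest are spread evenly.
skewed = [(3, 1.0)] * 720 + [(i % 8, 1.0) for i in range(80)]

# Number of records each of 4 simulated aggregators must process.
uniform_load = [len(p) for p in distribute(uniform, 4)]
skewed_load = [len(p) for p in distribute(skewed, 4)]

print("uniform load:", uniform_load)
print("skewed load: ", skewed_load)
```

With the uniform input each simulated aggregator receives 200 records, whereas with the skewed input one aggregator receives 740 of the 800 records and the other three receive 20 each, so adding aggregators does little to shorten the overall work.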
Based on the foregoing, there is a need for developing a query execution plan that processes aggregation operators in a highly efficient and scalable fashion while also accommodating skewed key values.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.