Present invention embodiments relate to multistage aggregation, and more specifically, to using digest order to reduce downstream computations after a first stage of aggregation.
The largest consumers of central processing unit (CPU) resources in a data warehouse are aggregation operations such as “GROUP BY” and “JOIN” operations. In massively parallel processing systems, GROUP BY operations are often performed in two phases: a first phase involving local aggregation of data, e.g., data located on a single node, followed by a repartitioning of the data, and a second phase involving global aggregation of data, e.g., combining first phase data from a plurality of nodes. If there are a large number of distinct aggregation groups, the first phase reduces the data very little, so the second phase can do almost as much computational work as the first phase, and a significant performance cost is incurred.
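By way of illustration only, the two-phase GROUP BY described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any embodiment: the aggregate is assumed to be a SUM, each node is modeled as an in-memory list of (key, value) rows, repartitioning is assumed to be by key hash, and the function names `local_aggregate`, `repartition`, `global_aggregate`, and `two_phase_group_by_sum` are hypothetical.

```python
from collections import defaultdict


def local_aggregate(rows):
    """Phase 1 (assumed SUM): combine rows with the same key on one node."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def repartition(partials, num_partitions):
    """Route every partial (key, subtotal) to a partition chosen by key hash,
    so that all subtotals for a given key land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for partial in partials:
        for key, subtotal in partial.items():
            partitions[hash(key) % num_partitions].append((key, subtotal))
    return partitions


def global_aggregate(partition):
    """Phase 2: combine the per-node subtotals for each key."""
    total = defaultdict(int)
    for key, subtotal in partition:
        total[key] += subtotal
    return dict(total)


def two_phase_group_by_sum(nodes, num_partitions=2):
    """Run phase 1 on each node, repartition, then run phase 2."""
    partials = [local_aggregate(rows) for rows in nodes]
    result = {}
    for partition in repartition(partials, num_partitions):
        # Keys are disjoint across partitions, so update() never overwrites.
        result.update(global_aggregate(partition))
    return result


# Hypothetical data: two nodes, each holding (key, value) rows.
nodes = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 10), ("c", 5)],
]
totals = two_phase_group_by_sum(nodes)

# Rows entering phase 2 equal the number of distinct (node, key) pairs.
rows_into_phase_two = sum(len(local_aggregate(rows)) for rows in nodes)
```

In this sketch, five input rows are reduced to four partial rows before the second phase. When nearly every row carries a distinct key, the first phase removes almost nothing, and the second phase processes close to the full input, which is the performance cost noted above.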