Computer systems may rely on parallel computations for increased performance. For storage of large datasets, parallel computers may distribute or partition data records to multiple locations, e.g., local storage attached to nodes in a parallel computer. One method of improving performance and scalability may be to employ efficient local processes as well as minimal re-partitioning of intermediate results. For example, one process may include data reduction prior to re-partitioning, e.g., by grouping or compression. If grouping or compression is not effective, e.g., because input data is already compressed, the attempt to compress (group, etc.) the data prior to re-partitioning may not be beneficial. Further, it may not be known a priori whether or not compression (grouping, etc.) will be effective. Thus, any a priori decision may turn out to be sub-optimal during execution.
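By way of illustration only, the following minimal Python sketch shows one way a node might test at run time, on a sample of its own data, whether a reduction step is likely to pay off before re-partitioning. The helper names (`compression_ratio`, `should_reduce_before_repartitioning`), the 0.8 threshold, and the use of `zlib` with pseudo-random bytes as a stand-in for already-compressed input are assumptions made for this example and are not taken from the description above.

```python
import random
import zlib


def compression_ratio(sample: bytes) -> float:
    """Return compressed size divided by original size (1.0 for empty input)."""
    if not sample:
        return 1.0
    return len(zlib.compress(sample)) / len(sample)


def should_reduce_before_repartitioning(sample: bytes, max_ratio: float = 0.8) -> bool:
    """Hypothetical run-time check: reduce only if the sample shrinks enough.

    The threshold max_ratio is an illustrative assumption, not a tuned value.
    """
    return compression_ratio(sample) <= max_ratio


# Repetitive, text-like records compress well.
text_like = b"region=EU;status=open;" * 200

# Pseudo-random bytes stand in for input that is already compressed.
rng = random.Random(0)
already_dense = bytes(rng.getrandbits(8) for _ in range(4096))

print(should_reduce_before_repartitioning(text_like))
print(should_reduce_before_repartitioning(already_dense))
```

Under these assumptions, the decision is made from the observed data rather than fixed in advance, which is the situation the paragraph above describes as difficult to settle a priori.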
An example of a parallel system may include a parallel database management system, where each table or index (dataset) may be partitioned across multiple nodes, each with its own storage or assigned storage partition. Rows in a table (or records in an index) may be partitioned using a range-based or hash-based partitioning function such that equal values are assigned to the same partition. When processing a SQL “group by” query, local grouping at each node may be sufficient to compute the global result if the grouping columns in the query are equal to (or, more generally, functionally dependent on) the partitioning columns of the stored dataset. If this is not the case, evaluation of the query may require re-partitioning of the data.
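By way of illustration only, the following minimal Python sketch shows hash-based partitioning and the simple case in which the grouping columns equal the partitioning columns, so that each node's local grouping result is already part of the global result. The function names, the row layout (dictionaries with `customer` and `amount` fields), and the use of Python's built-in `hash` as the partitioning function are assumptions made for this example.

```python
def hash_partition(rows, partition_columns, num_nodes):
    """Assign each row to a node so that equal key values share a partition."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        key = tuple(row[c] for c in partition_columns)
        partitions[hash(key) % num_nodes].append(row)
    return partitions


def group_sum(rows, group_columns, value_column):
    """Compute SUM(value_column) GROUP BY group_columns over the given rows."""
    totals = {}
    for row in rows:
        key = tuple(row[c] for c in group_columns)
        totals[key] = totals.get(key, 0) + row[value_column]
    return totals


def union_of_local_results(partitions, group_columns, value_column):
    """Group locally at each node and combine the per-node results.

    Returns the combined result and a flag telling whether any group key
    was produced by more than one node (which would require a further
    merge step, i.e. re-partitioning).
    """
    combined, overlap = {}, False
    for part in partitions:
        for key, total in group_sum(part, group_columns, value_column).items():
            if key in combined:
                overlap = True
            combined[key] = combined.get(key, 0) + total
    return combined, overlap


# Hypothetical table: 20 rows, 5 customers, stored on 3 nodes by customer.
rows = [{"customer": i % 5, "amount": i} for i in range(20)]
partitions = hash_partition(rows, ["customer"], num_nodes=3)

# Grouping on the partitioning column: no group spans two nodes.
local, overlap = union_of_local_results(partitions, ["customer"], "amount")
print(overlap, local == group_sum(rows, ["customer"], "amount"))
```

In this sketch, no group key is produced by more than one node when the query groups on the partitioning column, so no data needs to move between nodes.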
An example of an evaluation plan may include re-partitioning of the data according to the grouping attributes followed by a single step for grouping and aggregation. If, however, grouping reduces the data volume by a factor larger than the number of nodes involved, then the data volume that is to be re-partitioned can be reduced by grouping and aggregation prior to re-partitioning. In other words, some grouping, aggregation, and data reduction precedes re-partitioning, and the final grouping and aggregation follows. In practice, however, data distributions are generally not uniform. If data partitions differ in their characteristics, the optimal computation for each partition may differ as well.
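By way of illustration only, the following minimal Python sketch contrasts the two evaluation plans described above: re-partitioning followed by a single aggregation step, and local pre-aggregation followed by re-partitioning and a final aggregation. The function names, the representation of rows as `(group_key, value)` pairs, the choice of SUM as the aggregate, and the count of shipped pairs as a proxy for data volume are assumptions made for this example.

```python
from collections import Counter


def repartition(pairs, num_nodes):
    """Route (group_key, value) pairs to nodes by hashing the group key."""
    targets = [[] for _ in range(num_nodes)]
    for key, value in pairs:
        targets[hash(key) % num_nodes].append((key, value))
    return targets


def aggregate(pairs):
    """Sum the values of all pairs that share a group key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def final_step(shipped, num_nodes):
    """Re-partition the shipped pairs and aggregate at each receiving node."""
    result = {}
    for node_pairs in repartition(shipped, num_nodes):
        # Each key arrives at exactly one node, so the results are disjoint.
        result.update(aggregate(node_pairs))
    return result


def plan_single_step(partitions):
    """Ship every input pair, then group and aggregate once."""
    shipped = [pair for part in partitions for pair in part]
    return final_step(shipped, len(partitions)), len(shipped)


def plan_two_phase(partitions):
    """Pre-aggregate locally, ship the partial sums, then aggregate again."""
    shipped = [pair for part in partitions for pair in aggregate(part).items()]
    return final_step(shipped, len(partitions)), len(shipped)


# Few distinct keys per node: local pre-aggregation shrinks what is shipped.
few_keys = [[(i % 3, 1) for i in range(100)] for _ in range(4)]
# Every key unique: local pre-aggregation does work without reducing volume.
unique_keys = [[(n * 100 + i, 1) for i in range(100)] for n in range(4)]

for name, parts in (("few keys", few_keys), ("unique keys", unique_keys)):
    _, single = plan_single_step(parts)
    _, two_phase = plan_two_phase(parts)
    print(name, "shipped:", single, "vs", two_phase)
```

Both plans produce the same result in this sketch; they differ only in how many pairs cross the network. The second dataset shows the case where early grouping reduces nothing, and a node holding such data would do the extra local work for no benefit, which is why partitions with different characteristics may call for different plans.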