Queries commonly involve operations, such as sort and aggregation, which, if computed in parallel, require data to be redistributed from one set of slave processes to another set of slave processes. For example, a parallel sort operation may involve a set of table scan slaves and a set of sort slaves. The table scan slaves read data items from a table. Each of the sort slaves is assigned a value range. The table scan slaves redistribute the data items to the sort slaves based on (a) the sort key values of the items and (b) the ranges assigned to the sort slaves. Each sort slave then sorts the data items that the sort slave receives from the table scan slaves.
The bundle of operations that is assigned to a particular slave set is referred to herein as a DFO (Data Flow Operation). Thus, boundaries between DFOs represent points at which data items may be redistributed from one slave set and to another slave set.
The redistribution of data between DFOs can be very costly, since the DFOs might be executed by slave sets on different nodes. When data items are redistributed between nodes, network I/O contributes significantly to the completion time of the query.
For example, this query (Q1): select avg(sal) from emp group by deptno, mgr over table emp partitioned on deptno column,
has the following execution plan, with Query Coordinator represented by QC and Parallel Execution represented by PX:
OperationNameTQIN-OUTPQDistribSELECT STATEMENT PX COORDINATOR  PX SEND QC (RANDOM):TQ1000101P->SQC(RANDOM)   HASH GROUP BY01PCWP    PX RECEIVE01PCWP     PX SEND HASH:TQ1000000P->PHASH      PX BLOCK ITERATOR00PCWC       TABLE ACCESS FULLEMP00PCWP
Like in the classical rowsource or iterator execution model, the execution plan is started top-down, each parent opening and fetching from its child iterators and the rows flow from the bottom-up the iterator tree in a demand-driven manner. Thus, at the bottom of the iterator tree a set of table scan slaves scan table EMP (TABLE ACCESS FULL). The table scan slaves iterate until all rows of EMP have been scanned (PX BLOCK ITERATOR), and then redistribute the rows using a hash distribution (PX SEND HASH).
The rows are received by a set of group by slaves (PX RECEIVE). The group by slaves group the values based on the grouping key (HASH GROUP BY). The group by slaves then send their results to a query coordinator (PX SEND QC (RANDOM)).
In the preceding example, between the table scan slaves and the group-by slaves, all the rows from EMP are redistributed based on the grouping key. This redistribution may be costly, particularly if some or all of the table scan slaves reside on different nodes than the group-by slaves.
Another performance concern for the above plan is the amount of memory used for the aggregation. If sort aggregation is used, the number of rows sorted by each aggregation slave will be approximately the number of rows in the table (num_rows_table) divided by the degree of parallelism (DOP). This number can be very large, and might not fit into the volatile memory available to the slave for performing the operation. In this context, the degree of parallelism is the number of aggregation slaves. If persistent storage is used to perform the sort aggregation, multiple passes through the disk will be required, and that affects the performance of the sort.
One way to reduce the amount of data items that need to be redistributed between slave groups is referred to herein as a push-down aggregation technique. Using the push-down aggregation technique, the table scan slaves perform partial aggregation prior to redistributing the data items to the aggregation slaves. Using the push-down aggregation approach, the parallel plan for query Q1 may look like:
OperationNameTQIN-OUTPQ DistribSELECT STATEMENT PX COORDINATOR  PX SEND QC (RANDOM):TQ1000101P->SQC (RAND)   HASH GROUP BY01PCWP    PX RECEIVE01PCWP     PX SEND HASH:TQ1000000P->PHASH      HASH GROUP BY00PCWP       PX BLOCK ITERATOR00PCWC        TABLE ACCESS FULLEMP00PCWP
Because partial aggregation is performed by the table scan slaves, only the partially aggregated data needs to be redistributed. Under some conditions, the partially aggregated data may be significantly less than the original input data. Doing a preliminary aggregation before the redistribution can significantly reduce the TQ traffic if the number of rows with identical grouping key values is large enough, since each slave in the scan DFO is now outputting only one row per group.
Unfortunately, the push-down aggregation technique is less helpful when each table scan slave encounters few identical group-by values. Even though many rows may have the same group-by value, if those rows are scanned by different table scan slaves, the partial aggregation performed by the table scan slaves will not reduce the number of rows, from that group, that will have to be redistributed to the aggregation slaves.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.