Data-parallel computation processes (or jobs) typically involve multiple parallel-computation phases that are defined by user-defined functions (UDFs). One factor in data-parallel computation is the creation of data-partitions with appropriate properties to facilitate independent parallel computation on separate machines or partitions in each phase. For example, before a reducer UDF can be applied in a reduce phase, data-partitions are often clustered with respect to a reduce key so that all data entries with the same reduce key are mapped to the same partition and are contiguous within it.
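The clustering property described above can be illustrated with a minimal Python sketch. It assumes that a partition is a list of (key, value) pairs and that a reducer UDF takes a key and a list of values. The names apply_reducer and sum_reducer are hypothetical and are not taken from any particular system.

```python
from itertools import groupby
from operator import itemgetter


def apply_reducer(partition, reducer):
    """Apply a reducer UDF to each contiguous run of entries sharing a key.

    The partition is scanned once, so the result is only correct when all
    entries with the same reduce key are contiguous (clustered).
    """
    results = []
    for key, group in groupby(partition, key=itemgetter(0)):
        results.append(reducer(key, [value for _, value in group]))
    return results


def sum_reducer(key, values):
    """Hypothetical reducer UDF: sum the values for one reduce key."""
    return (key, sum(values))


# Clustered: every "a" entry is adjacent, so "a" is reduced exactly once.
clustered = [("a", 1), ("a", 2), ("b", 5)]
# Not clustered: the "a" entries are split, so "a" is reduced twice.
unclustered = [("a", 1), ("b", 5), ("a", 2)]

clustered_result = apply_reducer(clustered, sum_reducer)
unclustered_result = apply_reducer(unclustered, sum_reducer)
```

With the clustered input, the reducer sees both values for key "a" in a single call. With the unclustered input, it is called twice for "a" with partial data, which is why the partition property must be established before the reduce phase.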
To achieve desirable data-partition properties, data-shuffling stages are often introduced to prepare data for parallel processing in future phases. A data-shuffling stage may re-organize and re-distribute data into appropriate data-partitions. For example, before applying a reducer UDF, a data-shuffling stage might perform a local sort on each partition, re-partition the data on each source machine for re-distribution to destination machines, and do a multi-way merge on the re-distributed sorted data streams from the source machines, all based on the reduce key. However, data-shuffling tends to incur expensive network and disk input/output (I/O) operations because it involves all of the data.
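The three shuffle steps named above (local sort, re-partition, multi-way merge) can be sketched in Python as follows. This is an in-memory illustration only, assuming string reduce keys and hash-based re-partitioning with a CRC32 checksum. The names destination and shuffle are hypothetical, and a real system would move the streams over the network and spill to disk rather than hold them in lists.

```python
import heapq
import zlib
from operator import itemgetter


def destination(key, num_destinations):
    """Map a reduce key to a destination partition (assumed hash scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_destinations


def shuffle(source_partitions, num_destinations):
    """Re-distribute (key, value) entries so each key lands in one partition."""
    # streams[d] holds one sorted stream per source machine for destination d.
    streams = [[] for _ in range(num_destinations)]
    for source in source_partitions:
        # Step 1: local sort of the source partition on the reduce key.
        local_sorted = sorted(source, key=itemgetter(0))
        # Step 2: re-partition; splitting a sorted list keeps each piece sorted.
        outgoing = [[] for _ in range(num_destinations)]
        for entry in local_sorted:
            outgoing[destination(entry[0], num_destinations)].append(entry)
        for d, stream in enumerate(outgoing):
            streams[d].append(stream)
    # Step 3: multi-way merge of the sorted streams at each destination.
    return [list(heapq.merge(*s, key=itemgetter(0))) for s in streams]


sources = [
    [("b", 1), ("a", 2), ("c", 3)],
    [("a", 4), ("c", 5), ("b", 6)],
]
shuffled = shuffle(sources, num_destinations=2)
```

After the call, each destination partition is sorted on the reduce key, so entries with the same key are contiguous and no key appears in more than one partition. Note that every input entry passes through all three steps, which reflects the I/O cost mentioned above.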