This specification relates to shuffling operations in a distributed data processing system.
A shuffle operation is an intermediate step in a distributed data processing system in which data produced by writers is grouped by key for consumption by readers. One example of a distributed data processing algorithm that uses a shuffle operation is a map reduce algorithm. The writers are implemented in the map phase, during which parallel tasks are created to operate on data and generate intermediate results. In the shuffle phase, the partial computation results of the map phase are arranged for access by the readers, which implement the reduce operation. During the reduce phase, each reader executes a reduce task that aggregates the data generated by the map phase. Other distributed data processing algorithms also rely on a shuffle operation to group data.
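The three phases can be illustrated with a minimal single-process sketch. It assumes a word-count workload, which the specification does not name, and the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical. A real system would run the writers and readers as parallel tasks on separate machines.

```python
from collections import defaultdict


def map_phase(documents):
    """Writers: emit one (key, value) record per word as an intermediate result."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)


def shuffle_phase(records):
    """Shuffle: group the intermediate records by key for the readers."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Readers: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # Hypothetical driver that chains the three phases in order.
    return reduce_phase(shuffle_phase(map_phase(documents)))
```

Calling `word_count(["a b a", "b c"])` groups the five intermediate records under three keys before the reduce step sums them.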
The shuffle operation involves grouping a stream of records according to keys included in the records. The keys may be alphanumeric strings or numerical identifiers. The records may be presented to the shuffle operation by a set of shuffle writers in an arbitrary order. A set of shuffler components may receive the records and group them according to those keys. The shuffler components may then provide the records, now grouped by key, to a set of shuffle readers.
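The flow from shuffle writers through shuffler components to shuffle readers might be sketched as follows. This is an illustration under stated assumptions, not the method of the specification: the `Shuffler` class, the `route` function, and the choice of a CRC32 hash to assign keys to shufflers are all hypothetical, and everything runs in one process.

```python
import zlib
from collections import defaultdict


class Shuffler:
    """Hypothetical shuffler component that groups received records by key."""

    def __init__(self):
        self._groups = defaultdict(list)

    def receive(self, record):
        key, value = record
        self._groups[key].append(value)

    def grouped(self):
        # Keys may be strings or numbers, so order them by string form.
        for key in sorted(self._groups, key=str):
            yield key, list(self._groups[key])


def route(key, num_shufflers):
    """Assumed routing rule: a stable hash sends each key to one shuffler."""
    return zlib.crc32(str(key).encode("utf-8")) % num_shufflers


def shuffle(records, num_shufflers=2):
    """Accept records in any order; return one key-grouped dict per shuffler."""
    shufflers = [Shuffler() for _ in range(num_shufflers)]
    for record in records:
        shufflers[route(record[0], num_shufflers)].receive(record)
    return [dict(s.grouped()) for s in shufflers]
```

Because the routing depends only on the key, all records that share a key reach the same shuffler regardless of the order in which the writers present them, so a reader can obtain every value for a key from a single shuffler.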