Large-scale data processing may include extracting records from data blocks within datasets and processing them into key/value pairs. The implementation of large-scale data processing may include the distribution of data and computations among multiple disks and processors to make use of aggregate storage space and computing power. A parallel processing system may include one or more processing devices and one or more storage devices. Storage devices may store instructions that, when executed by the one or more processing devices, implement a set of map processes and a set of reduce processes.
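The map and reduce processes described above can be illustrated with a minimal sketch. This is not the claimed implementation; the word-count map and reduce functions and the function names `map_phase` and `reduce_phase` are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(records):
    # Hypothetical map function: emit a (word, 1) key/value pair
    # for each word extracted from a record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Group values by key, then apply a hypothetical reduce
    # function (summation) to each group.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}

result = reduce_phase(map_phase(["a b a", "b c"]))
# result == {"a": 2, "b": 2, "c": 1}
```

In a distributed deployment, the map and reduce phases would run on separate worker processes rather than in a single address space as shown here.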
A master-worker design pattern may be used for large-scale data processing. This pattern consists of a work job master (master) and one or more worker instances. The master takes a data processing job, divides it into smaller tasks, and assigns those tasks to the worker processes.
FIG. 1 illustrates a master-worker distributed system as discussed above. In the system, a master (120) assigns application-specific data processing tasks to workers (104, 170). A given worker performs its assigned task and notifies the master when the task is complete, at which point the master may assign a new task to the worker. The system receives a data set as input (102), divides the data set into data blocks (101), performs application-specific tasks, and produces final output files (110a, 110n). The system as depicted in FIG. 1 is commonly referred to as the MapReduce model.
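The master's task-assignment loop can be sketched as follows. This is a simplified, single-process illustration (not the system of FIG. 1): workers are modeled as callables, the "notification" is simply the worker's return, and the name `run_master` is an assumption.

```python
from collections import deque

def run_master(tasks, workers):
    """Hypothetical master loop: hand one task to each idle worker,
    assigning a new task as each worker reports completion."""
    pending = deque(tasks)
    idle = deque(workers)
    results = []
    while pending:
        worker = idle.popleft()       # pick an idle worker
        task = pending.popleft()      # assign the next task to it
        results.append(worker(task))  # worker runs the task and "notifies"
        idle.append(worker)           # worker becomes idle again
    return results
```

A real master would dispatch tasks to remote workers asynchronously and track in-flight tasks; here completion is immediate because workers run inline.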
A parallel data processing system, such as MapReduce, receives a dataset as input and divides the dataset into data blocks called shards. The system may then decide which shard to give to a specific worker in a step referred to as shard assignment.
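The division of a dataset into shards can be illustrated by a minimal sketch. The fixed `shard_size` parameter and the splitting-by-record-count strategy are assumptions for illustration; a real system may split by byte ranges or block boundaries.

```python
def shard(dataset, shard_size):
    """Illustrative sharding: split a sequence of records into
    fixed-size data blocks (shards)."""
    return [dataset[i:i + shard_size]
            for i in range(0, len(dataset), shard_size)]

# shard(list(range(7)), 3) == [[0, 1, 2], [3, 4, 5], [6]]
```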
A goal of shard assignment is to assign each shard to a worker such that processing the shard incurs a minimum amount of overhead in terms of time and computational resources. Because shard assignment is typically carried out at the master, an inefficient assignment algorithm can overload the master's CPU and/or memory and cause master failure in large-scale systems. The efficiency of the assignment algorithm should therefore be considered in a large-scale processing job.
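One possible shard-assignment heuristic can be sketched as follows. This is an assumed greedy, locality-aware strategy offered for illustration only, not the claimed assignment algorithm; the `replicas` map (shard id to the set of workers holding a local copy) is a hypothetical input.

```python
def assign_shards(shards, workers, replicas):
    """Illustrative greedy assignment: prefer a worker that already
    holds a local replica of the shard (lower transfer overhead);
    otherwise fall back to the least-loaded worker."""
    load = {w: 0 for w in workers}
    assignment = {}
    for s in shards:
        local = replicas.get(s, set()) & set(workers)
        candidates = local if local else set(workers)
        # Cheapest candidate: the lightest-loaded eligible worker.
        best = min(candidates, key=lambda w: load[w])
        assignment[s] = best
        load[best] += 1
    return assignment
```

The greedy pass runs in time proportional to the number of shards times the number of candidate workers per shard, which matters at the master when the shard count is large.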