Large-scale data processing involves extracting data of interest from raw data in one or more datasets and processing the extracted data into a useful product. The implementation of large-scale data processing in a parallel and distributed environment may include distributing data and computations among multiple disks and processors to make use of aggregate storage space and computing power.
A master-worker design pattern may be used for large-scale data processing. This pattern consists of a work task master (master) and one or more worker instances. In the master-worker processing system, the master takes a data processing problem and divides it into smaller tasks which are executed by the worker processes.
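By way of illustration only, the master-worker pattern described above can be sketched in a few lines of Python. The function names (split_into_tasks, worker, master), the task size, and the sum-of-squares computation are hypothetical choices made for this sketch and are not taken from any particular system; a thread pool stands in for the worker processes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(problem, task_size):
    """Master-side step: divide the problem into smaller tasks."""
    return [problem[i:i + task_size] for i in range(0, len(problem), task_size)]


def worker(task):
    """Worker-side step: process one task (here, a sum of squares, chosen arbitrarily)."""
    return sum(x * x for x in task)


def master(problem, task_size=3, num_workers=4):
    """Divide the problem, hand the tasks to workers, and combine their results."""
    tasks = split_into_tasks(problem, task_size)
    # The pool plays the role of the worker instances in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(worker, tasks))
    return sum(partial_results)


if __name__ == "__main__":
    print(master(list(range(10))))
```

In this sketch the master alone decides how the problem is split, while each worker sees only its own task.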
FIG. 1 illustrates an example of a conventional master-worker distributed system as discussed above. In the conventional system, a master (120) assigns data processing tasks to workers (104, 108). The system receives a data set as input (102), divides the data set into data blocks (101), performs application-specific tasks, and produces final output files (110a, 110n). A given worker performs its assigned task and notifies the master when the task is complete. Although FIG. 1 shows a two-stage system with two sets of workers (104a . . . 104n), (108a . . . 108n), a distributed data processing system may include only one stage of workers.
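The two-stage flow of FIG. 1 may be sketched as follows, purely as an illustration under stated assumptions. Word counting is used as the application-specific task, futures returned by a thread pool stand in for the completion notifications sent to the master, and in-memory dictionaries stand in for the output files; all function names and the CRC-based partitioning are hypothetical and are not drawn from the conventional system itself.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed


def divide_into_blocks(lines, block_size):
    """Divide the input data set into data blocks."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def first_stage_worker(block, num_outputs):
    """Count words in one block, partitioned by the output each word belongs to."""
    partitions = [Counter() for _ in range(num_outputs)]
    for line in block:
        for word in line.split():
            # A stable hash (assumed here) keeps each word in a single partition.
            idx = zlib.crc32(word.encode("utf-8")) % num_outputs
            partitions[idx][word] += 1
    return partitions


def second_stage_worker(partials):
    """Merge the intermediate counts for one partition into one output."""
    total = Counter()
    for counts in partials:
        total.update(counts)
    return dict(total)


def run_master(lines, block_size=2, num_outputs=2):
    """Assign blocks to first-stage workers, then partitions to second-stage workers."""
    blocks = divide_into_blocks(lines, block_size)
    intermediate = []
    with ThreadPoolExecutor() as pool:
        stage_one = [pool.submit(first_stage_worker, b, num_outputs) for b in blocks]
        # A completed future models a worker notifying the master that it is done.
        for future in as_completed(stage_one):
            intermediate.append(future.result())
        stage_two = [
            pool.submit(second_stage_worker, [parts[i] for parts in intermediate])
            for i in range(num_outputs)
        ]
        return [future.result() for future in stage_two]


if __name__ == "__main__":
    for output in run_master(["a b a", "b c", "a"]):
        print(output)
```

The second stage in this sketch starts only after every first-stage worker has reported completion; a single-stage system would simply return the first-stage results.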
An exemplary conventional system is commonly referred to as the MapReduce model and is described in detail in "MapReduce: Simplified Data Processing on Large Clusters," OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, Calif., December, 2004 and U.S. Pat. No. 7,650,331. However, the present patent is not limited to the MapReduce context, but rather applies to the broad class of distributed parallel processing applications.