The disclosure generally relates to the field of data processing, and more particularly to virtual machine task or process management and control.
Massively scalable, distributed file systems and the MapReduce programming paradigm have been developed to store, organize, and analyze the massive volumes of data (terabytes to petabytes) being generated. A massively scalable, distributed file system (e.g., the Hadoop® distributed file system) provides fault tolerance despite the data being stored on thousands of machines made of inexpensive, commodity hardware that is likely unreliable. Analysis of this data may be for search indexing, bioinformatics, genomics, data mining, machine learning, etc. Analysis for any one of these purposes can involve computationally complex processing of large data sets (e.g., multi-gigabyte file sizes). The MapReduce programming paradigm was developed for processing very large data sets distributed across a cluster of machines that can number in the thousands. This programming paradigm conceals the complexity of distributed systems and parallelization while allowing use of the resources of a distributed system.
To implement the MapReduce paradigm, a map function and a reduce function are written for an application against a MapReduce framework. The MapReduce framework provides a library of application-independent MapReduce functionality for partitioning input data and parallelizing tasks on the partitioned data. An application submits a job for scheduling on a cluster of machines on which the MapReduce framework is deployed. Using the MapReduce framework, the job is decomposed into a set of map tasks and reduce tasks that correspond to the user-defined map function and the user-defined reduce function, respectively. These user-defined functions are written to the interfaces of the MapReduce framework. The MapReduce framework partitions the input data into smaller chunks. Multiple instances of the MapReduce framework are instantiated in the cluster. A master instance assigns map tasks to idle worker instances. A “mapper” (a worker instance assigned a map task) reads its assigned chunk and parses key/value pairs out of the chunk. The mapper then passes each key/value pair to the user-defined map function, which filters and aggregates the key/value pairs to produce intermediate key/value pairs. The master instance assigns partitions of the intermediate key/value pairs to reducers (idle worker instances assigned reduce tasks). A reducer reads a partitioned set of intermediate key/value pairs (a “region”) and sorts the pairs by intermediate key to group the data. The reducer then passes each unique intermediate key and its corresponding grouped values to the user-defined reduce function, which carries out the processing for the job.
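The flow described above can be illustrated with a minimal single-process sketch. It is not the API of Hadoop or of any particular MapReduce framework: every name in it (`split_into_chunks`, `run_mapper`, `partition`, `run_reducer`, `run_job`, and the word-count functions) is hypothetical, the record format (line number as key, line text as value) is an assumption, and the "master" simply runs the tasks one after another instead of dispatching them to worker instances on a cluster.

```python
from itertools import groupby
from operator import itemgetter
import zlib


def split_into_chunks(lines, chunk_size):
    """Partition the input data into smaller chunks (lists of lines)."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def run_mapper(chunk, chunk_offset, map_fn):
    """Parse key/value pairs out of a chunk and pass each pair to map_fn."""
    intermediate = []
    for line_no, line in enumerate(chunk, start=chunk_offset):
        # Assumed record format: key = line number, value = line text.
        intermediate.extend(map_fn(line_no, line))
    return intermediate


def partition(intermediate, num_reducers):
    """Assign intermediate pairs to regions, one region per reducer."""
    regions = [[] for _ in range(num_reducers)]
    for key, value in intermediate:
        # A stable hash keeps every occurrence of a key in the same region.
        index = zlib.crc32(key.encode("utf-8")) % num_reducers
        regions[index].append((key, value))
    return regions


def run_reducer(region, reduce_fn):
    """Sort a region by key, group the values, and call reduce_fn per key."""
    results = {}
    ordered = sorted(region, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        results[key] = reduce_fn(key, [value for _, value in group])
    return results


def run_job(lines, map_fn, reduce_fn, chunk_size=2, num_reducers=2):
    """Stand-in for the master instance; tasks run sequentially here."""
    intermediate = []
    for i, chunk in enumerate(split_into_chunks(lines, chunk_size)):
        intermediate.extend(run_mapper(chunk, i * chunk_size, map_fn))
    output = {}
    for region in partition(intermediate, num_reducers):
        output.update(run_reducer(region, reduce_fn))
    return output


# User-defined functions for a hypothetical word-count job.
def word_count_map(_line_no, line):
    return [(word.lower(), 1) for word in line.split()]


def word_count_reduce(_word, counts):
    return sum(counts)


if __name__ == "__main__":
    data = ["the quick brown fox", "jumps over the lazy dog", "the end"]
    print(run_job(data, word_count_map, word_count_reduce))
```

In this sketch only `word_count_map` and `word_count_reduce` are specific to the job; the remaining functions play the role of the application-independent framework library. Changing `chunk_size` or `num_reducers` alters how the work is divided but not the result.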