A data intensive scalable computing (DISC) system is a computing system that is distributed over a cluster or grid of computers and is designed to process large amounts of data that may be generated in a variety of applications and environments. Examples of applications and environments that generate such large amounts of data include, but are not limited to, science (e.g., imagery data), commerce (e.g., online transaction records), and society in general (e.g., medical or other personal records, web pages).
A variety of software frameworks have been introduced that support processing of large scale data sets in a DISC system. One such software framework is known as MapReduce™, which was developed by Google™ (Mountain View, Calif.) and is described, for example, in U.S. Pat. No. 7,650,331, the disclosure of which is incorporated by reference herein in its entirety. MapReduce™ is a software framework that distributes computations involving large scale data sets over the computers (nodes) of the DISC system. In general, MapReduce™ uses “mapper worker” nodes and “reducer worker” nodes to take a given task and break it into sub-tasks, which are distributed to one or more nodes of the DISC system for processing. The sub-tasks are processed, and the results are combined into a composite result for the given task. The “map” stage is generally where the given task is broken into sub-tasks, and the “reduce” stage is generally where the composite result is generated.
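By way of illustration only, the map and reduce stages described above can be sketched as a single-process word count. This is a minimal sketch and not the MapReduce™ implementation itself: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are hypothetical, and the distribution of sub-tasks across nodes is simulated by a simple loop.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map stage (hypothetical helper): one sub-task per input chunk.

    Emits an intermediate (word, 1) pair for every word in the chunk.
    """
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_outputs):
    """Group intermediate values by key across all mapper outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce stage (hypothetical helper): combine the values of one key."""
    return key, sum(values)


def run_job(chunks):
    """Run the map, shuffle, and reduce steps in a single process.

    In a DISC system each map_phase call would run on a mapper worker
    node and each reduce_phase call on a reducer worker node; here a
    plain loop stands in for that distribution.
    """
    mapped = [map_phase(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Assumed sample input: each string plays the role of one data chunk.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
result = run_job(chunks)
```

In this sketch, `result` is the composite result for the word-count task, obtained by combining the intermediate pairs produced for each chunk.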
Furthermore, access to the large scale data sets in a DISC system is typically managed by a storage file system. In the case of the MapReduce™ environment, a file system such as the Google File System (GFS) may be utilized; see, e.g., S. Ghemawat et al., “The Google File System,” 19th ACM Symposium on Operating Systems Principles, Lake George, N.Y., October 2003, the disclosure of which is incorporated by reference herein in its entirety. In GFS as applied to a DISC system, servers store “data chunks” as files in the local file system. As such, in a DISC system that employs GFS, the computation and data are tightly coupled. For example, with GFS, the intermediate result of a mapper worker node is written to a local disk of that node, and the intermediate result is then shuffled to many other reducer worker nodes. Unfortunately, because the intermediate result resides only on that local disk, if a mapper worker node fails, the intermediate result becomes unavailable and the task performed on the node has to be redone.
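The consequence of this tight coupling can be illustrated with a small simulation. This is a hedged sketch under stated assumptions, not a model of GFS or of any actual scheduler: the `MapperNode` class, the `fetch_intermediate` helper, and the sample chunks are hypothetical, and a node failure is modeled simply as the loss of the local disk contents of that node.

```python
class MapperNode:
    """Hypothetical mapper worker whose output lives only on its local disk."""

    def __init__(self, name):
        self.name = name
        self.local_disk = {}  # task_id -> intermediate (word, 1) pairs
        self.alive = True

    def run_map_task(self, task_id, chunk):
        # The intermediate result is written to this node's local disk only.
        self.local_disk[task_id] = [(word, 1) for word in chunk.split()]

    def fail(self):
        # Assumption: a failed node's local disk is no longer reachable.
        self.alive = False
        self.local_disk.clear()


def fetch_intermediate(nodes, assignment, task_id, chunks, executions):
    """Return a task's intermediate result, redoing the task if its node failed."""
    node = assignment[task_id]
    if not node.alive:
        # The result was lost with the node, so the map task is re-executed
        # on a surviving node before the shuffle can proceed.
        node = next(n for n in nodes if n.alive)
        node.run_map_task(task_id, chunks[task_id])
        assignment[task_id] = node
        executions[task_id] += 1
    return node.local_disk[task_id]


def run_with_failure():
    chunks = {0: "a b a", 1: "b c"}  # assumed sample data chunks
    nodes = [MapperNode("m0"), MapperNode("m1")]
    assignment = {}
    executions = {task_id: 0 for task_id in chunks}

    # Each map task runs once on its own node.
    for task_id, node in zip(chunks, nodes):
        node.run_map_task(task_id, chunks[task_id])
        assignment[task_id] = node
        executions[task_id] += 1

    # Node m0 fails after finishing its task but before the shuffle.
    nodes[0].fail()

    outputs = {
        task_id: fetch_intermediate(nodes, assignment, task_id, chunks, executions)
        for task_id in chunks
    }
    return outputs, executions


outputs, executions = run_with_failure()
```

In this sketch, the task originally assigned to the failed node is executed twice, while the task on the surviving node is executed once; the extra execution is the cost of keeping the intermediate result only on the local disk of the mapper worker node.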