Large-scale data processing involves extracting data of interest from raw data in one or more datasets and processing it into a useful data product. Implementing large-scale data processing in a parallel and distributed processing environment typically involves distributing data and computations among processors and data storage devices (e.g., low speed memory and high speed memory, where the data seek time on high speed memory is much shorter than the data seek time on low speed memory) to make efficient use of aggregate data storage space and computing power.
Large-scale data processing techniques such as a map-reduce operation (sometimes called a large-scale data processing operation) have proven to be remarkably flexible for parallelizing computation on clusters. Systems and methods for efficiently performing such computations are becoming increasingly important as the size of the data sets and the size of the computer clusters used to perform the computations grow. One of the hardest performance challenges is limiting the impact of (e.g., minimizing the delay caused by) stragglers in parallel computation. In one embodiment, reduce stragglers are reduce processes that are still running after a substantial portion of the total number of reduce processes have finished running (e.g., the last 10% of reduce processes to finish).
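By way of illustration only, the straggler definition given above can be expressed as a minimal Python sketch. The function name `find_stragglers`, the `straggler_percent` parameter, the process identifiers, and the finish times are all hypothetical and chosen for this example; the sketch assumes that finish times are already known, whereas a running system would have to identify stragglers while the processes are still executing.

```python
def find_stragglers(finish_times, straggler_percent=10):
    """Return the ids of the reduce processes that finish last.

    finish_times: dict mapping a process id to its finish time in seconds
        (hypothetical input, assumed to be known after the fact).
    straggler_percent: share of processes treated as stragglers
        (assumed to be 10, matching the "last 10%" example).
    """
    if not finish_times:
        return []
    # Size of the trailing group, rounded up, and never less than one.
    count = max(1, (len(finish_times) * straggler_percent + 99) // 100)
    # Order the ids by finish time, latest first, and keep the trailing group.
    latest_first = sorted(finish_times, key=finish_times.get, reverse=True)
    return latest_first[:count]


# Ten hypothetical reduce processes; one finishes far later than the rest.
finish_times = {f"reduce-{i}": 10.0 + i for i in range(9)}
finish_times["reduce-9"] = 95.0
stragglers = find_stragglers(finish_times)
```

In this example nine processes finish within 18 seconds while the tenth finishes at 95 seconds, so the completion time of the whole computation is governed by that single straggler.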
Various functional languages (e.g., LISP™) and systems provide application programmers with tools for querying and manipulating large datasets. These conventional languages and systems, however, fail to provide support for automatically parallelizing these operations across multiple processors in a distributed and parallel processing environment. Nor do these languages and systems automatically handle system faults (e.g., processor failures) and I/O scheduling. In addition, conventional large-scale data processing techniques are often adversely affected by stragglers. The disclosed system and method eliminate or reduce the impact of such stragglers on large-scale data processing computations.