The present invention relates generally to the field of MapReduce frameworks, and more particularly to management of data spills in MapReduce frameworks.
MapReduce frameworks provide the ability to process large data sets in a distributed fashion using a cluster of multiple computing nodes. In a typical MapReduce framework implementation, a plurality of mappers are each assigned a portion of data (i.e., a split) from the data set on which to perform one or more tasks (e.g., executing a map script to count occurrences of each word in a string). The output results of each mapper are sorted (e.g., shuffled such that results pertaining to the same words are grouped together) and assigned to reducers, which in turn perform one or more reduce tasks (e.g., executing a reduce script to sum all occurrence values for each word). Accordingly, the MapReduce framework not only allows large data sets to be split between many mappers and reducers, but also allows such mappers and reducers to perform their respective tasks simultaneously, which can greatly improve the speed and efficiency with which processing jobs can be completed.
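By way of illustration only, the word-count example described above can be sketched as a single-process Python program. The function names (map_words, shuffle, reduce_counts, word_count) are hypothetical and are not drawn from any particular MapReduce framework. The sketch runs the mappers and reducers sequentially, whereas an actual framework would distribute them across computing nodes.

```python
from collections import defaultdict


def map_words(split):
    """Map task: emit a (word, 1) pair for each word in one split."""
    return [(word, 1) for word in split.split()]


def shuffle(mapper_outputs):
    """Group the mapper outputs so all values for the same word sit together."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for word, count in output:
            grouped[word].append(count)
    # Sort by key so that reducers see the words in a deterministic order.
    return dict(sorted(grouped.items()))


def reduce_counts(word, counts):
    """Reduce task: sum all occurrence values for one word."""
    return word, sum(counts)


def word_count(splits):
    """Run map, shuffle, and reduce sequentially over the given splits."""
    mapper_outputs = [map_words(split) for split in splits]
    grouped = shuffle(mapper_outputs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())
```

For example, calling word_count on the two splits "a b a" and "b c" yields a count of 2 for "a", 2 for "b", and 1 for "c".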
Typically, each mapper writes its output results to a memory buffer of finite size (e.g., 100 MB). When the buffer is full, contents of the buffer are spilled to a local disk in a spill file, after which additional output results can be written to the buffer. After a mapper has written its last output result, the spill files are merged and sorted into a single output file, which can be transmitted to an assigned reducer via TCP/IP.
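The buffering and spilling behavior described above can be sketched, under simplifying assumptions, as follows. The class name SpillingBuffer, its methods, and the spill file naming are hypothetical. The buffer capacity is expressed as a number of records rather than a number of bytes, keys are assumed to contain no tab or newline characters, and the transmission of the merged output file to a reducer is omitted.

```python
import heapq
import os
import tempfile


class SpillingBuffer:
    """Hypothetical mapper output buffer that spills to disk when full."""

    def __init__(self, capacity, spill_dir):
        self.capacity = capacity    # assumed limit in records, not bytes
        self.spill_dir = spill_dir  # local directory that holds spill files
        self.records = []           # in-memory buffer of formatted lines
        self.spill_files = []       # paths of spill files written so far

    def write(self, key, value):
        """Buffer one output result; spill when the buffer is full."""
        self.records.append(f"{key}\t{value}\n")
        if len(self.records) >= self.capacity:
            self.spill()

    def spill(self):
        """Write the sorted buffer contents to a new spill file and clear it."""
        if not self.records:
            return
        name = f"spill_{len(self.spill_files)}.txt"
        path = os.path.join(self.spill_dir, name)
        with open(path, "w") as spill_file:
            spill_file.writelines(sorted(self.records))
        self.spill_files.append(path)
        self.records = []

    def finish(self, output_path):
        """Spill any remaining records, then merge all spill files."""
        self.spill()
        files = [open(path) for path in self.spill_files]
        try:
            with open(output_path, "w") as out:
                # Each spill file is already sorted, so a k-way merge of
                # their lines produces a single sorted output file.
                out.writelines(heapq.merge(*files))
        finally:
            for spill_file in files:
                spill_file.close()
        return output_path
```

In this sketch, a smaller capacity produces more spill files and therefore more disk writes and a larger final merge, which is the cost that spill management seeks to control.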