Map-Reduce is a framework for processing parallelizable problems across huge datasets using a large number of computing nodes, collectively referred to as a cluster or a grid. The processing can be performed on data stored either in a filesystem (unstructured) or in a database (structured). The processing usually comprises “map” processing, “shuffle” processing and “reduce” processing. In the “map” processing, each mapper processes input data and writes its output data to temporary storage on a disk. In the “shuffle” processing, the output data from the mappers is redistributed based on the output keys produced by the mappers, such that all data belonging to one key is located on the same node. In the “reduce” processing, the reducers process each group of output data, per key, in parallel.
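The three kinds of processing described above can be illustrated with a minimal single-machine sketch. It is only an illustration under stated assumptions: the word-count task and the function names (map_phase, shuffle_phase, reduce_phase, run_word_count) are hypothetical and are not taken from any particular Map-Reduce implementation, and everything runs in one process rather than on a cluster.

```python
from collections import defaultdict


def map_phase(record):
    """Mapper (hypothetical): emit one (key, value) pair per word in a record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(mapper_outputs):
    """Shuffle: regroup mapper output so that all values of one key sit together."""
    groups = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(groups)


def reduce_phase(key, values):
    """Reducer (hypothetical): combine all values that belong to one key."""
    return key, sum(values)


def run_word_count(records):
    """Run map, shuffle and reduce in sequence; a real cluster would run the
    mappers and the reducers in parallel on different nodes."""
    mapper_outputs = [map_phase(record) for record in records]
    groups = shuffle_phase(mapper_outputs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_word_count(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}
```

In this sketch the grouping done by shuffle_phase stands in for the redistribution across nodes: once the values of a key are collected in one place, each key can be reduced independently of the others.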
In particular, in the Map-Reduce framework, the intermediate results of each mapper are held in memory after they are generated, written to a local disk after the computation, and then copied back into memory for network transmission, which involves many processing steps. In addition, this might result in a huge number of files. For example, if there are M mappers and R reducers, there will be M*R intermediate files stored on the local disks. Usually, each of M and R is a large number, for example 1000, which might result in one million files for a system such as Hadoop or YARN. Furthermore, such a huge number of files might in turn induce issues such as heavy Input/Output (I/O) load.
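The file-count arithmetic above can be sketched as follows. This is a rough illustration that assumes each mapper writes exactly one intermediate file per reducer; the file naming scheme, the CRC32-based partition function and all function names are hypothetical and are not the layout of any specific system.

```python
import zlib


def partition(key, num_reducers):
    """Pick the reducer for a key (assumed scheme: CRC32 modulo reducer count)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def intermediate_file_names(num_mappers, num_reducers):
    """List one hypothetical file name per (mapper, reducer) pair."""
    return [
        f"map_{m}_reduce_{r}.tmp"
        for m in range(num_mappers)
        for r in range(num_reducers)
    ]


def intermediate_file_count(num_mappers, num_reducers):
    """Number of intermediate files when every mapper writes one file per reducer."""
    return num_mappers * num_reducers


if __name__ == "__main__":
    print(len(intermediate_file_names(3, 4)))    # 12
    print(intermediate_file_count(1000, 1000))   # 1000000
```

Because the count is the product of M and R, increasing either the number of mappers or the number of reducers multiplies the number of files that must be created, tracked and read.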