The present invention relates to a distributed file system, and more specifically, to cache management for a MapReduce application based on a distributed file system.
In a distributed file system, the physical storage resources managed by the file system are not necessarily directly connected to a local node; rather, they are connected to the node through a computer network. The distributed file system is designed based on a client/server model. A typical network might include a plurality of servers available for multi-user access.
MapReduce is a software architecture, proposed by Google, for large-scale parallel programming. Because the MapReduce architecture realizes parallel operations on a large-scale dataset (greater than 1 TB), and because it achieves scalability by distributing the operations on the large-scale dataset over a plurality of nodes on the network, a distributed file system is widely used with MapReduce. The concepts “Map” and “Reduce” are functions borrowed from functional programming languages. Implementation of the current MapReduce middleware requires an application developer to assign a Map function, for mapping a group of key-value pairs into new key-value pairs called “intermediate key-value pairs”, and to designate a Reduce function to process the intermediate key-value pairs that result from the Map function.
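By way of illustration only, the division of labor between a Map function and a Reduce function can be sketched in a few lines of Python. This is a single-process sketch of the programming model, not of any particular MapReduce middleware; the names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the word-count task itself, are assumptions chosen for the example.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: turn one input key-value pair into intermediate key-value pairs.

    Here the input value is a line of text and each word yields (word, 1).
    """
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: process all intermediate values that share one key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


# Input records are (line number, line text) pairs.
records = [(0, "map reduce map"), (1, "reduce cache")]
counts = run_mapreduce(records, map_fn, reduce_fn)
```

In an actual deployment the grouping step would be performed by the middleware across a plurality of nodes; the sketch keeps it in one process so that the roles of the two developer-supplied functions remain visible.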
A typical distributed file system stores partitioned file blocks on a plurality of computing nodes, and replicates each file block into a plurality of copies saved on different computing nodes. For a computation that requires repetitive iteration, the computing results of each iteration performed by MapReduce are written into the storage medium of the distributed file system, and then read out from the storage medium as the input data for the next iteration. As a result, the read/write operations for file blocks on multiple computing nodes will inevitably generate network overhead for file transfer, and result in computational delay.
Existing MapReduce architecture-based distributed file systems, e.g., Main Memory Map Reduce (M3R) and Apache™ Spark™, modify the existing MapReduce mechanism on an Apache™ Hadoop® basis, such that all Map task threads and Reduce task threads of a processing job share the memory space of one process, with the data being read into memory at one time. This enables subsequent processing to operate directly in memory, avoiding frequent accesses to the storage medium of the distributed file system, and replacing the storage medium accesses with memory operations. However, once a Map task or Reduce task of the job fails and needs to be re-executed, all remaining Map tasks and Reduce tasks for that job will need to be re-executed as well, consuming considerable computing resources.
Other MapReduce architecture-based distributed file systems, for example, the Tachyon and Redis systems, provide cache memory management. In these systems, a MapReduce job's Mapper processing results can be cached in the cache memory managed by Tachyon or Redis, and subsequent iteration computations can directly read the data needed for computation from the cache memory. However, in the Tachyon system and the Redis system, the data in the storage medium of the distributed file system is read into the cache memory according to a preset cache slice size, and an intermediate computation result of each iteration is written into the cache memory according to the preset cache slice size. Different preset cache slice sizes can cause discrepancies in read performance. In the case that the set cache slice size is relatively large, the data read speed is likely to be slower than reading from the storage medium, and the cache memory allocated for each Mapper will become greater, thereby restricting the number of Mappers that can run simultaneously, which further affects the performance. In the case that the set cache slice size is relatively small, data needs to be read from the storage medium more often. Because the files in the distributed file system must then be opened and closed multiple times, a greater processing delay is generated. Moreover, if an insufficient number of Mappers execute simultaneously, part of the cache memory can sit idle, which wastes cache resources.
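The trade-off described above can be made concrete with a simple arithmetic sketch. It rests on simplifying assumptions that are not stated in the text: each running Mapper is allocated exactly one cache slice, and each slice-sized read from the storage medium costs one open/close of a file. The function names and the byte figures are hypothetical and serve only to illustrate the direction of each effect.

```python
MIB = 1024 ** 2


def max_concurrent_mappers(total_cache_bytes, slice_bytes):
    """Upper bound on simultaneous Mappers if each holds one cache slice."""
    return total_cache_bytes // slice_bytes


def reads_per_split(split_bytes, slice_bytes):
    """Slice-sized reads (file open/close cycles) needed for one input split."""
    return (split_bytes + slice_bytes - 1) // slice_bytes  # ceiling division


def idle_cache_bytes(total_cache_bytes, slice_bytes, running_mappers):
    """Cache left unused when fewer Mappers run than the cache could serve."""
    return max(0, total_cache_bytes - slice_bytes * running_mappers)


total_cache = 4096 * MIB  # hypothetical cache budget
split = 512 * MIB         # hypothetical input split per Mapper

# Relatively large slice: few reads per split, but few Mappers fit in cache.
large = (max_concurrent_mappers(total_cache, 1024 * MIB),
         reads_per_split(split, 1024 * MIB))

# Relatively small slice: many Mappers fit, but each split needs many reads.
small = (max_concurrent_mappers(total_cache, 64 * MIB),
         reads_per_split(split, 64 * MIB))

# Small slice with only 16 Mappers running: most of the cache stays idle.
idle = idle_cache_bytes(total_cache, 64 * MIB, running_mappers=16)
```

Under these assumed figures, the large slice admits 4 simultaneous Mappers with 1 read per split, the small slice admits 64 Mappers with 8 reads per split, and running only 16 Mappers with the small slice leaves 3072 MiB of cache idle.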
Therefore, it is advantageous to configure, for a MapReduce application based on a distributed file system, a cache size that allows the data of a MapReduce job requiring iterative computations to be cached efficiently, thereby enhancing the utilization of cache memory and shortening processing delays.