MapReduce is software framework proposed by Google for parallel computing on large data sets (greater than 1 TB). The concepts of “Map” and “Reduce” as well as the main ideas thereof are borrowed from functional programming languages. Current MapReduce middleware implementations require application developers to specify a Map function to map a set of key-value pairs to some new key-value pairs referred to as intermediate key-value pairs, and also require application developers to specify a Reduce function to further process the intermediate key-value pairs outputted from the Map function. Map invoking partitions the input data into M input data splits automatically, which input data splits can be distributed to multiple machines for parallel processing. Reduce invoking partitions the intermediate keys into R splits (e.g., hash (key) mod R) through a partition function, which splits are also distributed to multiple machines. The number of the partitions R and the partition function may be specified by the user. MapReduce achieves scalability by distributing large-scale operations on data sets to multiple nodes on the network.
Currently, MapReduce is considered an important program design specification for building a data center and already has a very wide range of applications, typically including: distributed grep, distributed sorting, web access log analysis, reverse index construction, document clustering, machine learning, statistics-based machine translation, and so on. In order to fulfil demands on MapReduce processing/generation of large amounts of data, there is a need to build an underlying network architecture applying MapReduce, for example, a new type of data center employing the converged network architecture. However, for a traditional data center, rebuilding an underlying network architecture applying MapReduce will cost a lot of money. FIG. 1 shows a schematic diagram of a network architecture of a traditional data center having a storage network, wherein the network topology of the traditional data center is usually composed of two networks: a local area network (LAN) and a storage network (SAN). SAN is a dedicated high-performance network for transmitting data between various servers and storage resources, avoiding traffic collision problems which usually occur between clients and servers in a traditional message network (such as a TCP/IP network commonly used in LAN). FIG. 2 shows a schematic diagram of MapReduce data transmission in a traditional data center having a storage network according to the prior art. Since the current MapReduce middleware transmits the outputs of a Map task to a Reduce task through the HTTP application layer protocol, there is a need to use the TCP/IP protocol stack, in which case, the outputs of the Map task can only be transmitted to the Reduce task through the low-performance network LAN, resulting in inefficient processing of MapReduce data jobs.
Therefore, there is a need for a method of improving MapReduce data transmission efficiency without changing the hardware architecture of the traditional data center.