Key-based processes are often used in data centers and other clusters or groups of computing entities where distributed processes are carried out.
Data centers and other clusters of computing entities are increasingly available and are used to carry out computationally intensive processes, typically by distributing those processes over many computing entities in order to share the workload. For example, a large input data set may be processed at a data center by dividing it between hundreds or thousands of servers so that each server contributes to the task of processing the whole data set. To manage this division of labor effectively, the data set must be divided in an appropriate manner, and the results produced at the individual servers must be combined appropriately to give accurate results. One approach has been to use key-based processes, which are processes for data-parallel computation that use key-value pairs. Key-value pairs provide a framework for taking a task, breaking it up into smaller tasks, distributing those to many computing entities for processing, and then combining the results to obtain the output. For example, in a process to count the frequency of each different word in a corpus of documents, a key may be a word and a value may be an integer representing the frequency of that word in the corpus. The keys enable intermediate results from the smaller tasks to be aggregated appropriately in order to obtain a final output.
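By way of illustration only, the word-frequency example may be sketched in Python as follows. The function name count_words and the two-document corpus are hypothetical and chosen for this sketch; a single process stands in for the many computing entities, and splitting on whitespace is a simplifying assumption about what counts as a word.

```python
from collections import defaultdict


def count_words(documents):
    """Count word frequencies using (key, value) pairs.

    Key = a word, value = an integer count. Hypothetical helper for
    illustration; not taken from any particular framework.
    """
    # Break the task up: every word occurrence becomes a (word, 1) pair.
    pairs = []
    for doc in documents:
        for word in doc.lower().split():
            pairs.append((word, 1))

    # Combine the results: values that share a key are summed.
    totals = defaultdict(int)
    for word, value in pairs:
        totals[word] += value
    return dict(totals)


# Hypothetical two-document corpus.
corpus = ["the cat sat", "the dog sat"]
counts = count_words(corpus)
```

Because every intermediate result is tagged with its key, the summation step does not need to know which document, or which computing entity, produced a given pair.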
Key-based processes for use with huge data sets distributed over hundreds or thousands of servers are increasingly being used as a data processing platform. These types of key-based processes typically comprise a map phase and a reduce phase. During the map phase each server applies a map function to local chunks of an input data set in parallel. During the reduce phase a plurality of reducers work in parallel to combine the results of the map phase to produce an output, and all outputs of the map phase that share the same key are presented to the same reducer.
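The map phase and reduce phase described above may be sketched, under stated assumptions, as follows. The names map_fn, partition, reduce_fn and run_job are hypothetical; the phases run sequentially in one process rather than in parallel across servers; and routing keys to reducers with a CRC32 checksum is one possible choice assumed for this sketch, not a requirement of key-based processes.

```python
import zlib
from collections import defaultdict


def map_fn(chunk):
    """Map function a server applies to one local chunk of the input."""
    return [(word, 1) for word in chunk.split()]


def partition(key, num_reducers):
    """Choose a reducer for a key; equal keys always get the same reducer.

    CRC32 is an assumption of this sketch; any deterministic function
    of the key would serve.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Combine all map outputs that share one key."""
    return key, sum(values)


def run_job(chunks, num_reducers):
    # Map phase: conceptually parallel, executed in sequence here.
    map_outputs = [map_fn(chunk) for chunk in chunks]

    # Route every (key, value) pair to the reducer that owns its key.
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            reducer_inputs[partition(key, num_reducers)][key].append(value)

    # Reduce phase: each reducer combines the values for its keys.
    result = {}
    for grouped in reducer_inputs:
        for key, values in grouped.items():
            reduced_key, reduced_value = reduce_fn(key, values)
            result[reduced_key] = reduced_value
    return result


# Hypothetical input: two chunks, three reducers.
result = run_job(["a b a", "b c"], num_reducers=3)
```

Since partition depends only on the key, every map output for a given key arrives at a single reducer, which is what allows the reducers to operate independently of one another.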
There is an ongoing need to improve the speed, efficiency and accuracy of operation of these types of key-based processes on data centers or other clusters of computing entities.
The embodiments described below are not limited to implementations which solve any or all of the disadvantages of known systems and methods for supporting key-based processes.