1. Technical Field
This invention pertains in general to distributed computations, and in particular to load-balancing distributed computations.
2. Description of Related Art
MapReduce is arguably the most popular modern cluster-computing paradigm. Despite its popularity, conventional MapReduce implementations suffer from a fundamental data imbalance problem. Each data item to be processed using MapReduce comprises structured data in (key, value) pairs. During a conventional MapReduce process, items are grouped by the hash value of the data key of each item. Hash functions produce very even distributions among groups when the number of items with the same data key is fairly small compared to the total number of items. However, when the number of items with the same key is fairly large compared to the total number of items, hash functions produce uneven distributions among groups. The MapReduce framework assigns items with the same key to be processed by the same processing unit. Therefore, an uneven distribution among the groups results in some processing units processing more items than other processing units. This is commonly referred to as a “load imbalance” between the processing units, and if left unaddressed, it leads to wasted resources and time.
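By way of illustration only, the following minimal Python sketch shows how grouping items by the hash of the data key can concentrate load on one processing unit. The function names, the use of CRC32 as the hash function, the number of processing units, and the sample data are all assumptions made for this example and are not taken from any particular MapReduce implementation.

```python
import zlib
from collections import Counter

def partition(key: str, num_units: int) -> int:
    """Map a data key to a processing unit by hashing (CRC32 is an assumed choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_units

def unit_loads(items, num_units):
    """Count how many (key, value) items each processing unit receives."""
    loads = Counter()
    for key, _value in items:
        loads[partition(key, num_units)] += 1
    return [loads[u] for u in range(num_units)]

NUM_UNITS = 10  # hypothetical cluster size

# Hypothetical data set A: 1000 items, each with a distinct key.
distinct = [(f"key{i}", i) for i in range(1000)]

# Hypothetical data set B: 1000 items, half of which share the key "hot".
skewed = [("hot", i) for i in range(500)] + [(f"key{i}", i) for i in range(500)]

distinct_loads = unit_loads(distinct, NUM_UNITS)
skewed_loads = unit_loads(skewed, NUM_UNITS)

print("distinct keys:", distinct_loads)
print("one hot key:  ", skewed_loads)
```

In the second data set, every item with the key "hot" hashes to the same processing unit, so that unit receives at least half of all items regardless of how many units are available.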
Consider the example of grouping data records of all the computers on the internet by country in 1998, where country is used as the data key. The group corresponding to the USA would make up about 50% of all of the items. Since all of the items with the same key are processed by the same processing unit, the processing unit assigned to the group corresponding to the USA will be overloaded. As a consequence, the processing of the USA data is the rate-limiting step in completing the MapReduce job. Even with 100 processing units, the MapReduce job will complete no faster than it would with 2 processing units, one for the USA data key and the other for the data keys corresponding to all the other countries in the world.
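The arithmetic behind this example can be sketched as follows. The function name and the simplifying assumptions (every item costs the same to process, and scheduling and communication overhead are ignored) are introduced here for illustration only.

```python
def best_case_speedup(largest_key_fraction: float, num_units: int) -> float:
    """Upper bound on speedup over a single processing unit.

    Assumes equal cost per item and no overhead. Because all items with
    the same key go to one unit, the busiest unit handles at least the
    largest key group, and at least 1/num_units of the total work.
    """
    busiest_share = max(largest_key_fraction, 1.0 / num_units)
    return 1.0 / busiest_share

# A key holding 50% of the items caps the speedup at 2x,
# whether 2 or 100 processing units are available.
print(best_case_speedup(0.5, 2))
print(best_case_speedup(0.5, 100))
```

Under these assumptions, adding processing units beyond the second does not shorten the job, because the unit holding the dominant key still has to process half of the items.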
The example above illustrates a load-imbalance problem in a MapReduce job that is left for computer programmers to solve when it arises. First, the programmer needs to recognize that an imbalance is occurring between processing units, and second, the programmer needs to intervene to direct some data associated with the popular keys to other processing units. However, a solution that a programmer develops for a unique data situation is often very human-resource intensive, can be error-prone, and is not robust to changes in the data over time. If the input data changes, it may cause imbalance in another way (e.g., another key becomes more popular, adding to the workload of a different processing unit), and the programmer's previously implemented solution may in fact exacerbate the new imbalance.