Some embodiments of the present disclosure are directed to an improved approach for implementing load balancing using progressive sampling to achieve load balancing quality targets.
Good parallel program scalability derives from being able to balance workloads evenly across all available computational resources. In the context of massive data stores and retrieval systems (e.g., Hadoop), a MapReduce job can be managed under a parallel programming paradigm where programmers specify:
    a map task (e.g., a Hadoop map function) running on a computer, which map task processes input key/value pairs and generates an intermediate set of key/value pairs; and
    a reduce task (e.g., a Hadoop reduce function) running on a computer, which reduce task processes the intermediate key/value pairs and generates an output set of key/value pairs.
In this paradigm, the MapReduce job can parallelize flows such that a flow applies a map function to every input record and then runs the reduce function once for each distinct intermediate key. Generally, the overall execution time of a MapReduce job is determined by the slowest single flow. An ideal parallelized assignment is one where all flows complete their workload at the same time; that is, there is zero or near-zero skew between the completion times of the flows.
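By way of illustration only, the paradigm described above can be sketched in a few lines of Python. This is a minimal single-process sketch and not Hadoop code: the names `map_task`, `reduce_task`, `run_job`, and `completion_skew` are hypothetical, the word-count workload is an assumed example, and skew is taken here to mean the gap between the slowest and fastest flow.

```python
from collections import defaultdict


def map_task(record):
    """Map step: turn one input record into intermediate (key, value) pairs."""
    return [(word, 1) for word in record.split()]


def reduce_task(key, values):
    """Reduce step: called once per distinct intermediate key."""
    return key, sum(values)


def run_job(records):
    """Apply map_task to every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_task(record):
            groups[key].append(value)
    return dict(reduce_task(key, values) for key, values in groups.items())


def completion_skew(flow_times):
    """Return (job time, skew) for a list of per-flow completion times.

    The job finishes only when its slowest flow finishes, so the job time
    is the maximum. Skew is assumed here to be max minus min.
    """
    slowest = max(flow_times)
    return slowest, slowest - min(flow_times)


counts = run_job(["a b a", "b c"])            # {'a': 2, 'b': 2, 'c': 1}
job_time, skew = completion_skew([10, 12, 30])  # (30, 20)
```

In the second call, two of the three flows sit idle while the third finishes, which is the situation an ideal assignment avoids.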
One approach is to statically assign workload to the reducers. However, reducer data skew can occur when too many reduce-keys are assigned to the same reduce task. Moreover, the nature of the workload may not be known a priori, so any static assignment of a reduce-key is merely a guess, and the skew might turn out to be quite significant (e.g., in the case of a bad guess of a static assignment). Other legacy approaches, such as performing hashing to assign a workload to the reducers, assume a near-constant workload per work item, which assumption might turn out to be egregiously wrong. Moreover, hashing as a partitioning strategy does not work in applications such as sorting, which require alternate techniques such as range partitioning.
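The shortcomings described above can be made concrete with a small Python sketch. Everything in it is an assumption chosen for illustration: the keys are integers, a simple modulo stands in for the hash function, the per-key costs in `KEY_COSTS` are invented, and the range boundary is hand-picked rather than derived from the data.

```python
from bisect import bisect_right


def hash_partition(key, num_reducers):
    """Stand-in hash partitioner: modulo on integer keys (an assumption)."""
    return key % num_reducers


def range_partition(key, boundaries):
    """Range partitioner: reducer i receives keys below boundaries[i]."""
    return bisect_right(boundaries, key)


def reducer_loads(key_costs, assign, num_reducers):
    """Sum the per-key cost that lands on each reducer."""
    loads = [0] * num_reducers
    for key, cost in key_costs.items():
        loads[assign(key)] += cost
    return loads


# Invented workload: key 0 is one hundred times costlier than the others.
KEY_COSTS = {0: 100, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1}

# Hashing gives each reducer three keys, yet the loads differ widely.
hash_loads = reducer_loads(KEY_COSTS, lambda k: hash_partition(k, 2), 2)
# hash_loads == [102, 3]

# Hashing interleaves keys across reducers; range partitioning keeps them
# ordered, so reducer outputs can be concatenated into a sorted result.
hash_assignment = [hash_partition(k, 2) for k in range(6)]      # [0, 1, 0, 1, 0, 1]
range_assignment = [range_partition(k, [3]) for k in range(6)]  # [0, 0, 0, 1, 1, 1]
```

The hash assignment balances the number of keys but not the work, because it implicitly treats every key as equally expensive. The range assignment preserves key order, but its quality depends entirely on where the boundaries are placed.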
The aforementioned approaches do not support load balancing using progressive sampling to achieve load balancing quality targets. Therefore, there is a need for an improved approach.