Integrating data from a plurality of data sources may produce large data sets that need to be managed efficiently and effectively. However, conventional methods of integrating large data sets encounter performance barriers because of the size of the data sets, which leads to relatively long processing times and relatively heavy use of computer resources.
Several newer techniques based on the MapReduce framework have been proposed to parallelize the integration process and reduce long processing times. In the MapReduce framework, data sets are partitioned into several blocks of data using keys assigned by map task operations, and the blocks are allocated to reduce task operations that execute in parallel.
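By way of illustration only, the key-based partitioning described above may be sketched as follows. This is a minimal single-process sketch, not an implementation of any particular MapReduce system. The function names (map_task, partition, run_job), the use of a CRC32 hash to assign keys to reduce tasks, and the summing reduce operation are all assumptions chosen for the example.

```python
import zlib
from collections import defaultdict


def map_task(record):
    """Hypothetical map task: emit (key, value) pairs for one input record."""
    key, value = record
    return [(key, value)]


def partition(key, num_reducers):
    """Assign a key to a reduce task using a stable hash (assumed: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(records, num_reducers):
    """Partition mapped pairs into blocks by key, then reduce each block."""
    # One block of data per reduce task; every value for a given key
    # lands in the same block.
    blocks = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for key, value in map_task(record):
            blocks[partition(key, num_reducers)][key].append(value)

    # In a distributed environment each block would be handled by a separate
    # reduce task in parallel; here the blocks are processed one after another.
    result = {}
    for block in blocks:
        for key, values in block.items():
            result[key] = sum(values)  # assumed reduce operation: sum
    return result
```

For example, calling run_job([("a", 1), ("b", 2), ("a", 3)], 3) groups both "a" records into the same block regardless of how many reduce tasks are configured, because the block is determined solely by the key.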
A common problem with the MapReduce framework is data skew, which occurs when the workload is non-uniformly distributed across reduce tasks. When data skew occurs, the computer resources that process one reduce task receive a relatively large share of the workload and require a relatively long processing time to complete that task compared to the computer resources that process other reduce tasks, which diminishes the benefits of parallelization.
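By way of illustration only, data skew may be quantified by comparing the load of the most heavily loaded reduce task to the mean load. The sketch below is a simplified example under stated assumptions: the function names (reducer_loads, skew_ratio), the CRC32-based key assignment, and the choice of maximum-to-mean ratio as the skew measure are hypothetical and are not taken from the present disclosure.

```python
import zlib
from collections import Counter


def reducer_loads(keys, num_reducers):
    """Count how many records each reduce task would receive."""
    counts = Counter(
        zlib.crc32(key.encode("utf-8")) % num_reducers for key in keys
    )
    return [counts.get(r, 0) for r in range(num_reducers)]


def skew_ratio(loads):
    """Ratio of the heaviest load to the mean load (1.0 means balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0
```

If, for instance, 90 of 100 records share a single key and four reduce tasks are used, the reduce task that receives that key holds at least 90 records while the mean load is 25, so the ratio is at least 3.6. The other reduce tasks finish early and sit idle while the overloaded one continues processing.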
Thus, embodiments of the present disclosure relate to dynamic partitioning of tasks in a distributed computing environment to improve data processing speed.