Distributed computing is a form of computing, generally carried out by application programs or “code,” in which many calculations are performed simultaneously, often on a large body of “data,” on the premise that large problems can often be divided into smaller problems that can be solved concurrently (“in parallel”) for efficiency. To accomplish this parallelism, distributed computing uses multiple autonomous computers (or processors): the problem is divided into many sub-problems, each of which is solved by one or more of the autonomous computers (or nodes) in a cluster of computers. To perform computations on very large problems or datasets, distributed computing clusters comprising tens, hundreds, or even thousands of autonomous computers may be utilized.
Modern distributed execution engines (such as MapReduce, Hadoop, and Dryad) and their corresponding high-level programming languages (Pig, Hive, and DryadLINQ) have done much to simplify the development of large-scale, distributed data-intensive applications. In all of these systems, execution parallelism is controlled through data partitioning, which in turn provides the means necessary to achieve a high level of scalability of distributed execution across large computer clusters. Thus, the performance of data-parallel computing depends heavily on the effectiveness of data partitioning.
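By way of illustration only, the relationship between data partitioning and execution parallelism might be sketched as follows. This is a minimal single-machine sketch, not the partitioning mechanism of any engine named above: the function names (hash_partition, count_words, parallel_word_count) are hypothetical, a CRC-32 hash stands in for whatever partition function an engine would use, and a local thread pool stands in for the nodes of a cluster.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def hash_partition(records, num_partitions, key=lambda r: r):
    """Assign each record to one of num_partitions buckets by hashing its key."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        # crc32 is stable across runs, unlike Python's salted built-in hash().
        index = zlib.crc32(str(key(record)).encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return partitions


def count_words(partition):
    """Work performed independently on a single partition."""
    return Counter(partition)


def parallel_word_count(words, num_partitions):
    """Partition the input, process the partitions concurrently, then merge."""
    partitions = hash_partition(words, num_partitions)
    # Each worker thread is a stand-in (assumption) for one cluster node.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    words = ["a", "b", "a", "c", "b", "a"]
    print(parallel_word_count(words, num_partitions=3))
```

In this sketch the number of partitions bounds the degree of parallelism, and because all records sharing a key land in the same partition, each partial result can be computed without coordination between workers.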
However, current data partitioning techniques are often simplistic and can lead to unintended performance degradations or job failures. Many of the known techniques, originally developed for database applications, are ill-suited for the complex user-defined functions and data models common to data-parallel computing. When partitioning data to enable parallel computations on multiple computers, the initial challenge is determining which partition function to use and how many data partitions to generate. The wrong choices, or even the best choices from among limited options, can result in highly skewed workloads, leading to poor performance in which some machines complete in minutes while others run for hours. Consequently, the efficiency of the entire distributed processing system is constrained by the least efficient partition from among all of the partitions created. As distributed execution systems become increasingly used for more complex applications, such as large-scale graphing applications to detect botnets or analyze large-scale scientific data, the lack of effective and efficient partitioning schemes for distributed execution engines has become a major performance liability.
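The effect of a skewed workload can be illustrated with a small sketch. The sketch rests on simplifying assumptions that are not drawn from any particular system: every record costs the same fixed time to process, each partition runs on its own machine, and the names partition_loads, job_time, and skew_ratio are hypothetical. Under those assumptions the job finishes only when the most heavily loaded partition finishes.

```python
import zlib


def partition_loads(keys, num_partitions):
    """Count how many records a hash partitioner sends to each partition."""
    loads = [0] * num_partitions
    for key in keys:
        loads[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return loads


def job_time(loads, seconds_per_record=1.0):
    """Completion time when partitions run in parallel: the slowest one decides."""
    return max(loads) * seconds_per_record


def skew_ratio(loads):
    """Largest partition divided by the mean partition size (1.0 is perfectly even)."""
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    # One "hot" key accounts for 900 of 1000 records (illustrative data only).
    keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]
    loads = partition_loads(keys, num_partitions=4)
    ideal = sum(loads) / len(loads)  # 250 records per partition if perfectly even
    print("loads per partition:", loads)
    print("job time:", job_time(loads), "vs. ideal:", ideal)
    print("skew ratio:", skew_ratio(loads))
```

Because every record with the hot key hashes to the same partition, that partition holds at least 900 of the 1000 records regardless of how many partitions are created, so adding machines does not shorten the job under this partition function.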