The term “Big Data” typically refers to datasets too large for processing in a traditional manner on a single computer. Instead, the analysis of such datasets may require distributing tasks across multiple computing devices, or nodes. These nodes may then efficiently execute the distributed tasks in parallel. But a non-uniform distribution of tasks across nodes may dramatically reduce the efficiency of this approach. Such a non-uniform distribution of tasks may arise when processing data with a non-uniform (or skewed) distribution of values.
For example, a dataset of customer accounts may include a skewed distribution of customer ages, due to the difference in population between different generations. This skewed distribution may inhibit efficient analysis of the dataset. For example, when the customer accounts are grouped by customer age into decades, the resulting decade groups will contain unequal numbers of accounts. The time required to complete an analysis of the dataset according to this decade grouping would then be limited by the time required to analyze the largest group.
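The effect of the decade grouping described above may be illustrated with a minimal Python sketch. The ages, the function names `decade_group_sizes` and `parallel_efficiency`, and the cost model (one node per decade group, analysis time proportional to the number of accounts in the group) are hypothetical assumptions chosen for illustration only and do not come from any particular system.

```python
from collections import Counter


def decade_group_sizes(ages):
    """Count accounts per decade of customer age (e.g. age 34 falls in decade 30)."""
    return Counter((age // 10) * 10 for age in ages)


def parallel_efficiency(group_sizes):
    """Compare a perfectly balanced run with a run that uses one node per group.

    Assumption: the time to analyze a group is proportional to its size,
    so the parallel run finishes only when the largest group finishes.
    """
    sizes = list(group_sizes.values())
    slowest_node = max(sizes)           # time actually needed
    balanced = sum(sizes) / len(sizes)  # time needed if the work were spread evenly
    return balanced / slowest_node


# Hypothetical, deliberately skewed customer ages.
ages = [25] * 100 + [35] * 600 + [45] * 200 + [65] * 100

sizes = decade_group_sizes(ages)
efficiency = parallel_efficiency(sizes)

print(dict(sizes))
print(f"efficiency relative to a balanced allocation: {efficiency:.2f}")
```

Under these assumptions, the node assigned the 30-to-39 decade group processes 600 of the 1,000 accounts, so the remaining nodes may sit idle for most of the run.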
Existing methods may attempt to improve efficiency by monitoring nodes during execution of an analysis. These methods may rely on detecting and stopping long-running tasks. The stopped tasks may be divided into smaller tasks and redistributed among the nodes. While such monitoring may be performed automatically, such automatic methods are experimental, unstable, and difficult to apply to some calculations. Manual methods require updating software instructions by hand, and are therefore tedious, complicated, and error-prone. Other existing methods simply iterate trial analyses, updating task allocations until an efficient allocation is discovered. But the duration of each trial analysis may vary from minutes to hours, rendering this approach unpredictable and inefficient. A need therefore exists for improved dynamic skew compensation for parallel processing of large datasets.