Computing systems and associated networks have revolutionized the way human beings work, play, and communicate. Nearly every aspect of our lives is affected in some way by computing systems. Computing systems are particularly adept at processing data. When processing data for which a schema is applied on read rather than on write (often referred to simply as “big data”), and which might itself be distributed across multiple network nodes, it is often most efficient to divide data processing amongst the various network nodes. These divisions of logical work are often referred to as “vertices” in the plural, or a “vertex” in the singular. Not only does this division allow for the efficiencies of parallel processing, but it also allows the data being processed to be closer to the network node that is to process that portion of the data.
One common programming model for performing such parallelization is the map-reduce programming model. In the mapping phase, data is divided by key (e.g., along a particular dimension of the data). In the reduce phase, the overall task is then divided into smaller portions that can be performed by each network node, such that the intermediate results obtained thereby can then be combined into the final result of the overall job. Many big data analytical solutions build upon the concept of map-reduce.
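By way of illustration only, the map-reduce model described above can be sketched as a small in-process word count. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample partitions are hypothetical and are not drawn from any particular framework; a real system would run each partition on a separate network node.

```python
from collections import defaultdict


def map_phase(partition):
    """Emit a (key, 1) pair for every word in one node's partition."""
    return [(word, 1) for line in partition for word in line.split()]


def shuffle(mapped_outputs):
    """Group the intermediate values by key across all partitions."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the intermediate values for each key into a final result."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input: three partitions, as if held by three network nodes.
partitions = [["a b a"], ["b c"], ["a"]]
result = reduce_phase(shuffle(map_phase(p) for p in partitions))
print(result)
```

In this sketch the work for each key can proceed independently, which is what makes the model amenable to parallel execution across nodes.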
One problem often encountered in big data analytics is uneven distribution of data along a given dimension, referred to as “data skew”. Data skew is a common occurrence in big data sets and can arise from natural data distributions and/or poor query formulation. The amount and characteristics of data skew may change over time as the data itself changes. Data skew can result in some network nodes taking much more time than others to perform their respective tasks, thereby introducing bottlenecks in the distributed parallel processing and delaying completion of the overall job. When the source of the data skew is identified, the data skew can be corrected by reallocating the distribution of processing amongst the data and/or by correcting code that has difficulty dealing with the data skew. However, identifying the source of data skew is quite difficult, even for an experienced developer.
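By way of illustration only, the effect of data skew on per-node load can be sketched as follows. The sketch assumes records are assigned to nodes by hashing a key, and it uses the ratio of the heaviest node's load to the mean load as a simple skew indicator; both the assignment scheme and the metric are assumptions made for this example, as are the names `node_for_key`, `partition_loads`, and `skew_ratio`.

```python
import zlib
from collections import Counter


def node_for_key(key, num_nodes):
    """Assign a key to a node using a deterministic hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition_loads(keys, num_nodes):
    """Count how many records land on each node."""
    loads = Counter({node: 0 for node in range(num_nodes)})
    for key in keys:
        loads[node_for_key(key, num_nodes)] += 1
    return loads


def skew_ratio(loads):
    """Ratio of the heaviest node's load to the mean load (1.0 means even)."""
    mean = sum(loads.values()) / len(loads)
    return max(loads.values()) / mean


# Hypothetical data set in which one key dominates the chosen dimension.
keys = ["us"] * 90 + ["fr"] * 4 + ["de"] * 3 + ["jp"] * 3
loads = partition_loads(keys, num_nodes=4)
ratio = skew_ratio(loads)
print(dict(loads), ratio)
```

Because every record sharing the dominant key is routed to the same node, that node receives at least 90 of the 100 records regardless of how the remaining keys are placed, and the other nodes finish early and then wait for it.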
The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.