Determining a data distribution has practical application in a wide variety of fields. One such application is data partitioning, wherein a data distribution is determined to divide data into data partitions, or non-overlapping subsets, for efficient parallel processing or for other tasks. The independent ranges of the data partitions minimize the need for synchronization and concurrency control mechanisms, helping to reduce overhead during parallel processing. Additionally, if the input data can be divided into approximately equally sized, or balanced, data partitions, better load balancing can be achieved across various parallel processing resources, such as server nodes and processor cores. Further, if the partition sizes can be limited to meet particular hardware specifications, for example processor cache sizes or available memory, then the parallel processing can be accelerated further by avoiding cache misses or disk swapping. In some cases, a hard partition size limit may be a functional requirement, for example for in-memory processing nodes.
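As a minimal illustrative sketch only, and not any approach described herein, the idea of cutting non-overlapping, approximately balanced partitions from a sorted key space can be expressed as follows; the function name and the choice of Python are assumptions for illustration:

```python
def range_partitions(keys, num_parts):
    """Split keys into contiguous, non-overlapping partitions.

    Sorting first guarantees each partition covers an independent
    key range; integer arithmetic keeps the partition sizes within
    one element of each other (approximately balanced).
    """
    keys = sorted(keys)
    n = len(keys)
    parts = []
    start = 0
    for i in range(num_parts):
        end = (i + 1) * n // num_parts  # cumulative cut point
        parts.append(keys[start:end])
        start = end
    return parts
```

Note that this sketch requires sorting the entire input, which is exactly the kind of whole-data access that becomes impractical for big data sets, as discussed below.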
In the context of enterprise databases, high performance computing (HPC), and other data-intensive applications, scaling is most readily achieved by interconnecting resources such as processor cores and server nodes. Database operations such as table joining, sorting, aggregation, and other tasks can utilize data partitioning to distribute the processing workload evenly across processing threads running on the available resources. In this manner, each processing thread can process its assigned workload in a non-blocking manner and finish at approximately the same time, optimizing performance and minimizing idle time spent waiting for other threads to finish. Accordingly, a quick and accurate determination of a data distribution has particular application in the field of databases and in other computing fields that require scaling to a large number of resources.
Challenges arise when the input data to be processed is a big data set, for example when the input data includes billions or more records. In this case, approaches that require access to the entire input data at once to determine the data distribution, such as sorting the input data, may be impractical. Sampling techniques have been proposed, which allow analysis to proceed with only a smaller sample of the input data. For example, a histogram may be generated for only a sample of the input data, with data partitions created based on the histogram. However, to produce a histogram that accurately represents the input data as a whole, the sample must be sufficiently large. Thus, even sampling techniques may be impractical for big data sets, since the minimum size of an effective sample grows in tandem with the size of the input data.
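The sampling approach above can be sketched as follows. This is a hedged illustration with hypothetical names, not a method described herein: partition boundaries are estimated from the quantiles of a random sample rather than from a full sort of the input, and the estimate is only as good as the sample is representative:

```python
import random

def sample_partition_boundaries(data, num_partitions, sample_size, seed=0):
    """Estimate partition boundary values from a random sample.

    Only the sample is sorted, not the full input; the boundaries
    are read off at the sample's quantile positions. Accuracy
    degrades if the sample is too small relative to the input.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(data, min(sample_size, len(data))))
    return [sample[i * len(sample) // num_partitions]
            for i in range(1, num_partitions)]
```

For roughly uniform data, a modest sample yields boundaries near the true quantiles, but as noted above, skewed or very large inputs demand proportionally larger samples.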
Additionally, the input data to be analyzed may include any kind of data distribution. For ideal load balancing, the determination of the data distribution should be sufficiently granular to enable the creation of approximately evenly sized data partitions. If the input data is non-uniform with high skew, a histogram may provide only coarse data distribution information. Sufficient granularity may be provided by using a large number of histogram buckets, but this approach may impose an unacceptably high processing and resource burden, especially for input data that exhibits skew over a large dynamic range.
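The coarseness problem can be illustrated with a short hedged Python sketch (hypothetical names, not a method described herein): with a fixed number of equal-width buckets, highly skewed data collapses into a single bucket, so the histogram reveals almost nothing about where to cut balanced partitions:

```python
def equal_width_histogram(data, num_buckets):
    """Count values into equal-width buckets spanning [min, max]."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for x in data:
        # Clamp the top edge into the last bucket.
        idx = min(int((x - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts

# Skewed input: many small values plus a few extreme outliers.
skewed = [1] * 990 + [1_000_000] * 10
counts = equal_width_histogram(skewed, 10)
# Nearly all records land in the first bucket; the eight middle
# buckets are empty, so the histogram cannot guide balanced cuts.
```

Narrowing the buckets enough to resolve the dense region would require vastly more buckets when the value range is large, which is the processing and resource burden noted above.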
Based on the foregoing, there is a need for a method to efficiently determine a data distribution for big data sets having potentially high skew.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.