An increasing number of computing applications, particularly within the enterprise, entail analyzing distributed data. There are various ways in which data may be partitioned for distributed analysis. Algorithms may explicitly partition data into multiple chunks, as is common in “divide and conquer” algorithms. Alternatively, data may originate in a distributed manner. For example, data may originate from user uploads to an array of computing nodes, such as pictures posted to a social networking site running in a server farm. In another example, data may be event data obtained by monitoring and recording events that occur on each of a plurality of computing nodes. Such events may include, for example, disk accesses, network traffic, application events, etc.
Distributed data may be analyzed to identify trends, generate reports, search for specific records, etc. Often, such data analysis includes the calculation of order statistics on a collection of real numbers. One type of order statistic is a quantile, such as the median or the nth percentile. Quantiles may be used to answer the question “what test score is greater than 90% of all other test scores?” Quantiles may also be used to answer the inverse question “what is the percentile rank of this given test score?” Other examples of order statistics include a most frequent data value such as a consensus value, a histogram of data distribution, and range queries.
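The two quantile questions above can be illustrated with a minimal sketch. The function names, the nearest-rank definition of a percentile, the "strictly less than" definition of percentile rank, and the sample scores are all assumptions chosen for illustration; other conventions exist and give slightly different answers.

```python
import math

def quantile(values, percent):
    """Nearest-rank percentile: the smallest value with at least
    `percent` percent of the data at or below it (assumed convention)."""
    ordered = sorted(values)
    # 1-based nearest rank, clamped so that percent=0 maps to the minimum.
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]

def percentile_rank(values, score):
    """Percentage of values strictly below `score` (assumed convention)."""
    below = sum(1 for v in values if v < score)
    return 100.0 * below / len(values)

# Hypothetical test scores.
scores = [55, 62, 70, 71, 78, 80, 84, 88, 93, 97]

p90 = quantile(scores, 90)              # score at the 90th percentile
median = quantile(scores, 50)           # the median is the 50th percentile
rank_of_84 = percentile_rank(scores, 84)  # percentile rank of a given score
```

Both functions need the whole collection in one place, which is the limitation that matters once the data is large or distributed.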
One way to calculate exact order statistics is to sort the collection of numbers. Then, for example, the median may be found by iterating halfway through the sorted list. However, this method becomes prohibitively expensive in terms of memory usage and computation time for very large data sets. A better method would be to store only unique numbers and the count for each unique number. However, even this improved method becomes impractical when the cardinality of the data set is high; that is, when there are many distinct values, which is very common for numerical data. Moreover, distributing these calculations does not alleviate the problem of the prohibitively large data set, because the results of each distributed calculation must still be combined on a single computing node before order statistics can be inferred.
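As a rough sketch of the count-per-unique-value method in a distributed setting, each node could build a local table of counts, and a single node could merge the tables and walk the merged counts in sorted order. The node data and function names below are hypothetical, and the lower median is used for even-sized collections as an assumed convention.

```python
from collections import Counter

def local_counts(values):
    """Per-node summary: one count per distinct value."""
    return Counter(values)

def merge_counts(per_node_counts):
    """Combine the per-node summaries on a single node."""
    total = Counter()
    for counts in per_node_counts:
        total.update(counts)  # adds counts for matching values
    return total

def median_from_counts(counts):
    """Exact (lower) median found by walking distinct values in order."""
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no data")
    target = (n + 1) // 2  # 1-based rank of the lower median
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= target:
            return value

# Hypothetical data held on three nodes.
node_a = [3, 1, 3, 7]
node_b = [3, 9, 1]
node_c = [7, 7, 2]

merged = merge_counts(local_counts(v) for v in (node_a, node_b, node_c))
exact_median = median_from_counts(merged)
distinct_values = len(merged)  # size of the merged table
```

The merged table holds one entry per distinct value, so its size tracks the cardinality of the data rather than the number of nodes, and the merge still has to happen on one machine.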
One method of calculating approximate order statistics is to divide a range of numbers into sub-ranges, count how many numbers fall within each sub-range, and derive order statistics from the counts. However, such techniques provide no bound on the amount of error in the approximation. Accordingly, calculating approximate order statistics in a time- and resource-efficient manner while minimizing error is an ongoing challenge.
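The sub-range counting method can be sketched as follows. The fixed range of 0 to 100, the ten equal-width sub-ranges, the use of a sub-range's midpoint as the estimate, and the sample data are all assumptions made for illustration; values are assumed to lie within the stated range.

```python
def bucket_counts(values, lo, hi, num_buckets):
    """Count how many values fall in each equal-width sub-range of [lo, hi]."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that v == hi lands in the last sub-range.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts

def approx_median(counts, lo, hi):
    """Estimate the median as the midpoint of the sub-range containing it."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no data")
    width = (hi - lo) / len(counts)
    target = (total + 1) // 2  # 1-based rank of the lower median
    seen = 0
    for i, c in enumerate(counts):
        seen += c
        if seen >= target:
            return lo + (i + 0.5) * width

# Hypothetical skewed data: nine values below 1 and one outlier.
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 95]

counts = bucket_counts(values, 0, 100, 10)
estimate = approx_median(counts, 0, 100)
```

In this example the counts only reveal that the median lies somewhere in the first sub-range, so the estimate is 5.0 while the exact lower median is 0.5. How far off the estimate is depends on how the data happens to fall across the fixed sub-ranges, which the counts alone do not reveal.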