Statistical measures play an important role in the analysis of data sets. One general class of such statistical measures consists of the quantiles of a set of data. Quantiles of different ranks can together summarize what data is stored and how it is distributed.
Computers permit rapid evaluation of quantiles of large data sets. While the availability of affordable computer memory (volatile and permanent) is steadily increasing, there continue to be limitations associated with such memory. Typical algorithms will re-order the elements of the data set in place or they will need additional memory that is at least half of the size of the original data set. Several conventional techniques, such as those discussed below, provide various quantile determination algorithms.
Simple and Precise Algorithms.
A typical simple determination algorithm requires sorting the values and then picking the element in the needed position in the array. Such an algorithm needs O(N) space, where N is the number of rows. Assuming, for example, that one datapoint consumes 8 bytes (=64 bits), determining a quantile over N=100 million rows needs 800 MB of temporary memory. Traditional commodity computer hardware can therefore run this type of algorithm in memory only for small inputs; for larger inputs it may have to swap out to a disk. The sorting requires O(N log N) runtime. Once the data is sorted, such an approach can be used to determine several quantiles on the data without extra memory or runtime cost.
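The sort-then-pick approach described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `quantiles_by_sorting` is hypothetical, and the nearest-rank convention (1-based position ceil(q·N)) is an assumption, since the text does not fix a particular quantile definition.

```python
import math

def quantiles_by_sorting(values, ranks):
    """Return one exact quantile per entry of `ranks` (each between 0 and 1)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)  # O(N log N) time, O(N) additional space for the copy
    n = len(ordered)
    result = []
    for q in ranks:
        # Assumption: nearest-rank convention, 1-based position ceil(q * N).
        position = max(1, math.ceil(q * n))
        result.append(ordered[position - 1])
    return result

# Temporary memory for the example in the text: 100 million rows of 8 bytes each.
bytes_needed = 100_000_000 * 8  # 800,000,000 bytes, i.e. 800 MB

data = [7, 1, 5, 3, 8, 2, 6, 4]
print(quantiles_by_sorting(data, [0.25, 0.5, 0.75]))  # [2, 4, 6]
```

After the single sort, each further quantile costs only one index lookup, which is why several quantiles come at no extra memory or runtime cost.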
Selection Algorithms.
Better runtime performance could be achieved by using a “Selection algorithm”, but just like sorting, it will need space proportional to the number of input elements (https://en.wikipedia.org/w/index.php?title=Selection_algorithm&oldid=622007068). Optimizations regarding the needed memory are possible if only a single quantile is requested and that quantile has a very low or very high quantile rank (for example, 0.1 or 0.9).
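As one possible illustration of a selection algorithm, the following sketch uses quickselect with a random pivot. Quickselect is chosen here only as a well-known example; the text does not name a specific selection algorithm, and the function name is hypothetical. The working copy makes the space requirement proportional to N, as noted above, while the expected runtime is linear instead of O(N log N).

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) of `values`."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    candidates = list(values)  # working copy: space proportional to N
    while True:
        pivot = random.choice(candidates)
        smaller = [v for v in candidates if v < pivot]
        equal_count = sum(1 for v in candidates if v == pivot)
        if k < len(smaller):
            # The target lies among the elements below the pivot.
            candidates = smaller
        elif k < len(smaller) + equal_count:
            # The target is the pivot itself (or one of its duplicates).
            return pivot
        else:
            # Discard the pivot and everything below it, then adjust k.
            k -= len(smaller) + equal_count
            candidates = [v for v in candidates if v > pivot]

data = [7, 1, 5, 3, 8, 2, 6, 4]
print(quickselect(data, 3))  # 4, the fourth-smallest element
```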
Lower Bound for Precise Algorithms.
Pohl (I. Pohl, “A Minimum Storage Algorithm for Computing the Median”, Technical Report IBM Research Report RC 2701 (#12713), IBM T J Watson Center, November 1969) proved in 1969 that any deterministic algorithm that computes the exact median in one pass needs temporary storage of at least N/2 elements. Munro and Paterson (J. I. Munro and M. S. Paterson, “Selection and sorting with limited storage”, in Theoretical Computer Science, vol. 12, 1980) proved in 1980 that the minimum space required for any precise algorithm is Θ(N^(1/p)), with p being the number of passes over the data. Accordingly, a precise result with less memory than O(N) may be achieved by implementing more passes over the data. In their proof, Munro and Paterson sketch an algorithm for determining the quantiles in several passes with almost no extra memory.
Disk-Based Sorting.
Another conventional alternative is to write the values to disk and then sort them. However, disk-based sorting is orders of magnitude slower than in-memory operation. Therefore, this is not a viable option for interactive applications where response times matter.
Approximation Algorithms.
In more recent times there have been a number of publications that describe low memory quantile calculations that give up some of the precision requirements in favor of lower memory consumption. Three of these known techniques are now discussed.
    1. Manku, Rajagopalan, and Lindsay (1998)
In 1998, Manku, Rajagopalan, and Lindsay (G. Manku, S. Rajagopalan, B. Lindsay, “Approximate medians and other quantiles in one pass and with limited memory”, in Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data) presented an approximation algorithm as an improvement on Munro and Paterson's 1980 work:
        Space: O((1/ε) log²(εN))
        Runtime: not stated
The error ε bounds, as a fraction of N, how far the rank of a quantile reported by the algorithm may differ from the rank of the real quantile. A reported quantile value for rank r is said to be “ε-approximate” if its real rank is guaranteed to be within [r−εN; r+εN]. This is not to be confused with the number of precise digits of the reported value. Results are proven to be ε-approximate. As seen above, the memory requirement depends on the desired maximum error.
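The ε-approximate condition can be made concrete with a small check. The sketch below is illustrative only: the function name is hypothetical, ranks are assumed to be 1-based, and the reported value is assumed to be an element of the data. With duplicates, a value occupies a range of ranks, so the check passes if any of those ranks falls inside the allowed window.

```python
def is_eps_approximate(values, reported, target_rank, eps):
    """Check whether `reported` is an eps-approximate answer for `target_rank`."""
    n = len(values)
    # With duplicates, `reported` occupies the ranks lowest_rank..highest_rank.
    lowest_rank = sum(1 for v in values if v < reported) + 1
    highest_rank = sum(1 for v in values if v <= reported)
    if highest_rank < lowest_rank:
        return False  # `reported` does not occur in the data at all
    window_low = target_rank - eps * n
    window_high = target_rank + eps * n
    # Accept if the rank range of `reported` overlaps the allowed window.
    return lowest_rank <= window_high and highest_rank >= window_low

values = list(range(1, 101))  # N = 100; the value v has rank v
print(is_eps_approximate(values, 53, target_rank=50, eps=0.05))  # True
print(is_eps_approximate(values, 60, target_rank=50, eps=0.05))  # False
```

In this example εN = 5, so any value whose rank lies between 45 and 55 is an acceptable answer for rank 50, regardless of how many digits of the value itself are correct.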
Manku et al. built upon the algorithm described by Munro and Paterson in 1980. They change one pass of the original algorithm so that this one pass yields a quantile that is correct within the error bounds. After just a single pass they have the approximate quantile. Manku et al. assert that their algorithm needs less space than that of Munro and Paterson. Related patents are U.S. Pat. No. 6,108,658A and U.S. Pat. No. 6,343,288B1.
    2. Greenwald, Khanna (2001)
In 2001, Greenwald and Khanna (M. Greenwald, S. Khanna, “Space-efficient Online Computation of Quantile Summaries”, in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data) presented an algorithm for the space-efficient computation of quantile summaries.
        Space: O((1/ε) log(εN))
        Runtime: high cost (not reported)
Results were proven to be ε-approximate. As seen above, the memory requirement depends on the desired maximum error.
Real-world results have been obtained through a modified version of the algorithm, rather than the one outlined in the Greenwald and Khanna article. With the modified variant, the memory requirements in terms of stored items were about half as big as for the Manku et al. method, but the needed data structures are more complex.
    3. Zhang, Wang (2007)
In 2007, Zhang and Wang (Qi Zhang, Wei Wang, “A Fast Algorithm for Approximate Quantiles in High Speed Data Streams”, in 19th International Conference on Scientific and Statistical Database Management, 2007) presented an algorithm for the computation of approximate quantiles with the following space and time complexities:
        Space: O((1/ε) log²(εN))
        Runtime: O(N log((1/ε) log(εN)))
Zhang and Wang demonstrated through several experiments that their algorithm is about 200 times faster than the Greenwald and Khanna algorithm. The Zhang and Wang algorithm has deterministic bounds on the maximum error. The summary data structure from which the approximate quantile is read as the last step in the execution of the algorithm also contains guaranteed minimum and maximum ranks for all values stored in the summary.
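How an approximate quantile might be read from a summary that stores guaranteed minimum and maximum ranks can be sketched as follows. This is only an illustration of the idea: the `SummaryEntry` type, the `query` function, and the hand-made summary contents are hypothetical and are not taken from the Zhang and Wang paper.

```python
from typing import NamedTuple, Optional, Sequence

class SummaryEntry(NamedTuple):
    value: float
    rmin: int  # guaranteed minimum rank of `value` in the full data set
    rmax: int  # guaranteed maximum rank of `value` in the full data set

def query(summary: Sequence[SummaryEntry], rank: int,
          tolerance: int) -> Optional[SummaryEntry]:
    """Return an entry whose whole rank interval lies within rank +/- tolerance."""
    for entry in summary:
        if rank - tolerance <= entry.rmin and entry.rmax <= rank + tolerance:
            return entry
    return None  # the summary is too coarse for this tolerance

# Hand-made, hypothetical summary of a data set with N = 100 elements.
summary = [
    SummaryEntry(10, 8, 12),
    SummaryEntry(30, 27, 33),
    SummaryEntry(50, 46, 54),
    SummaryEntry(70, 66, 73),
    SummaryEntry(90, 88, 92),
]
print(query(summary, rank=50, tolerance=5))  # SummaryEntry(value=50, rmin=46, rmax=54)
```

Here the tolerance plays the role of εN. Because the entry carries its own rank bounds, the caller learns not only an approximate value but also the interval in which its true rank must lie.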
Precise Results Using an Approximation Algorithm.
In 2001, Fu and Rajasekaran (L. Fu, S. Rajasekaran, “Novel Algorithms for Computing Medians and Other Quantiles of Disk-Resident Data”, in Proceedings of the 2001 International Database Engineering and Applications Symposium) designed and compared different algorithms for computing quantiles on disk-resident data. Their use case is the computation of quantiles from data residing on a disk, where the data is larger than the available main memory. Fu and Rajasekaran assert that in the case of an external algorithm, the key issue is to minimize the number of passes needed to solve the problem. They make use of the Manku et al. algorithm and adapt it to deliver precise results. Fu and Rajasekaran state that “It should be noted here that the original algorithm of Manku et al. was proposed for computing approximate quantiles. We adapt this algorithm for exact selection by using the given rank error guarantee . . . . The paper of Manku et al . . . gives the upper bound of the rank difference between real target and output result. From the error guarantee, we can compute the bounds that bracket the target, thus adapting the approximate quantiling algorithm to the selection problem.”
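The bracketing idea in the quoted passage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, ranks are 1-based, and the one-pass approximate summary is replaced by a simulated stand-in that sorts the data and deliberately answers with a rank error of up to εN. Only the bracketing step and the second pass reflect the technique itself.

```python
import math

def make_simulated_summary(values, eps):
    """Stand-in for an eps-approximate summary (cheats by sorting; demo only)."""
    ordered = sorted(values)
    n = len(ordered)
    max_error = math.floor(eps * n)
    def approx_value_at(rank):
        # Deliberately answer with a rank error of up to eps * N, in either direction.
        error = max_error if rank % 2 == 0 else -max_error
        true_rank = min(max(rank + error, 1), n)
        return ordered[true_rank - 1]
    return approx_value_at

def bracket_from_summary(approx_value_at, rank, n, eps):
    """Derive values low <= target <= high from the rank error guarantee."""
    slack = math.ceil(eps * n)
    # Querying at rank - slack returns a value whose true rank is at most `rank`.
    low = approx_value_at(rank - slack) if rank - slack >= 1 else float("-inf")
    # Querying at rank + slack returns a value whose true rank is at least `rank`.
    high = approx_value_at(rank + slack) if rank + slack <= n else float("inf")
    return low, high

def exact_from_bracket(stream, rank, low, high):
    """Second pass: keep only values inside the bracket, then select exactly."""
    below = 0
    candidates = []
    for v in stream:
        if v < low:
            below += 1            # counted, not stored
        elif v <= high:
            candidates.append(v)  # only these values are held in memory
    index = rank - below - 1
    if not 0 <= index < len(candidates):
        raise ValueError("bracket does not contain the target rank")
    candidates.sort()
    return candidates[index]

values = [(i * 37) % 1000 for i in range(1000)]  # a permutation of 0..999
eps, rank = 0.01, 500
summary = make_simulated_summary(values, eps)
low, high = bracket_from_summary(summary, rank, len(values), eps)
print(exact_from_bracket(values, rank, low, high))  # 499, the exact rank-500 element
```

In this example the second pass keeps only the values between 499 and 519 in memory, rather than all 1000 elements. The size of that candidate set is governed by ε.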
In many application areas the calculation of exact results is essential. In empirical sciences statistical evaluations are at the base of many findings and theories. As data collection in these sciences is often associated with a high cost, the empirical significance of the findings is often at stake. At least the calculations on the data that is obtained have to be right and must not add another source of error. In a business domain many companies base important business decisions on statistical evaluations. It is imperative that costly decisions are based on the correct data.
At the same time, with ever growing volumes of data and data analysis becoming increasingly interactive, it is more important than ever that algorithms operate quickly (by working in memory only and using a fast algorithm with a minimum number of passes) and utilize memory efficiently.
Existing algorithms aim to either: (1) deliver a precise result using a fixed amount of memory by sacrificing runtime performance (for example, multiple passes; Munro and Paterson); or (2) use less memory, but only deliver approximate results (for example, Manku et al.).
The concept of Fu and Rajasekaran of using the approximation algorithm of Manku et al. as an initial step for determining precise quantiles constitutes a mix of both of the points above. It employs an approximation algorithm, but fixes the available memory to 200 KB. Thus, although the authors claim that minimizing the number of passes is essential, the algorithm they use does not provide means for guaranteeing that the number of passes is indeed minimal.
One or more embodiments discussed herein can address the aforementioned problems with traditional systems by fixing the number of passes to a certain number, such as two, and then optimizing the amount of required memory. More specifically, this can be achieved by exploiting properties of an approximation algorithm for preprocessing the data in the first pass in such a way that the second pass is guaranteed to find an exact, precise result with acceptable memory consumption.