Statistical measures play an important role in the analysis of data sets. One general class of such statistical measures consists of the quantiles of a set of data. Taken together, quantiles of different ranks can summarize what data is stored and how it is distributed.
Computers permit rapid evaluation of quantiles of large data sets. While the availability of affordable computer memory (volatile and permanent) is steadily increasing, there continue to be limitations associated with such memory. Typical algorithms either re-order the elements of the data set in place or need additional memory of at least half the size of the original data set. Several conventional techniques, such as those discussed below, provide various quantile determination algorithms.
Simple and Precise Algorithms.
A typical simple determination algorithm sorts the values and then picks the element at the needed position in the sorted array. Such an algorithm needs O(N) space, where N is the number of rows. Assuming, for example, that one datapoint consumes 8 bytes (=64 bits), determining a quantile over N=100 million rows needs 800 MB of temporary memory. On traditional commodity computer hardware, this type of algorithm can therefore be used only with small inputs, or it may require the user to swap out to a disk. The sorting requires O(N log N) runtime. Once the data is sorted, such an approach can determine several quantiles on the data without extra memory or runtime cost.
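The sort-and-pick approach described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `quantiles_by_sorting` is hypothetical, and the nearest-rank convention used to turn a quantile rank into an array position is an assumption, since the text does not fix one.

```python
import math


def quantiles_by_sorting(values, ranks):
    """Return one nearest-rank quantile per entry of `ranks`."""
    if not values:
        raise ValueError("values must not be empty")
    # Sorting a copy costs O(N log N) time and O(N) extra space.
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for q in ranks:
        if not 0.0 <= q <= 1.0:
            raise ValueError("quantile rank must lie in [0, 1]")
        # Nearest-rank convention (an assumption of this sketch).
        index = max(math.ceil(q * n) - 1, 0)
        result.append(ordered[index])
    return result


# Memory estimate from the text: 100 million values of 8 bytes each.
temp_bytes = 100_000_000 * 8   # 800,000,000 bytes, i.e. 800 MB
```

After the single sort, every additional requested rank costs only one array lookup, which is why several quantiles can be read off at no extra sorting cost.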
Selection Algorithms.
Better runtime performance could be achieved by using a “Selection algorithm”, but just like sorting, it will need space proportional to the number of input elements (https://en.wikipedia.org/w/index.php?title=Selection_algorithm&oldid=622007068). Optimizations regarding the needed memory are possible if only a single quantile is requested and that quantile has a very low or very high quantile rank (for example, 0.1 or 0.9).
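Both ideas in the preceding paragraph can be illustrated with a short sketch. The first function is a plain quickselect with a random pivot, one common selection algorithm; it runs in expected linear time but still copies the input. The second shows one possible memory optimization for a very low rank, using the standard-library `heapq.nsmallest`, which retains only the k + 1 smallest values seen so far. The function names are hypothetical, and ranks are expressed here as 0-based positions k rather than as fractions.

```python
import heapq
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based), expected O(N) time."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)   # working copy: space proportional to N
    while True:
        pivot = random.choice(items)
        smaller = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if k < len(smaller):
            items = smaller
        elif k < len(smaller) + equal_count:
            return pivot
        else:
            # Discard everything up to and including the pivot values.
            k -= len(smaller) + equal_count
            items = [x for x in items if x > pivot]


def low_rank_select(stream, k):
    """Return the k-th smallest element (0-based) in one pass,
    keeping only about k + 1 values in memory."""
    smallest = heapq.nsmallest(k + 1, stream)
    if len(smallest) <= k:
        raise IndexError("k out of range")
    return smallest[-1]
```

A very high rank can be handled symmetrically with `heapq.nlargest`. The saving disappears as the rank approaches the median, where k itself is proportional to N.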
Lower Bound for Precise Algorithms.
Pohl (I. Pohl, “A Minimum Storage Algorithm for Computing the Median”, Technical Report IBM Research Report RC 2701 (#12713), IBM T J Watson Center, November 1969) proved in 1969 that any deterministic algorithm that computes the exact median in one pass needs temporary storage of at least N/2 elements. Munro and Paterson (J. I. Munro and M. S. Paterson, “Selection and sorting with limited storage”, in Theoretical Computer Science, vol. 12, 1980) proved in 1980 that the minimum space required for any precise algorithm is Θ(N^(1/p)), with p being the number of passes over the data. Accordingly, an exact result may be obtained with less than O(N) memory by making more passes over the data. In their proof, Munro and Paterson sketch an algorithm for determining the quantiles in several passes with almost no extra memory.
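The trade between passes and memory can be illustrated with a deliberately simple sketch. It is not the algorithm sketched by Munro and Paterson: it assumes integer data and bisects the value range, spending one counting pass per bisection step while holding only a few scalars in memory. The name `multipass_select` and the `make_pass` callable, which is assumed to return a fresh iterator over the data on every call, are hypothetical.

```python
def multipass_select(make_pass, k):
    """Return the k-th smallest integer (0-based) using O(1) extra space.

    Each call to `make_pass()` stands for one full pass over the data.
    """
    lo = hi = None
    n = 0
    for x in make_pass():   # first pass: value range and row count
        lo = x if lo is None or x < lo else lo
        hi = x if hi is None or x > hi else hi
        n += 1
    if not 0 <= k < n:
        raise IndexError("k out of range")
    # Invariant: the answer lies in [lo, hi].
    while lo < hi:
        mid = (lo + hi) // 2
        # One counting pass: how many values are at most `mid`?
        at_most_mid = sum(1 for x in make_pass() if x <= mid)
        if at_most_mid > k:
            hi = mid        # the answer is at most mid
        else:
            lo = mid + 1    # the answer is greater than mid
    return lo


data = [7, 3, 9, 3, 1, 8, 5]
median = multipass_select(lambda: iter(data), len(data) // 2)
```

The number of passes grows with the logarithm of the value range rather than with N, which makes the sketch practical only when repeated scans of the data are cheap.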
Disk-Based Sorting.
Another conventional alternative is to write the values to disk and then sort them. However, disk-based sorting is orders of magnitude slower than in-memory operation. Therefore, this is not a viable option for interactive applications where response times matter.
Approximation Algorithms.
More recently, a number of publications have described low-memory quantile calculations that give up some precision in favor of lower memory consumption. Three of these known techniques are now discussed.