1. Field of the Invention
The present invention relates generally to computer systems, and more particularly to systems and methods for determining one or more quantiles for large sets of data while minimizing the amount of memory needed for computation.
2. Description of the Related Art
Order statistics are used to characterize large sets of data either on line, or in storage. One significant statistic that provides information relevant to the characterization of large sets of data is the quantile. A quantile is the element at a specific position in a sorted sequence of data. Quantiles are of interest, for example, to both database designers and database implementers. In this regard, quantiles are of interest because they more realistically characterize distributions of real world data sets and are less sensitive to outlying data points than, e.g., the mean value of a data set, or the variance of a data set.
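By way of illustration only, the definition of a quantile given above can be sketched in a few lines of Python. The function name `exact_quantile`, the parameter `phi`, and the convention that the phi-quantile is the element at 1-based position ceil(phi * N) of the sorted sequence are assumptions made for this sketch; the text above states only that a quantile is the element at a specific position in a sorted sequence. The sketch buffers and sorts the entire data set, and is therefore a baseline for exposition rather than a memory-efficient method.

```python
import math


def exact_quantile(data, phi):
    """Return the element at 1-based position ceil(phi * N) of the sorted data.

    The ceil(phi * N) rank convention is an assumption of this sketch.
    """
    if not data:
        raise ValueError("data must be non-empty")
    if not 0 < phi <= 1:
        raise ValueError("phi must be in the interval (0, 1]")
    ordered = sorted(data)                 # holds the whole data set in memory
    rank = math.ceil(phi * len(ordered))   # 1-based position in sorted order
    return ordered[rank - 1]


if __name__ == "__main__":
    values = [5, 1, 4, 2, 3]
    print(exact_quantile(values, 0.5))  # median of five elements
```

Under this convention the 0.50-quantile of five elements is the third smallest element, and the 1.0-quantile is the maximum.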
As but one example of when a quantile might be useful, a user might want to identify salespersons who are performing well (or poorly) using a personnel database that lists the total sales for each salesperson. Because there are a few salespersons with exceptionally high sales (and no sales below zero), and because total sales are not Gaussian or distributed in accordance with any other well-known statistical distribution, using the average and standard deviation of total sales is not a good method for evaluating salesperson performance. Instead, the user needs to determine the value of the 0.95-quantile (or, to identify poor performers, the 0.05-quantile) of total sales and classify salespersons relative to that value. Similarly, housing prices in a region are typically reported in terms of the median sales price, which is also the 0.50-quantile price, because the sales of a few very high priced homes tend to make the average home sales price not representative of the market as a whole.
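The sensitivity of the average to a few exceptionally high values, as in the sales and housing examples above, can be illustrated with a small sketch. The sales figures below are hypothetical and are not taken from any actual data set; the function name `summarize` is likewise an assumption made for this illustration, which relies only on the standard Python `statistics` module.

```python
import statistics


def summarize(sales):
    """Return (mean, median, number of values below the mean) for a list."""
    mean_sales = statistics.mean(sales)
    median_sales = statistics.median(sales)
    below_mean = sum(1 for s in sales if s < mean_sales)
    return mean_sales, median_sales, below_mean


if __name__ == "__main__":
    # Hypothetical total sales for ten salespersons, one exceptionally high.
    sales = [40, 45, 50, 55, 60, 65, 70, 75, 80, 900]
    mean_sales, median_sales, below_mean = summarize(sales)
    print("mean:", mean_sales)
    print("median (0.50-quantile):", median_sales)
    print("salespersons below the mean:", below_mean)
```

In this hypothetical data a single exceptionally high value pulls the average above nine of the ten salespersons, whereas the median remains near the middle of the typical values.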
Quantile determination has many other applications in the processing of scientific, business, and industrial information. Two such additional applications are database partitioning during parallel processing, and data mining. Thus, the skilled artisan will appreciate that determining quantiles is an important task for processing many if not most data sets.
Like many other computer tasks, the determination of quantiles must take into account several practical considerations. Specifically, quantiles should be generated while optimizing computational efficiency, minimizing the amount of computer main memory space consumed, and still producing an exact or at least quantifiably accurate approximate quantile.
First, for computational efficiency it is desirable that the determination of quantiles not require multiple passes over a data set to perform calculations. Indeed, limiting processing to only a single pass over a data set is highly desirable from a computational efficiency viewpoint. Processing data in only a single pass, however, is somewhat challenging in part because no assumptions can be made regarding the order of arrival of elements from a data set or their value distributions. Nevertheless, it is desirable that quantiles be generated in only a single pass without depending on assumptions about the arrival order of data for efficiency or correctness.
Additionally, as stated above the amount of memory required to find quantiles should be minimized. Thus, although one computationally efficient way to find quantiles of a data set would be to buffer the entire data set in memory and then process the set, this would require excessive memory and accordingly is not very desirable. Instead, it is desirable to conserve memory, while still promoting computational efficiency.
It is possible to conserve memory space and at the same time promote computational efficiency, by substituting approximate quantiles for exact quantiles, depending, of course, on the particular application. It would be desirable that the accuracy of an algorithm that finds approximate quantiles be tunable to the level of accuracy required for the application, with its performance degrading gracefully if at all when the accuracy requirements are increased.
Munro et al., in an article entitled "Selection and Sorting with Limited Storage" published in Theoretical Computer Science, 12:315-323 (1980), disclose finding an exact median while minimizing the number of passes over a data stream. However, because they seek to find an exact median, Munro et al. require more than one pass over the data set. As mentioned above, it is most desirable that only a single pass be required.
Agrawal et al., in an article entitled "A One-Pass Space-Efficient Algorithm for Finding Quantiles" published in Proc. 7th Int'l Conf. Management of Data (1995), disclose a method for finding approximate quantiles in a single pass over a data stream, but without any a priori guarantee on the approximation error. Without such an a priori guarantee, practitioners are reluctant to use algorithms that produce approximate quantiles.