1. Field of the Invention
The present invention relates generally to computer database systems, and more particularly to systems and methods for finding quantiles in a data stream.
2. Description of the Related Art
Quantiles, which are elements at specific positions in a sorted data stream or database, are of interest to database users, designers, and implementers alike. One reason quantiles are of interest is that they characterize distributions of real world data sets and are less sensitive to outlying data points than are, e.g., the mean value of a data stream, or the variance of a data stream.
As but one example of when a quantile might be useful, a user might want a listing, from a personnel database, of salespeople who are taller than a certain height and who have gross sales above a certain amount. The user would request this information by means of a database query. It is the function of a database management system (dbms) to respond to the query quickly and efficiently. In responding to the query, the dbms typically must reformat the query into a more efficient equivalent query. Then, the dbms evaluates which one of several potential query execution plans would be the most computationally efficient in executing the equivalent query. Because the difference in computational time between an efficient query execution plan and an inefficient plan can be great, it is worthwhile for the dbms to undertake the above-mentioned evaluation.
This is where a knowledge of quantiles in the database can be useful. It happens that in evaluating the efficiency of query execution plans, a dbms relies on statistics that relate to the requested data, and one important statistic is quantiles. To illustrate, suppose in the above example that the amount of gross sales of interest is $500,000, and suppose further that the database contains 100,000 personnel records. If $500,000 is at the 80% quantile of gross sales, the dbms can be assured that its response to the query will contain at most 20,000 records, which statistical information is important for generating and evaluating good query plans.
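The arithmetic in the preceding illustration can be sketched as follows. The function name `max_matching_records` and the use of an integer percentage are assumptions made for illustration only; they do not appear in the specification.

```python
def max_matching_records(total_records: int, quantile_percent: int) -> int:
    """Upper bound on the number of records whose value exceeds a value
    located at the given quantile (expressed as an integer percentage).

    Hypothetical helper: integer arithmetic is used so that the bound is
    not affected by floating-point rounding.
    """
    return total_records * (100 - quantile_percent) // 100


# 100,000 personnel records, threshold at the 80% quantile of gross sales
print(max_matching_records(100_000, 80))
```

A query optimizer could use such a bound to estimate the size of an intermediate result before choosing among candidate execution plans.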
In addition to the above application of quantiles, the ability to determine quantiles has many other applications in the database field. Two such additional applications are database partitioning during parallel processing, and database mining. Thus, the skilled artisan will appreciate that determining quantiles is an important task for many if not most dbms.
Like many other computer tasks, the determination of quantiles must satisfy several practical considerations. Specifically, quantiles should be generated while minimizing the amount of memory space consumed, optimizing computational efficiency, and still producing an exact or at least highly accurate approximate quantile.
First, for computational efficiency it is desirable that the determination of quantiles not require excessive passes over a data stream to sort the data stream. Indeed, requiring only a single pass over a data stream is highly desirable from a computational efficiency viewpoint. Processing data in only a single pass, however, is somewhat challenging in part because no assumptions or guarantees can be made regarding the order of arrival of elements in a data stream or their value distributions. Nevertheless, it is desirable that quantiles be generated in only a single pass without depending on assumptions about the data stream for efficiency or correctness.
Additionally, as stated above the amount of memory required to find quantiles should be minimized. Thus, although one computationally efficient way to find quantiles of a data stream would be to buffer the entire stream in memory and then process the stream, this would require excessive memory and accordingly is not very desirable. Instead, as recognized by the present invention it is desirable to conserve memory, while still promoting computational efficiency.
As also recognized by the present invention, to conserve memory space and at the same time promote computational efficiency, approximate quantiles can be substituted for exact quantiles, depending, of course, on the particular application. For this reason, the present invention recognizes that the accuracy of an algorithm that finds approximate quantiles should be tunable to the level of accuracy required for the application, with its performance degrading gracefully if at all when the accuracy requirements are increased.
In the above-referenced patent application, a method for generating approximate quantiles is disclosed that, unlike the method of Munro et al., in an article entitled “Selection and Sorting with Limited Storage” published in Theoretical Computer Science, 12:315-323 (1980), advantageously does not require more than one pass over the data stream and further, unlike the method disclosed in Agrawal et al., in an article entitled “A One-Pass Space-Efficient Algorithm for Finding Quantiles” published in Proc. 7th Int'l Conf. Management of Data (1995), advantageously guarantees a bound on the approximation error. The method of the above-referenced patent application does, however, require that the size “N” of the input stream be known a priori.
As recognized by the present invention, in practice the size “N” of the input stream in fact might not be known at the outset. As an example, the input stream might be an intermediate table, the size of which might only be crudely estimated, if at all, prior to quantile computation. When the estimate for “N” is bad, the quantile-generating algorithms of previous methods might fail to provide the required approximation guarantee, or indeed might fail to complete execution altogether.
Fortunately, the present invention understands that a scalable, parallelizable, single-pass algorithm can be provided for generating approximate quantiles within predefined error bounds, even when the size “N” of the input stream is not known beforehand, while minimizing memory size requirements. As set forth more fully below, random, non-uniform sampling of the input stream can be used to achieve this result while minimizing memory space overhead.
A method is disclosed for determining at least one approximate quantile of a number of elements in a data set in a single pass over the elements while minimizing memory usage and meeting a desired approximation guarantee with a given probability without knowing the number of elements. At least some of the elements may be sampled non-uniformly, and sampled elements are used to fill input buffers. The number and size of the buffers depend at least on the approximation guarantee (and, preferably, the given probability) but not on the number of elements in the data set.
One or more approximate quantiles are output such that the approximate quantiles meet the approximation guarantee with the given probability.
More rigorously, given user-specified approximate quantile φ, user-specified approximation error ε, and user-specified probability δ, the present invention computes, in a single pass over a data set of unknown size, an ε-approximate φ-quantile with a probability of 1−δ. The φ-quantile of a data set of size N, for φ∈[0,1], is defined to be the data element at position ⌈φN⌉ in the sorted sequence of the data set. An ε-approximate φ-quantile is defined to be any element of the data set whose position lies between the element at position ⌈(φ−ε)N⌉ and the element at position ⌈(φ+ε)N⌉ in the sorted sequence of the data set. As understood herein, several elements of the data set can qualify as an ε-approximate φ-quantile. The value δ∈[0,1] denotes the probability that the present invention fails to report an ε-approximate φ-quantile. Typically, δ lies in the range 0.01 to 0.0001.
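By way of illustration only, the definitions above can be checked against a small, fully buffered data set as sketched below. This is a reference check of the definitions, not the single-pass method of the invention. The function names are hypothetical, the bounds are assumed to be inclusive and clamped to the range 1 to N, and exact rational arithmetic (`Fraction`) is an assumption made here so that the ceiling operations are not disturbed by floating-point rounding.

```python
import math
from bisect import bisect_left, bisect_right
from fractions import Fraction


def exact_quantile(data, phi):
    """Element at 1-indexed position ceil(phi * N) of the sorted data."""
    ordered = sorted(data)
    n = len(ordered)
    position = max(1, math.ceil(phi * n))  # clamp so phi = 0 maps to position 1
    return ordered[position - 1]


def is_eps_approximate(data, candidate, phi, eps):
    """True if `candidate` occupies a sorted position within the range
    ceil((phi - eps) * N) .. ceil((phi + eps) * N), inclusive (assumed)."""
    ordered = sorted(data)
    n = len(ordered)
    low = max(1, math.ceil((phi - eps) * n))
    high = min(n, math.ceil((phi + eps) * n))
    # 1-indexed positions occupied by the candidate (duplicates span a range)
    first = bisect_left(ordered, candidate) + 1
    last = bisect_right(ordered, candidate)
    if last < first:  # candidate is not an element of the data set
        return False
    return first <= high and last >= low


data = list(range(1, 11))  # N = 10
phi, eps = Fraction(1, 2), Fraction(1, 10)
print(exact_quantile(data, phi))
print([x for x in data if is_eps_approximate(data, x, phi, eps)])
```

In this small example more than one element satisfies the check, consistent with the observation that several elements of the data set can qualify as an ε-approximate φ-quantile.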
From another aspect, the invention is a general purpose computer programmed according to the inventive steps herein to determine a desired approximate φ-quantile for elements in a data stream of unknown size, within a user-specified approximation error ε and with a user-specified probability of at least 1−δ. The invention can also be embodied as an article of manufacture (a machine component) that is used by a digital processing apparatus and which tangibly embodies a program of instructions that are executable by the digital processing apparatus to execute the present logic. This invention is realized in a critical machine component that causes a digital processing apparatus to perform the inventive method steps herein.
The invention can be implemented by a computer system including a general purpose computer and one or more input devices associated with the computer for generating a user specification. The specification establishes one or more desired approximate φ-quantiles, a quantile approximation error ε, and a probability of failure δ, such that each approximate φ-quantile is guaranteed to represent a true quantile of a data set and to lie within the quantile approximation error ε with a probability of at least 1−δ. The system also includes a data set having a size that is unavailable to the computer in advance. Further, the system includes computer usable code means that are executable by the computer for determining an ε-approximate φ-quantile data element in the data set. As set forth in detail below, the computer usable code means include means for determining a number b of buffers and a size k of each buffer, and a number h, based at least in part on the permissible approximation error ε and the probability of failure δ. Also, means are provided for sampling the data set based at least in part on the number h to establish sampled data elements for populating buffers. Moreover, means fill empty buffers with sampled data elements to establish a plurality of input buffers, and then means collapse data elements in input buffers into at least one output buffer. Means are provided for outputting, from an output buffer, at least one ε-approximate φ-quantile data element.
In one preferred embodiment, the means for sampling determines a sampling rate r based at least in part on the number h. Also, the means for collapsing can be represented by a data tree defining an integer number of levels, and the system further includes means for establishing a level integer l to be the lowest level of full buffers in the data tree. The filling means is invoked when one or more empty buffers exist, with the level integer l being incremented by unity at least n times, n≥1, when exactly one empty buffer exists. Otherwise, the level integer l is not incremented. Each empty buffer is associated with the level integer l. Means collapse buffers at level l when no empty buffers exist, with the resulting output buffer being associated with the integer l+1.
As also disclosed below in relation to the preferred embodiment, the sampling means sets the sampling rate r equal to unity when the largest level L assigned to any buffer is less than the number h. Otherwise, the means for sampling sets the sampling rate r equal to (½)^(L+1−h). Accordingly, the means for sampling samples the data set at least part of the time non-uniformly. If desired, the computer usable code means dynamically allocates the input buffers.
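The sampling rule just described can be sketched as follows, reading the rate as one half raised to the power L+1−h. The function names are hypothetical. The manner of applying the rate, namely choosing one element at random from each block of 1/r consecutive elements, is an assumption made for illustration and is not stated in the passage above.

```python
import random
from fractions import Fraction


def sampling_rate(largest_level: int, h: int) -> Fraction:
    """Rate r: unity while the largest buffer level L is below h,
    otherwise (1/2) ** (L + 1 - h)."""
    if largest_level < h:
        return Fraction(1)
    return Fraction(1, 2 ** (largest_level + 1 - h))


def sample_blocks(elements, rate: Fraction, rng: random.Random):
    """Assumed sampling scheme: keep one randomly chosen element from each
    block of 1/rate consecutive elements (a trailing partial block is
    sampled as well)."""
    block_size = rate.denominator  # rate has the form 1 / 2**j
    sampled = []
    for start in range(0, len(elements), block_size):
        block = elements[start:start + block_size]
        sampled.append(rng.choice(block))
    return sampled


rng = random.Random(0)
for level in range(2, 6):
    print(level, sampling_rate(level, h=3))
print(sample_blocks(list(range(16)), sampling_rate(4, h=3), rng))
```

Because L can grow while the stream is being read, elements arriving later may be sampled at a lower rate than earlier ones, which is one way the sampling becomes non-uniform over the stream as a whole. The sketch applies a single fixed rate per call for simplicity.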
An output buffer is used as an input buffer for a successive collapsing operation, and the means for collapsing is invoked when all buffers contain k data elements. In a particularly preferred embodiment, the means for collapsing includes means for sorting data in at least some input buffers X1, . . . Xc, with each input buffer defining a respective weight wi that is representative of the number of data elements represented by each element of the input buffer. Selecting means sort data elements from the input buffers for merging, and then means repeatedly increment a counter wi times in response to the means for selecting. Furthermore, an element from an ith input buffer is designated as an output buffer element when the counter is at least as large as a predetermined value. Means collect elements designated as output buffer elements into an output buffer, and then designate the input buffers as empty and the output buffer as a full input buffer for then reinvoking the means for filling to fill with data elements input buffers designated as empty, and the full output buffer is usable as an input buffer by the means for collapsing.
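A minimal sketch of such a collapsing operation follows. The passage above does not state the predetermined values against which the counter is compared; the sketch assumes, for illustration only, targets spaced W apart, where W is the sum of the input weights, beginning at an offset of (W+1)/2 rounded down. The name `collapse` and the representation of a buffer as a (weight, elements) pair are likewise hypothetical.

```python
import heapq


def collapse(buffers):
    """Collapse weighted buffers into one output buffer of the same size k.

    `buffers` is a list of (weight, elements) pairs, each holding k elements.
    Returns (output_weight, output_elements).
    """
    k = len(buffers[0][1])
    total_weight = sum(weight for weight, _ in buffers)
    # Assumed target positions in the weighted, merged sequence.
    offset = (total_weight + 1) // 2
    targets = [j * total_weight + offset for j in range(k)]

    # Sort each buffer, tag each element with its buffer's weight, and
    # merge the tagged sequences in sorted order.
    tagged = [[(x, weight) for x in sorted(elems)] for weight, elems in buffers]
    output = []
    counter = 0
    next_target = 0
    for value, weight in heapq.merge(*tagged):
        # Adding the weight at once is equivalent to incrementing the
        # counter `weight` times.
        counter += weight
        while next_target < k and counter >= targets[next_target]:
            output.append(value)
            next_target += 1
    return total_weight, output


print(collapse([(1, [5, 1, 3]), (1, [2, 6, 4])]))
print(collapse([(2, [10, 20]), (1, [15, 25])]))
```

The output buffer carries the combined weight W, so each of its k elements stands for W elements of the original data, and it can serve as an input buffer in a later collapse.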
In addition to the above, the means for determining b, k, and h minimizes the product b*k subject to at least one constraint. Preferably, the constraint is a function at least of the permissible approximation error ε and the probability of failure δ. In one implementation, the computer usable code means is implemented in a database management system.
In another aspect, for an input data set having at least one true φ-quantile data element, a computer-implemented method is disclosed for generating, in a single pass over the input data set without knowledge of the size of the data set, one or more approximate φ-quantile data elements respectively representative of the true φ-quantile data elements. The approximate φ-quantile data elements differ from the respective true φ-quantile data elements by no more than a user-defined approximation error ε with a probability of 1−δ, where δ is a user-defined probability of failure. The method includes establishing b buffers, each having a capacity to hold k data elements. The integers b and k are related to the approximation error ε and the probability of failure δ. The method also includes alternately filling empty buffers with elements from the data set to establish input buffers and then storing only a subset of the elements in the input buffers into one or more output buffers until the entire input data set is processed, with at least one of the elements being output as the approximate φ-quantile.
In still another aspect, a computer program device includes a computer program storage device that is readable by a digital processing apparatus, and a program on the program storage device that includes instructions which are executable by the digital processing apparatus for determining at least one desired approximate φ-quantile data element for elements in a data set within at least one user defined approximation error ε with a probability of at least 1−δ. The method can be undertaken without using the size of the data set. The method that is undertaken by the program device includes filling at most b empty buffers with at most k elements in the input data set to establish at least some input buffers, with b and k being related to the approximation error ε and to a probability of failure δ and unrelated to the size of the data set. A subset of the elements in the input buffers is stored in at least one output buffer, and an element is identified in a final output based on the desired approximate φ-quantile data element.
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which: