The present invention relates to a method and system for estimating an arbitrary probabilistic location for an unknown distribution using a series of computations based upon a small fraction of the original data set. For univariate densities, a quantile is an example of a probabilistic location.
The problem of estimating an arbitrary probabilistic location for an unknown distribution is commonly studied, given its relevance to numerous industries and applications. Being able to determine or estimate an arbitrary probabilistic location is useful because probabilistic locations can be used to efficiently summarize a distribution. Often, probabilistic locations are needed for the tails of a distribution (i.e. parts of the distribution that are far away from the mode(s) of the distribution). For example, in a univariate case, where minimum or maximum performance standards need to be monitored, the probabilistic location is often near the extreme ends of the distribution.
One of many areas where estimating probabilistic locations is useful is in database management systems. In database management systems, a query of a database can return a number of results. Probabilistic location estimation is useful as it provides an estimation of the number of results of a particular query. Given an accurate probabilistic location estimate, the amount of time taken for the query can be reduced by structuring the query appropriately to reduce the number of results. See e.g. U.S. Pat. No. 5,664,171 and U.S. Pat. No. 5,864,841, both to Agrawal et al. Another area where monitoring probabilistic locations is useful is in monitoring data flowing down a channel, for example data packets flowing through a network.
Generally, problems with prior art methods of probabilistic location estimation have included the size of the required data set (and corresponding amount of memory required), the lack of accuracy of estimates (particularly with respect to the extreme ends), and the computational efficiency of the method.
Several methods for estimating quantiles currently exist, all of which have problems. These problems include the impracticality of estimating extreme quantiles, the inability to be easily extended to arbitrary quantiles without ignoring part of the data to get a desired sample size, and restrictions on the size of data sets. Many methods focus primarily on estimating the median of the distribution, and only the Stochastic Approximation (“S.A.”) method is easily extended to estimate an arbitrary quantile. The main drawbacks of the S.A. method are that its accuracy depends on an initial sample and that it allows for estimates which are outside the range of possible values. Thus, the S.A. method also performs poorly when estimating extreme quantiles.
Other methods for estimating quantiles exist, for example the method proposed in U.S. Pat. No. 6,343,288 and U.S. Pat. No. 6,108,658, both to Lindsay et al. Lindsay et al. disclose a method of estimating a quantile that requires only a single pass over the data set and does not require knowledge of the size of the data set. Despite these advantages, the Lindsay et al. methodology still requires significant processing and significant memory.
Quantile estimation for an unknown distribution is a commonly studied problem. Pfanzagl (1974) showed that when nothing is known about a distribution of interest, the sample quantile has the minimum asymptotic variance among translation invariant estimators of the population quantile. Desirable as this property is, the sample quantile becomes cumbersome, and in many cases impractical, to obtain when the data set becomes large, both in terms of storage space and computation time. In this specification we introduce a single-pass, low-storage method of estimating an arbitrary quantile, based on a sequential scoring algorithm that combines estimated ranks and assigned weights, where the weights represent, in some sense, the information associated with each estimated rank.
Massive datasets are becoming more and more common in modern society. They arise from sources as diverse as large call centers, internet traffic data, sales transactional records, or satellite feeds. Thus there is a clear need to process the data accurately and efficiently, so that current analyses can be completed before the analyst is inundated by a continually growing store of data.
Applications of the present invention include, but are not limited to, query optimization for large databases and network routing problems. Manku, Rajagopalan, and Lindsay (1998) note that it is common in the database field to keep summaries of the variables in the form of equi-depth histograms. However, creating and maintaining these histograms can be quite costly. Another application of the present invention is in the area of network routing. Network routing decisions are improved by having more accurate summaries of the distributions of the historical traffic data, in particular of the tails of these distributions such as is provided by the present invention. A further application, as noted in Dunn (1991), is in the computation through simulation of critical values and percentile points of new statistics whose distributions are unknown. A further application is in the area of MCMC estimation where simulations routinely generate massive amounts of data. The present invention contemplates these and other applications.
We start our discussion by putting forward notation and definitions that will be used throughout the specification. Let $X_1, \ldots, X_n$ be a sample from a distribution $F$, where we assume $F$ is continuous so that all observations are unique with probability 1. Let the order statistics $X_{(1)} < \cdots < X_{(n)}$ be the observations arranged in ascending order. The $p$th population quantile of a distribution $F$ is defined as $\xi_p = F^{-1}(p) = \inf\{x : F(x) \ge p\}$, and the $p$th sample quantile as $\hat{\xi}_p = X_{(k)}$, where $k = \lceil np \rceil$ is the smallest integer greater than or equal to $np$, for $0 < p < 1$. Hence a sample quantile can be obtained by simply sorting the data and taking the appropriate order statistic. However, as the size of the dataset becomes large, computation and storage burdens make this method infeasible.
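The sort-and-select definition of the sample quantile can be illustrated with a minimal Python sketch. The function name `sample_quantile` is hypothetical and used only for illustration. The sketch holds and sorts the entire data set, which is the storage and computation burden noted above.

```python
import math


def sample_quantile(data, p):
    """Return the p-th sample quantile X_(k), with k = ceil(n * p), 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    ordered = sorted(data)  # requires all n observations in memory
    if not ordered:
        raise ValueError("data must be non-empty")
    k = math.ceil(len(ordered) * p)  # 1-based rank of the order statistic
    return ordered[k - 1]            # convert the rank to a 0-based index


# Example: with n = 5 and p = 0.5, k = ceil(2.5) = 3, the third smallest value.
median_estimate = sample_quantile([5, 1, 4, 2, 3], 0.5)
```

Because the result is always an order statistic, it is necessarily an element of the data set.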
Hurley and Modarres (1995) offer a nice survey of current methods for estimating quantiles. Most of the methods reviewed in this survey focus on estimating the median of a distribution, and in practice only one method, the Stochastic Approximation (S.A.) method introduced in Tierney (1983), is easily extended to estimate an arbitrary quantile. In addition to reviewing current methods, Hurley and Modarres (1995) introduce a histogram based method for estimating quantiles. Their proposed method has many attractive qualities, in particular for estimation of the median. However, for estimation of quantiles other than the median their method has a non-zero probability of having to be repeated, and hence requiring more than one pass through the data set, in order to obtain an appropriate estimate of the quantile. Extending their method so that it can be used to estimate extreme quantiles (quantiles with values of p close to 0 or 1), would result in an increased probability of requiring more than one pass through the data set, making it impractical for estimating extreme quantiles.
Pearl (1981) proposed using a minimax tree to estimate an arbitrary quantile. While this method is easy to implement and utilizes very little storage space, it has the drawback that it will only work for sample sizes that can be specified in terms of the three parameters which describe the tree. As a result, this method cannot easily be extended to arbitrary quantiles without ignoring part of the data in order to get a desired sample size. Rousseeuw and Bassett (1990) proposed the remedian method of quantile estimation. As with the minimax tree method, there are restrictions on the size of data sets that can be analyzed using this method. The remedian method can be extended to other quantiles (see Chao and Lin (1993)); however, this extension is not easily accomplished in practice.
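The sample-size restriction of the remedian can be seen in a short Python sketch. It assumes the usual base-b formulation, in which k buffers of size b hold medians of medians and the sample size must equal b raised to the power k. The function name `remedian` and its parameters are hypothetical, and the sketch covers median estimation only, not the extension to other quantiles.

```python
from statistics import median


def remedian(stream, b, k):
    """Median-of-medians estimate using k buffers of size b (needs n == b**k)."""
    buffers = [[] for _ in range(k)]
    count = 0
    for x in stream:
        count += 1
        if count > b ** k:
            raise ValueError("more than b**k observations supplied")
        buffers[0].append(x)
        level = 0
        # When a buffer fills, pass its median up one level and empty it.
        while level < k - 1 and len(buffers[level]) == b:
            buffers[level + 1].append(median(buffers[level]))
            buffers[level].clear()
            level += 1
    if count != b ** k:
        raise ValueError("sample size must equal b**k")
    return median(buffers[-1])


# Example: b = 3, k = 2 handles exactly 9 observations with 6 storage slots.
estimate = remedian(range(1, 10), b=3, k=2)
```

Storage is fixed at b times k values, but a data set whose size is not exactly b to the power k must be truncated or padded before the estimate is defined, which is the restriction described above.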
Alternatively, the S.A. method proposed by Tierney (1983) is quite accurate, straightforward to implement for arbitrary sized datasets, and easily extensible to estimate arbitrary quantiles. The main drawbacks of the S.A. method are that its accuracy depends on an initial sample and that it allows for estimates which are outside of the range of possible values. Because the accuracy depends on getting an initial sample that has a quantile close to the sample quantile of the entire data set, the S.A. method performs poorly when estimating extreme quantiles. This is a weakness that can only be overcome by increasing the size of the initial sample, which can lead to the same challenges associated with the sample quantile. With regard to the bounds of the S.A. estimator, if one were estimating a left tail quantile for data generated by a $\chi^2_1$ distribution, there is nothing to prevent this estimator from returning a negative value for an estimate, since the method does not return an actual element of the data set.
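The out-of-range behavior can be illustrated with a simplified stochastic approximation recursion in Python. This is a generic Robbins-Monro style update with a fixed step scale, offered only as a sketch; it is not Tierney's full procedure, and the function name `sa_quantile`, the `step_scale` parameter, and the sample values are assumptions made for illustration.

```python
def sa_quantile(stream, p, initial_estimate, step_scale=1.0):
    """Simplified stochastic approximation recursion for the p-th quantile."""
    estimate = initial_estimate
    for n, x in enumerate(stream, start=1):
        # Move down when the observation falls at or below the estimate,
        # and up otherwise, with a step size that shrinks as 1/n.
        indicator = 1.0 if x <= estimate else 0.0
        estimate += (step_scale / n) * (p - indicator)
    return estimate


# Illustrative, strictly positive observations (made-up values).
observations = [0.02, 0.5, 1.3, 0.7, 2.1]
estimate = sa_quantile(observations, p=0.05, initial_estimate=0.1)
# The first update subtracts (1 - p) from the starting value, so the
# running estimate drops below zero although every observation is positive.
```

Because the recursion moves a running value rather than selecting an observation, nothing constrains the result to the range of the data, and the starting value strongly influences early steps.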
We note that all of the methods under consideration here use a small fixed amount of storage. We will not be considering at this time methods that use a non-fixed amount of storage. An example of a method in this category is given by Dunn (1991). All of the existing methods considered here perform very accurately and efficiently for median estimation. Some, such as the S.A. method and the histogram-based method, are well suited to handle datasets of any size, whereas others are better suited to situations where the sample size is large but static. Further, Tierney's S.A. method is easily extensible to estimate arbitrary quantiles, although for tail quantiles the variability increases as one moves further out into the tails. Although these methods are in general very good, we feel that there is room for improvement in the area of tail quantile estimation in general and in particular with regard to the variability of the estimators relative to that of the sample quantile.
Therefore it is a primary object, feature, or advantage of the present invention to improve upon the state of the art.
Another object, feature, or advantage of the present invention is to provide a method and system for estimating a probabilistic location.
A further object, feature, or advantage of the present invention is to provide a method and system for estimating a probabilistic location that estimates accurately.
A still further object, feature, or advantage of the present invention is to provide a method and system for estimating a probabilistic location that is efficient.
Another object, feature, or advantage of the present invention is to provide a method and system for estimating a probabilistic location that does not require knowledge of the size of the data set.
Yet another object, feature, or advantage of the present invention is to provide a method and system for estimating a probabilistic location that requires only a small amount of storage.
A further object, feature, or advantage of the present invention is to provide a method and system for estimating a probabilistic location that is capable of use with massive data sets.
These and other objects, features, and advantages of the present invention will become apparent from the specification and claims that follow.