1. Field of Invention
The present invention relates generally to the field of distinct-value estimation. More specifically, the present invention is related to a method for estimating the number of distinct values in a partitioned dataset.
2. Discussion of Related Art
The task of determining the number of distinct values (DVs) in a large dataset arises in a wide variety of settings in computer science, including data integration (for example, see papers to Brown et al. entitled, “Toward automated large-scale information integration and discovery,” and Dasu et al. entitled, “Mining database structure; or, how to build a data quality browser”), query optimization (for example, see the paper to Ioannidis et al. entitled, “The history of histograms,” and the paper to Selinger et al. entitled, “Access path selection in a relational database management system”), network monitoring (for example, see the paper to Estan et al. entitled, “Bitmap algorithms for counting active flows on high speed links”), and OLAP (for example, see the paper to Padmanabhan et al. entitled, “Multi-dimensional clustering: a new data layout scheme in DB2,” and the paper to Shukla et al. entitled, “Storage estimation for multidimensional aggregates in the presence of hierarchies”). The number of distinct values can be computed exactly by sorting the dataset and then executing a straightforward scan-and-count pass over the data; alternatively, a hash table can be constructed and used to compute the number of distinct values. Neither of these approaches scales to the massive datasets often encountered in practice, due to heavy time and memory requirements. A great deal of research over the past twenty-five years has therefore focused on approximate methods that scale to very large datasets. These methods work either by drawing a random sample of the data items and using the observed frequencies of the values in the sample as a basis for estimation (see, for example, the paper to Charikar et al. entitled, “Towards estimation error guarantees for distinct values,” the paper to Haas et al. entitled, “An estimator of the number of species from quadrat sampling,” and the paper to Haas et al. entitled, “Estimating the number of classes in a finite population”) or by taking a single pass through the data and using hashing techniques to compute an estimate using a bounded amount of memory (see, for example, the paper to Alon et al. entitled, “The space complexity of approximating the frequency moments,” the paper to Astrahan et al. entitled, “Approximating the number of unique values of an attribute without sorting,” the paper to Bar-Yossef et al. entitled, “Counting distinct elements in a data stream,” the paper to Durand et al. entitled, “Loglog counting of large cardinalities,” the paper to Estan et al. entitled, “Bitmap algorithms for counting active flows on high speed links,” the paper to Flajolet et al. entitled, “Probabilistic counting algorithms for data base applications,” the paper to Gibbons et al. entitled, “Distinct sampling for highly-accurate answers to distinct values queries and event reports,” the paper to Giroire entitled, “Order statistics and estimating cardinalities of massive data sets,” and the paper to Whang et al. entitled, “A linear-time probabilistic counting algorithm for database applications”).
Almost all of this work has focused on producing a single synopsis of the entire dataset and then using the synopsis to obtain a DV estimate; methods for combining and exploiting synopses in the presence of set operations on partitioned datasets are virtually nonexistent. The present invention provides DV estimation methods in the context of a partitioned dataset, such as the “synopsis warehouse” environment described in the paper to Brown et al. entitled, “Techniques for warehousing of sample data.” In a synopsis warehouse, incoming data is split into partitions, i.e., multisets of values, and a synopsis is created for each partition; the synopses are used to quickly estimate various partition properties. As partitions are rolled in and out of a full-scale warehouse, the corresponding synopses are rolled in and out of the synopsis warehouse. The architecture requires that synopses can be created in parallel, ensuring scalability, and that synopses can be combined to create a synopsis corresponding to the multiset union, intersection, or difference of the corresponding partitions, providing flexibility. The term “partition” is used here in a very general sense. Data may be partitioned—e.g., by time-stamp, by data value, and so forth—for purposes of parallel processing and dealing with fluctuating data-arrival rates. Data may also, however, be partitioned by its source—e.g., SAP customer addresses versus PeopleSoft customer addresses. In the latter scenario, comparison of data characteristics in different partitions may be of interest for purposes of metadata discovery and automated data integration (see, for example, the paper to Brown et al. entitled, “Toward automated large-scale information integration and discovery”). 
For example, DV estimates can be used to detect keys and duplicates in a partition, can help discover subset-inclusion and functional-dependency relationships, and can be used to approximate the Jaccard distance or other similarity metrics between the domains of two partitions (see, for example, the paper to Brown et al. entitled, “Toward automated large-scale information integration and discovery” and the paper to Dasu et al. entitled, “Mining database structure; or, how to build a data quality browser”).
Now, previously-proposed synopses, DV estimators, and methods for handling compound partitions are discussed.
Synopses for DV Estimation
In general, the literature on DV estimation does not discuss synopses explicitly, and hence does not discuss issues related to combining synopses in the presence of multiset operations on the corresponding partitions. One can, however, directly infer potential candidate synopses from the various algorithm descriptions.
Bit-Vector Synopses
The oldest class of synopses comprises various types of bit vectors. The “linear counting” technique (see, for example, the paper to Astrahan et al. entitled, “Approximating the number of unique values of an attribute without sorting,” the paper to Estan et al. entitled, “Bitmap algorithms for counting active flows on high speed links,” and the paper to Whang et al. entitled, “A linear-time probabilistic counting algorithm for database applications”) uses a bit vector V of length M=O(D), together with a hash function h from $\mathcal{D}$ to {1, 2, . . . , M}, where $\mathcal{D}$ denotes the domain of the dataset of interest and $D = |\mathcal{D}|$ is the number of distinct values in the dataset. The function h is applied to each element v in the dataset, and the h(v)th bit of V is set to 1. After the dataset has been scanned, the estimate of D is the total number of 1-bits in V, multiplied by a correction factor. The correction factor compensates for undercounting due to “hash collisions” in which h(v)=h(v′) for v≠v′; see, for example, the paper to Astrahan et al. entitled, “Approximating the number of unique values of an attribute without sorting.” The O(D) storage requirement for linear counting is often prohibitive in applications where D can be very large, especially if multiple DV estimators must be maintained.
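The linear-counting idea can be illustrated with a minimal Python sketch. The hash function (SHA-256 reduced modulo M), the bucket range {0, . . . , M−1}, and the function names are assumptions made for illustration only. The collision correction is written in the form commonly attributed to Whang et al., namely −M ln(fraction of bits still 0), rather than as an explicit multiplicative factor.

```python
import hashlib
import math

def h(value, M):
    """Hypothetical hash: map a value to a bucket in {0, ..., M-1} via SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % M

def linear_count(values, M):
    """Estimate the number of distinct values using an M-bit vector."""
    bits = [False] * M
    for v in values:
        bits[h(v, M)] = True          # duplicates set the same bit again
    zeros = M - sum(bits)
    if zeros == 0:
        raise ValueError("bit vector saturated; choose a larger M")
    # Collision correction: -M * ln(fraction of bits that are still 0).
    return -M * math.log(zeros / M)

data = [i % 1000 for i in range(10_000)]   # 1000 distinct values, 10 copies each
estimate = linear_count(data, M=4096)
```

The sketch makes the storage issue visible: M must grow in proportion to D for the estimate to remain usable.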
The “logarithmic counting” method of Astrahan et al. (in the paper entitled, “Approximating the number of unique values of an attribute without sorting”) and Flajolet et al. (in the paper entitled, “Probabilistic counting algorithms for data base applications”) uses a bit vector of length L=O(log D). The idea is to hash each of the distinct values in the dataset to the set $\{0,1\}^L$ of binary strings of length L, and keep track of r, the position (counting from the left, starting at 0) of the leftmost 0 bit over all of the hashed values. The estimate is roughly of the form $2^r$ (multiplied by a certain factor that corrects for “bias” and hash collisions). This tracking of r is achieved by taking each hashed value, transforming the value by zeroing out all but the leftmost 1, and computing the bitwise-OR of the transformed values. The value of r is then given by the leftmost 0 bit in the resulting bit vector. In the complete algorithm, several independent values of r are, in effect, averaged together (using a technique called “stochastic averaging”) and then exponentiated. Alon et al. in the paper entitled, “The space complexity of approximating the frequency moments” analyze a variant of the logarithmic counting algorithm under an assumption of pairwise-independent hashing. Recent work by Durand and Flajolet in the paper entitled, “Loglog counting of large cardinalities” improves on the storage requirement of the logarithmic counting algorithm by tracking and maintaining r, the position of the leftmost 0, directly. The number of bits needed to encode r is O(log log D), and hence the technique is called LogLog counting.
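The bitwise-OR construction described above can be sketched in Python as follows. This is a single-bitmap illustration without stochastic averaging, so individual estimates are highly variable. The word length L = 32, the SHA-256-based hash, and the function names are assumptions; the constant 0.77351 is the bias-correction constant usually quoted for the Flajolet-Martin estimator.

```python
import hashlib

L = 32            # length of the hashed bit strings (assumed)
PHI = 0.77351     # commonly quoted Flajolet-Martin correction constant

def hash_bits(value):
    """Hypothetical hash to an L-bit integer (first 4 bytes of SHA-256)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def keep_leftmost_one(x):
    """Zero out every bit of x except its leftmost 1 (0 stays 0)."""
    return 1 << (x.bit_length() - 1) if x else 0

def fm_synopsis(values):
    """Bitwise-OR of the transformed hashed values."""
    bitmap = 0
    for v in values:
        bitmap |= keep_leftmost_one(hash_bits(v))
    return bitmap

def leftmost_zero(bitmap):
    """Position r of the leftmost 0 bit, counting from the left, starting at 0."""
    for r in range(L):
        if not (bitmap >> (L - 1 - r)) & 1:
            return r
    return L

def fm_estimate(bitmap):
    """Roughly 2^r, divided by the correction constant."""
    return 2 ** leftmost_zero(bitmap) / PHI

# The synopsis of a union is the OR of the synopses of its parts.
syn_a = fm_synopsis(range(0, 500))
syn_b = fm_synopsis(range(300, 900))
syn_union = syn_a | syn_b
```

Because the synopsis is built entirely from OR operations, the synopsis of a union is obtained by OR-ing the two bitmaps; no comparable operation yields the synopsis of an intersection.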
The main drawback of the above bit-vector data structures, when used as synopses in the setting of a partitioned dataset, is that union is the only supported set operation. One must, e.g., resort to the inclusion/exclusion formula to handle intersections of partitions. As the number of set operations increases, this approach becomes extremely cumbersome, expensive, and inaccurate.
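To make the inclusion/exclusion workaround concrete, the following Python sketch combines two linear-counting bit vectors. The vector length M, the SHA-256-based hash, and the function names are assumptions for illustration. The union synopsis is the bitwise OR of the two vectors, while the intersection has no synopsis of its own and must be estimated as D_A + D_B − D_{A∪B}.

```python
import hashlib
import math

M = 1 << 16   # bit-vector length (assumed)

def bit_vector(values):
    """Linear-counting synopsis, stored in a Python int used as an M-bit vector."""
    bits = 0
    for v in values:
        digest = hashlib.sha256(str(v).encode("utf-8")).digest()
        bits |= 1 << (int.from_bytes(digest[:8], "big") % M)
    return bits

def estimate(bits):
    """Collision-corrected DV estimate; assumes at least one bit is still 0."""
    zeros = M - bin(bits).count("1")
    return -M * math.log(zeros / M)

A = range(0, 3000)
B = range(2000, 5000)                 # true intersection size: 1000
syn_a, syn_b = bit_vector(A), bit_vector(B)

syn_union = syn_a | syn_b             # union is supported directly
union_est = estimate(syn_union)

# Intersection via inclusion/exclusion: three noisy estimates are combined,
# so their errors accumulate relative to the (smaller) intersection size.
intersection_est = estimate(syn_a) + estimate(syn_b) - union_est
```

Each additional set operation adds further terms of this kind, which is the source of the cost and inaccuracy noted above.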
Several authors (for example, see the paper to Ganguly et al. entitled, “Tracking set-expression cardinalities over continuous update streams” and the paper to Shukla et al. entitled, “Storage estimation for multidimensional aggregates in the presence of hierarchies”) have proposed replacing each bit in the logarithmic-counting bit vector by an exact or approximate counter, in order to permit DV estimation in the presence of both insertions and deletions to a dataset. This modification does not ameliorate the inclusion/exclusion problem, however.
Random Samples
Another synopsis possibility is to use a random sample of the data items in the partition (see, for example, the paper to Charikar et al. entitled, “Towards estimation error guarantees for distinct values,” the paper to Haas et al. entitled, “An estimator of the number of species from quadrat sampling,” and the paper to Haas et al. entitled, “Estimating the number of classes in a finite population”). The key drawback is that DV estimates computed from such a synopsis can be very inaccurate, especially when the data is skewed or when there are many distinct values, each having a low frequency (but not all unique); see the paper to Charikar et al. entitled, “Towards estimation error guarantees for distinct values” for a negative result on the performance of sample-based estimators. Moreover, combining synopses to handle unions of partitions can be expensive (see, for example, the paper to Brown et al. entitled, “Techniques for warehousing of sample data”), and it appears that the inclusion/exclusion formula is needed to handle intersections.
Sample-Counting Synopsis
Another type of synopsis arises from the “sample counting” DV-estimation method, also called “adaptive sampling,” credited to Wegman (see the paper to Astrahan et al. entitled, “Approximating the number of unique values of an attribute without sorting” and the paper to Flajolet et al. entitled, “Adaptive sampling”). Here the synopsis for the dataset of interest comprises a subset of $\{h(v) : v \in \mathcal{D}\}$, where h is a hash function as before. In more detail, the synopsis comprises a fixed-size buffer that holds binary strings of length L=log(M), together with a “reference” binary string s, also of length L. The idea is to hash the distinct values in the dataset, as in logarithmic counting, and insert the hashed values into a buffer that can hold up to k>0 hashed values; the buffer tracks only the distinct hash values inserted into it. When the buffer fills up, it is purged by removing all hashed values whose leftmost bit is not equal to the leftmost bit of s; this operation removes roughly half of the hashed values in the buffer. From this point on, a hashed value is inserted into the buffer if and only if the first bit matches the first bit of s. The next time the buffer fills up, a purge step (with subsequent filtering) is performed by requiring that the two leftmost bits of each hashed value in the buffer match the two leftmost bits of the reference string. This process continues until all the values in the dataset have been hashed. The final DV estimate is roughly equal to $K 2^r$, where r is the total number of purges that have occurred and K is the final number of values in the buffer.
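The purge-and-filter procedure can be sketched in Python as follows. The word length L = 32, the SHA-256-based hash, the function names, and the choice to purge as soon as the buffer holds k values are assumptions made for illustration.

```python
import hashlib

L = 32  # bits per hashed value (assumed)

def hash_bits(value):
    """Hypothetical hash to an L-bit integer (first 4 bytes of SHA-256)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def sample_count(values, k, s=0):
    """Return (estimate, buffer, r) for buffer capacity k and reference string s."""
    buffer = set()   # distinct hashed values currently retained
    r = 0            # number of purges so far

    def matches(x):
        # True when the r leftmost bits of x equal the r leftmost bits of s.
        return (x >> (L - r)) == (s >> (L - r))

    for v in values:
        x = hash_bits(v)
        if not matches(x):
            continue                      # filtered out by earlier purges
        buffer.add(x)
        while len(buffer) >= k and r < L:  # buffer full: purge about half
            r += 1
            buffer = {y for y in buffer if matches(y)}
    return len(buffer) * 2 ** r, buffer, r

data = [i % 5000 for i in range(50_000)]   # 5000 distinct values
est, buf, r = sample_count(data, k=256)
small_est, _, small_r = sample_count(range(10), k=256)
```

With the reference string s = 0, every retained hashed value has r leading 0 bits, which is the property the subsequent discussion of related algorithms relies on.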
The algorithms in the paper to Bar-Yossef et al. entitled, “Counting distinct elements in a data stream,” the paper to Gibbons et al. entitled, “Distinct sampling for highly-accurate answers to distinct values queries and event reports,” and the paper to Gibbons et al. entitled, “Estimating simple functions on the union of data streams,” embody the same idea, essentially with a “reference string” equal to 00 . . . 0. Indeed, the number of purges in the sample-counting algorithm corresponds to the “die level” used in the above-described paper to Gibbons et al. One difference in these algorithms is that the actual data values, and not the hashed values, are stored: the level at which a data value is stored encodes the number of leading 0's in its hashed representation. In the paper to Gibbons et al. entitled, “Distinct sampling for highly-accurate answers to distinct values queries and event reports,” the stored values are augmented with additional information. Specifically, for each distinct value in the buffer, the algorithm maintains the number of instances of the value in the dataset (here a relational table) and also maintains a reservoir sample (see, for example, the paper to Vitter et al. entitled, “Random Sampling with a Reservoir”) of the rows in the table that contain the value. This additional information can be exploited to obtain approximate answers, with probabilistic error guarantees, to a variety of SELECT DISTINCT queries over a partition. Such queries include, as a special case, the SELECT COUNT(DISTINCT) query that corresponds to the desired DV estimate. In the paper to Bar-Yossef et al. entitled, “Counting distinct elements in a data stream,” the basic sample-counting algorithm is enhanced by compressing the stored values.
For sample-counting algorithms with reference string equal to 00 . . . 0, the synopsis holds the K smallest hashed values encountered, where K lies roughly between k/2 and k. The variability in K leads to inefficient storage and unstable DV estimates relative to the present invention.
The Bellman Synopsis
In the context of the Bellman system, the authors in the paper to Dasu et al. entitled, “Mining database structure; or, how to build a data quality browser” propose a synopsis related to DV estimation. This synopsis for a partition A comprises k entries and uses independent hash functions $h_1, h_2, \ldots, h_k$; the ith synopsis entry is given by the ith minHash value $m_i = \min_{v \in \mathcal{D}(A)} h_i(v)$, where $\mathcal{D}(A)$ is the value domain of A. The synopsis for a partition is not actually used to directly compute the number of DVs in the partition, but rather to compute the Jaccard distance between partitions. When constructing the synopsis, each scanned data item in the partition incurs a cost of O(k), since the item must be hashed k times for comparison to the k current minHash values.
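A minimal Python sketch of a k-entry minHash synopsis follows. The family of hash functions (SHA-256 salted with the index i) and the function names are assumptions. The comparison step uses the standard minHash estimator, the fraction of matching entries, which is assumed here and not taken from the description of the Bellman system.

```python
import hashlib

def h(i, value):
    """Hypothetical i-th hash function: SHA-256 salted with the index i."""
    digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_synopsis(partition, k):
    """k-entry synopsis: entry i is the minimum of h_i over the partition's values."""
    mins = [None] * k
    for v in partition:                 # each item costs O(k) hash evaluations
        for i in range(k):
            x = h(i, v)
            if mins[i] is None or x < mins[i]:
                mins[i] = x
    return mins

def resemblance(syn_a, syn_b):
    """Fraction of matching entries, an estimate of D_{A∩B} / D_{A∪B}."""
    return sum(a == b for a, b in zip(syn_a, syn_b)) / len(syn_a)

A = range(0, 600)
B = range(300, 900)                     # true ratio: 300 / 900 = 1/3
syn_a = minhash_synopsis(A, k=256)
syn_b = minhash_synopsis(B, k=256)
rho_hat = resemblance(syn_a, syn_b)
```

The nested loop shows where the O(k) per-item cost arises: every scanned item is hashed once per synopsis entry.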
DV Estimators
Prior-art DV estimators have been provided in the context of a single (unpartitioned) dataset, so we discuss prior DV estimators in this setting. The present invention provides an estimator that is superior in this setting, and that also extends to the setting of set operations on multiple partitions.
The motivation behind virtually all DV estimators can be viewed as follows. If D points are placed randomly and uniformly on the unit interval (where D is assumed to be large), then, by symmetry, the expected distance between any two neighboring points is 1/(D+1)≈1/D, so that the expected value of $U_{(k)}$, the kth smallest point, is $E[U_{(k)}] \approx k/D$. Thus $D \approx k/E[U_{(k)}]$. The simplest estimator of $E[U_{(k)}]$ is simply $U_{(k)}$ itself, and yields the basic estimator: $\hat{D}_k^{BE} = k/U_{(k)}$. The simplest connection between the above idea and the DV estimation problem rests on the observation that a hash function often “looks like” a uniform random number generator. In particular, let $v_1, v_2, \ldots, v_D$ be an enumeration of the distinct values in the dataset and let h be a hash function as before. For many hash functions, the sequence $h(v_1), h(v_2), \ldots, h(v_D)$ will look like the realization of a sequence of independent and identically distributed (i.i.d.) samples from the discrete uniform distribution on {0, 1, . . . , M}. Provided that M is sufficiently greater than D, the sequence $U_1 = h(v_1)/M$, $U_2 = h(v_2)/M$, . . . , $U_D = h(v_D)/M$ will approximate the realization of a sequence of i.i.d. samples from the continuous uniform distribution on [0,1]. This assertion requires that M be much larger than D to avoid collisions, i.e., to ensure that, with high probability, $h(v_i) \neq h(v_j)$ for all i≠j. A “birthday problem” argument shows that collisions will be avoided when $M = O(D^2)$. It is assumed henceforth that, for all practical purposes, any hash function that is discussed avoids collisions. The term “looks like” is used in an empirical sense, which suffices for applications. Thus, in practice, the estimator $\hat{D}_k^{BE}$ can be applied with $U_{(k)}$ taken as the kth smallest hash value, multiplied by a normalizing factor of 1/M. The estimator $\hat{D}_k^{BE}$ is biased upwards for each possible value of D.
The present invention provides an unbiased estimator that also has a lower mean-squared error (MSE) than $\hat{D}_k^{BE}$ in the setting of a single dataset.
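The basic estimator k/U_(k) discussed above can be sketched in Python. The 64-bit SHA-256-based hash, the value M = 2^64, the function names, and the fallback to an exact count when fewer than k distinct hashes are seen are all assumptions made for illustration. The sketch implements only the prior-art basic estimator, not the unbiased estimator of the present invention.

```python
import hashlib
import heapq

M = 2 ** 64   # size of the hash range (assumed large enough to avoid collisions)

def h(value):
    """Hypothetical hash into {0, ..., M-1} (first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def k_min_hashes(values, k):
    """Return the k smallest distinct hash values, using O(k) memory."""
    heap = []          # max-heap (stored negated) of the retained hashes
    members = set()    # same hashes, for duplicate detection
    for v in values:
        x = h(v)
        if x in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -x)
            members.add(x)
        elif x < -heap[0]:
            evicted = -heapq.heappushpop(heap, -x)   # drop the current largest
            members.discard(evicted)
            members.add(x)
    return sorted(members)

def basic_estimate(values, k):
    """k / U_(k), with U_(k) the k-th smallest hash value normalized by 1/M."""
    smallest = k_min_hashes(values, k)
    if len(smallest) < k:
        return float(len(smallest))    # fewer than k distinct values: exact count
    return k / (smallest[-1] / M)

data = [i % 20_000 for i in range(40_000)]   # 20000 distinct values
estimate = basic_estimate(data, k=1024)
```

Only the k smallest hash values are retained during the scan, so memory use is bounded by k regardless of D.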
The estimator $\hat{D}_k^{BE}$ was proposed in the paper to Bar-Yossef et al. entitled, “Counting distinct elements in a data stream,” along with conservative error bounds based on Chebyshev's inequality. Both the logarithmic and sample-counting estimators can be viewed as approximations to the basic estimator. For logarithmic counting, specifically the Flajolet-Martin algorithm, consider the binary decimal representation of the normalized hash values h(v)/M, where $M = 2^L$. For example, a hash value h(v)=00100110, after normalization, will have the binary decimal representation 0.00100110. It can be seen that the smallest normalized hash value is approximately equal to $2^{-r}$, so that, modulo the correction factor, the Flajolet-Martin estimator (without stochastic averaging) is $1/2^{-r}$, which roughly corresponds to $\hat{D}_1^{BE}$. Because this latter estimate is unstable, in that $E[\hat{D}_1^{BE}] = \infty$, the final Flajolet-Martin estimator uses stochastic averaging to average independent values of r and hence compute an estimator $\hat{E}$ of $E[\log_2 \hat{D}_1^{BE}]$, leading to a final estimate of $\hat{D} = c\,2^{\hat{E}}$, where the constant c approximately unbiases the estimator. (The new estimators are exactly unbiased.) For sample counting, suppose, without loss of generality, that the reference string is 00 . . . 0 and, as before, consider the normalized binary decimal representation of the hashed values. Thus the first purge leaves behind normalized values of the form 0.0 . . . , the second purge leaves behind values of the form 0.00 . . . , and so forth, the last (rth) purge leaving behind only normalized hashed values with r leading 0's. Thus the number $2^{-r}$ (which has r−1 leading 0's) is roughly equal to the largest of the K normalized hashed values in the size-k buffer, so that the estimate $K/2^{-r}$ is roughly equal to $\hat{D}_k^{BE}$.
The paper to Giroire entitled, “Order statistics and estimating cardinalities of massive data sets” studies a variant of $\hat{D}_k^{BE}$ in which the hashed values are divided into m>1 subsets, leading to m i.i.d. copies of the basic estimator. These copies are obtained by dividing the unit interval into m equal segments; the jth copy of the basic estimator is based on all of the hashed values that lie in the jth segment, after shifting and scaling the segment (and the points therein) into a copy of the unit interval. (Note that for a fixed synopsis size k, each copy of the basic estimator is based on approximately k/m observations.) Each copy of the basic estimator is then subjected to a nonlinear transformation g, and multiplied by a correction factor c. The function g is chosen to “stabilize” the estimator, and the constant c is chosen to ensure that the estimator is asymptotically unbiased as k becomes large. Finally, the i.i.d. copies of the transformed estimators are averaged together. The motivation behind the transformation g is to avoid the instability problem, discussed previously, that arises when k=1. Later, the present invention's proposed estimator is shown to be unbiased for any values of D and k>1, while being less cumbersome to compute. Moreover, when D ≫ k ≫ 0, the estimator provided by the current invention has approximately the minimum possible MSE, and hence is at least as statistically efficient as any estimator in the paper to Giroire entitled, “Order statistics and estimating cardinalities of massive data sets.”
DV Estimators for Compound Partitions
As mentioned above, previous work has mostly focused on DV estimation for a single dataset. To allow for more scalable and flexible data processing, it is convenient to decompose a dataset into a collection of disjoint “base” partitions. A compound partition in a partitioned dataset is formed from two or more of the base partitions via one or more multiset union, multiset intersection, and multiset difference operations. To our knowledge, the only prior discussion of how to construct DV-related estimates for compound partitions is found in the paper to Dasu et al. entitled, “Mining database structure; or, how to build a data quality browser.” DV estimation for the intersection of partitions A and B is not computed directly. Instead, the Jaccard distance $\rho = D_{A \cap B}/D_{A \cup B}$ (called the “resemblance” in the paper to Dasu et al. entitled, “Mining database structure; or, how to build a data quality browser”) is estimated first and then, using the estimator $\hat{\rho}$, the number of values in the intersection is estimated as
$$\hat{D}_{A \cap B} = \frac{\hat{\rho}}{\hat{\rho} + 1}\,(D_A + D_B).$$
Here and elsewhere, $D_X$ denotes the number of distinct values in partition X.
The quantities $D_A$ and $D_B$ are computed exactly, by means of GROUP BY queries; the present invention provides estimators that avoid the need to compute or estimate these quantities. There is no discussion in the paper to Dasu et al. entitled, “Mining database structure; or, how to build a data quality browser” of how to handle any set operations other than the intersection of two partitions. If one uses the principle of inclusion/exclusion to handle other set operations, the resulting estimation procedure will not scale well as the number of operations increases. The present invention's methods handle arbitrarily complex combinations of operations on partitions (multiset unions, intersections, and differences) in an efficient manner.
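The intersection formula attributed above to Dasu et al. follows from the identities D_{A∩B} = ρ·D_{A∪B} and D_{A∪B} = D_A + D_B − D_{A∩B}. A small Python check on exact quantities is sketched below; the function name and the toy partitions are hypothetical, and exact rational arithmetic is used so that the identity can be verified without rounding.

```python
from fractions import Fraction

def intersection_from_resemblance(rho_hat, d_a, d_b):
    """Intersection size implied by the ratio rho = D_{A∩B} / D_{A∪B}."""
    return rho_hat / (rho_hat + 1) * (d_a + d_b)

# Toy partitions (hypothetical): D_A = 6, D_B = 6, intersection 3, union 9.
A = set(range(1, 7))
B = set(range(4, 10))
rho = Fraction(len(A & B), len(A | B))          # exact ratio, here 1/3
est = intersection_from_resemblance(rho, len(A), len(B))
```

When the exact ratio is supplied, the formula recovers the true intersection size; in practice the ratio is an estimate, and D_A and D_B must still be obtained separately.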
Whatever the precise merits, features, and advantages of the prior art, none of them achieves or fulfills the purposes of the present invention.