The present invention relates to query processing of textual data stored in miultidimensional data sets and, more particularly, to a method of estimating co-occurrence of query substrings across the dimensions of the data set.
In recent years, a new suite of services (such as LDAP directory services), standards (such as XML), and applications (such as E-commerce) have emerged, due at least in part to the proliferation of the Internet. Handling large amounts of text (as opposed to numeric data) is central to such Internet-related technology. Thus, there has been a resurgence of interest in the storage, management and query processing of textual data.
In many applications involving databases that process textual data, users pose (sub)string queries, which may either search for exact matches, or contain wildcards. For example, in an E-commerce catalog search, a user might inquire about all teal colored polo shirts that were sold during the month of June, 2001xe2x80x94this would be an example of an exact match on two separate attributes, namely, color and sale status (i.e., a two-dimensional query). In an LDAP directory search, a user might inquire about all people whose first name begins with the letters xe2x80x9cJoxe2x80x9d (i.e., a xe2x80x9cprefixxe2x80x9d match), and whose phone number contains the sequence of digits xe2x80x9c360xe2x80x9d (a xe2x80x9csubstringxe2x80x9d match). These examples can be expressed in structured query language (SQL) using the LIKE clause: WHERE color LIKE xe2x80x98tealxe2x80x99 AND sales LIKE xe2x80x98062001xe2x80x99 and, respectively, WHERE name LIKE xe2x80x98Jo%xe2x80x99 AND phone LIKE xe2x80x98%360%xe2x80x99. It is to be noted that the queries can specify any combination of exact, prefix, suffix, or proper substring matches.
In order to optimize such queries, particularly multidimensional queries, it is often useful to obtain fast and accurate estimates for their result sizes. One problem in the field of multidimensional substring selectivity estimation is related to the estimation of the fraction of tuples in the database that contain the query string as a substring in each dimension (also referred to as xe2x80x9cattributexe2x80x9d), where prefix and suffix constraints can easily be reduced to substring constraints. Such an estimate may also suffice as an approximate answer to a COUNT query that returns the number of tuples in the database which contain the query as a substring in each attribute. Further, fast and accurate estimates for multidimensional string selectivity estimation may also help in refining queries in an online data analysis environment. As in any selectivity estimation, multidimensional string estimation methods must provide acceptable accuracy (for example, no more than 10-20% error) using a data structure that is allowed only a limited amount of space (i.e., small enough to reside in a main memory, but typically, significantly smaller). In addition, the time needed to build the data structure should not be prohibitive so that periodic rebuilding is feasible. Finally, online estimation must be fast (on the order of a fraction of a second), even when working with a massive data structure.
Multidimensional selectivity estimation has been an active area of research for many years. Most work in this area, however, has focused on numerical attributes and has assumed the existence of a mapping from multidimensional categorical data to fully ordered domains. In the case of multidimensional string queries, however, such mapping is of no use. For example, if the strings are sorted lexicographically, the substrings are not necessarily ordered. In the extreme of presuming that all substrings are explicitly represented and the frequency of individual points is approximated in the multidimensional space using standard techniques, the domain would be so large and the frequencies so small as to render the techniques impractical. An end-biased histogram approach may be used as an alternative, retaining only the substrings with the highest counts, subject to constraints on space, and approximating the other substrings assuming uniformity. Since the total number of substrings is very large, this approach would be very close in accuracy to one that makes the uniformity assumption, which is known to be highly inaccurate. Moreover, the need to retain all substrings with high counts becomes aggravated as the dimensionality increases.
New approaches for selectivity estimation tailored to the string domain have been developed in recent years. These approaches share a common framework of first performing a precomputation to store the number of tuples that contain the most frequently co-occurring substrings, defined as the xe2x80x9ccross-countsxe2x80x9d. Online estimation then involves parsing the query into subqueries such that the cross-count for each subquery is available from the precomputation process. The effectiveness of any particular approach within this framework relies on the prudent utilization of cross-counts.
One exemplary prior art technique for estimating cross-counts of string data is based on a variant of suffix trees and is referred to as the xe2x80x9ccount-suffix treesxe2x80x9d method, where each node in the tree is augmented with the count of occurrences of its associated substring. The method maintains k suffix trees (one for each dimension), and a matrix for storing the cross-counts, that is, the number of occurrences of all substring combinations from each dimension. In order to limit the space used, a xe2x80x9cpruning thresholdxe2x80x9d may be defined, and each suffix tree pruned to form a Pruned Suffix Tree (PST), each PST having at most m nodes (where, for simplicity, the value of m is uniform across the dimensions). FIG. 1 illustrates the pair of PSTs for the data set: (ab, 12), (abc, 123), (bc, 123), (ab, 23). FIG. 1(a) includes the PST for the alpha data dimension, and FIG. 1(b) includes the PST for the numeric data dimension. Each node is defined by its substring, and illustrated within each node is the number of occurrences of that substring in the data set. In this case, the alpha data dimension is parsed into ab, b, and c, and the numeric data dimension is parsed into 12, 2 and 3. This parsing is a form of xe2x80x9cgreedy parsingxe2x80x9d (i.e., includes overlap), and is illustrated diagrammatically in FIG. 2(a).
FIG. 3 contains a matrix of each of the cross-counts between the parsed substrings. For example, the cross-count between ab and 2 is xe2x80x9c3xe2x80x9d, meaning that there are three different substrings that include the combination of ab with 2. As with the trees illustrated in FIG. 1, the matrix of FIG. 3 is also xe2x80x9cprunedxe2x80x9d, as shown by the line through the row of cross-counts associated with the substring 123 and the column of cross-counts associated with the substring abc.
Using the above example of a multidimensional (2-dimensional) query qxe2x80x94(abc, 123), abc is parsed into pieces ab and c; 123 is parsed into pieces 12 and 3, where the query is used to determine the xe2x80x9ccross-countxe2x80x9d between abc and 123. The subqueries resulting from the greedy parsing come from the cross-product of the pieces: (ab, 12), (c, 12), (ab, 3) and (c, 3). The associated subquery selectivities are then multiplied based on the independence assumption as follows:                               Pr          ⁢                      {                          (                              abc                ,                123                            )                        }                          =                  xe2x80x83                ⁢                  Pr          ⁢                      {                          (                              ab                ,                12                            )                        }                    xc3x97          Pr          ⁢                      {                          (                              c                ,                12                            )                        }                    xc3x97          Pr          ⁢                      {                          (                              ab                ,                3                            )                        }                    xc3x97          Pr          ⁢                      {                          (                              c                ,                3                            )                        }                                                  =                  xe2x80x83                ⁢                                            2              /              4                        xc3x97                          2              /              4                        xc3x97                          2              /              4                        xc3x97                          2              /              4                                =                      1            /            16.                              
Of course, this solution of xe2x80x9c{fraction (1/16)}xe2x80x9d is an exact solution, since this data set contains only four elements and it is straightforward to calculate each quantity.
An alternative method known in the prior art is denoted xe2x80x9ck-d count-suffix treesxe2x80x9d (or xe2x80x9cKDxe2x80x9d), in which each node corresponds to substring combinations from each dimension, and each node is augmented with the substring counts. FIG. 4 illustrates an exemplary pruned k-d count-suffix tree for the data set example defined above (that is, for the four element data set (ab, 12), (abc, 123), (bc, 123), (ab, 23)). Since the capture of all combinations would require space exponential in k, a pruned data structure is generally employed, where an exemplary pruning cut line is shown in FIG. 4. In the KD method, a query q is parsed using the maximal overlap (MO) principle, where in a greedy fashion the query q is broken into a plurality of subqueries overlapping in multiple dimensions. Each subquery is defined as the match in the tree that most overlaps the prior subquery. The cross-count for each such subquery is obtained by a simple lookup in the k-d tree for the node associated with the subquery. These cross-counts can then be combined using conditional probabilities of the subqueries based on the inclusion-exclusion principle from set theory. In particular,                               Pr          ⁢                      {                          (                              abc                ,                123                            )                        }                          =                  xe2x80x83                ⁢                              Pr            ⁢                          {              I              }                        xc3x97            Pr            ⁢                          {              II              }                        xc3x97            Pr            ⁢                          {              III              }                        xc3x97            Pr            ⁢                          {                              I                ⋂                II                ⋂                III                            }                                            Pr            ⁢                          {                              I                ⋂                II                            }                        xc3x97            Pr            ⁢                          {                              I                ⋂                III                            }                        xc3x97            Pr            ⁢                          {                              II                ⋂                III                            }                                                              =                  xe2x80x83                ⁢                                                            2                4                            xc3x97                              2                4                            xc3x97                              2                4                            xc3x97              1                                                      3                4                            xc3x97                              3                4                            xc3x97                              3                4                                              =                      8            27                              
Both of the above-described prior art methods share the following common framework: (1) preprocessing: a compact structure is constructed by these methods to store cross-counts of substring occurrences in the database, employing pruning techniques to reduce the size; (2) query parsing: a query q is parsed into smaller subqueries qusb that match nodes in the pruned structure and have associated cross-counts; (3) cross-count lookup: the count cj associated with each qsub, that is, the number of times qsub occurs in the database, which is determined from the stored structure; and (4) probabilistic estimation: a probabilistic formula is used to algebraically xe2x80x9ccombinexe2x80x9d the cj""s to derive the selectivity estimate Pr(q). For example, in the first method, preprocessing builds PSTs and the cross-count matrix, parsing is greedy, cross-count lookup involves matrix look-up, and estimation is a product operation. In the latter method (the KD method), preprocessing builds a pruned k-d count-suffix tree, parsing is based on maximal overlap, cross-count look-up involves traversing the tree, and estimation is a more sophisticated production operation given by inclusion-exclusion.
While both of these methods are effective when working with only two dimensions and a relatively small number of tuples in the data set, they suffer from a xe2x80x9cspace dimensionality explosionxe2x80x9d, since both methods may consume an exponential amount of storage space for storing cross-counts. In order to cope with space constraints, pruning is applied as discussed above. In doing so, these methods then only maintain cross-counts for combinations of short substringsxe2x80x94as dimensions increase, these substrings get significantly shorter. As a result, the probability of locating any given subquery is small and queries wind up being parsed into many small subqueries. Thus, these known methods end up relying on probabilistic estimation for their overall accuracy, where this has proven to be inaccurate.
Thus, a need remains in the prior art for a reasonable method of determining cross-counts in multidimensional data sets that neither xe2x80x9cexplodesxe2x80x9d as the number of dimensions increase nor yields results that are inaccurate and unable to be used.
The need remaining in the prior art is addressed by the present invention, which relates to multidimensional substring selectivity estimation and, more particularly, to an estimation technique that utilizes set hashing to represent the tuples containing a given substring and the generation of cross-counts, in real time, from an intersection of the hash values.
The methodology of the present invention is set-oriented, where pruned suffix tress (PSTs) are first constructed on each dimension. A set is created for each node in each tree, the set containing the row identifiers (RIDs) of each tuple containing that node in its substring. Since each of these sets may be as large as N (the number of tuples in the database), a hash function is used to compact the set value. Cross-counts are then approximated by estimating particular hash value set intersections.