The present invention relates generally to methods of querying a database. More particularly, the present invention relates to systems and methods for estimating substring selectivity in a database.
With the growing importance of the Web, LDAP directory servers, and other text-based information stores, there is an ever greater need to evaluate queries involving string matching. One often wishes to obtain a quick estimate of the number of times a particular substring occurs in a database. A traditional application is for optimizing structured query language (SQL) queries with the xe2x80x9clikexe2x80x9d predicate (e.g. search for all names like xe2x80x9cjonesxe2x80x9d); such predicates are pervasive in data warehouse queries, because of the presence of xe2x80x9cuncleanxe2x80x9d data. Another example of a substring query is a wildcard query, which allows searches for partial matches on strings submitted as a query.
A commonly used data structure for indexing substrings in a database is the suffix tree, which is a trie that satisfies the following property: whenever a string is stored in the trie, then all suffixes of the string are stored in the trie as well. Given a substring query, one can locate all the desired matches using the suffix tree. A count-suffix tree is a variant of the suffix tree, which does not store pointers to occurrences of the substrings, but just keeps a count at the node corresponding to the substring in the tree. The storage requirements, nevertheless, of a full count-suffix tree can be prohibitive. Thus, methods have been proposed for estimating substring selectivity using another variation of the suffix tree: a pruned count-suffix tree (xe2x80x9cPSTxe2x80x9d) which retains only those substrings, and their counts, for which the count exceeds some prune threshold. FIG. 1 sets forth an example PST, with a prune threshold of 5. Labels are presented for (among others) substrings related to the database string xe2x80x9cjonesxe2x80x9d, with counts shown in parentheses for some of the nodes in the PST. Variations of basic selectivity estimation algorithms have been proposed that, given a PST and a substring query, estimate the fraction of the count associated with the root of the tree that satisfies the substring query. See Krishnan, et al., xe2x80x9cEstimating Alphanumeric Selectivity in the Presence of Wildcards,xe2x80x9d Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 282-93, 1996, hereby incorporated by reference for background purposes (the methods collectively referred to herein as the xe2x80x9cKVI algorithmxe2x80x9d).
Histograms have also long been used for selectivity estimation in databases. They have been designed to work well for numeric attribute value domains, and one can obtain good solutions to the histogram construction problem using known techniques. For string domains, however, a histogram bucket that includes a range of consecutive lexicographic values is not likely to produce a good approximation, since the number of times a string occurs as a substring is not likely to be approximately the same for lexicographically successive substrings. Alternative methods have been proposed creating an end-biased histogram over a frequency sort of the attributes. Although the method has a close parallel to a PST, pruned count-suffix trees do better than such methods by taking domain knowledge into account.
Nevertheless, known estimation methods using PSTs can be inefficient and inaccurate. Thus, a need exists for systems and methods for performing substring selectivity estimation that are both more accurate and more efficient than currently-known systems and methods.
The present invention discloses systems and methods for estimating how many records in a database match a certain query condition. In particular, the present invention discloses systems and methods for substring selectivity that take advantage of database structure.
In one embodiment of the present invention, a probability of occurrence is received for each of a plurality of maximal substrings where each substring in the plurality of maximal substrings belongs to the string. All of these received probabilities of occurrence are multiplied together to obtain a single number, that number being called the first estimate. Next, a probability of occurrence for the maximal overlap of each string between adjacent strings in the plurality of substrings is received. The probabilities are all multiplied together to obtain a single number called a normalization factor. Finally, the first estimate is divided by the normalization factor to obtain a resulting estimate of substring selectivity.