Relational database queries often include equality or LIKE selection predicates over string attributes. Existing techniques for estimating selectivities of string predicates are biased towards underestimating selectivities. String-valued data has become commonplace in relational databases, as have complex queries with selection predicates over string attributes. An example of a selection predicate over a string attribute is Author.name like % ullman %. Query optimizers rely heavily on estimates of the selectivity of query predicates. As a result, selectivity estimation of string predicates has been used to define query execution plans.
One common class of string predicates is called wildcard predicates. Wildcard predicates are of the form R.A like % s %, where A is a string-valued attribute of a relation R. Techniques have been proposed for estimating the selectivity of wildcard predicates. Some prior techniques build summary structures, such as pruned suffix trees or Markov tables. These summary structures record the frequency of selected strings. The frequency of a string in a relation attribute is the number of attribute values that include the string. The set of string-frequency pairs retained varies with the summary structure. At run time, one existing technique for estimating the selectivity of a string predicate R.A like % s % involves two parts: (i) parsing the query string s into possibly overlapping substrings s1, . . . , sk whose frequencies can be looked up in the summary structure, and (ii) combining the selectivities of the overlapping substring predicates to estimate the selectivity of the original query predicate.
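The notion of string frequency defined above can be illustrated with a minimal sketch. It assumes an unpruned, in-memory dictionary as a stand-in for a summary structure; the function names, the `max_len` cutoff, and the sample values are hypothetical and do not come from any of the cited techniques.

```python
def substring_frequencies(values, max_len):
    """Map each substring of length <= max_len to the number of
    attribute values that contain it (each value counts at most once)."""
    freq = {}
    for v in values:
        seen = set()  # substrings of this value, de-duplicated
        for i in range(len(v)):
            for j in range(i + 1, min(i + max_len, len(v)) + 1):
                seen.add(v[i:j])
        for s in seen:
            freq[s] = freq.get(s, 0) + 1
    return freq


def true_selectivity(values, s):
    """Exact selectivity of `A like % s %` by scanning every value."""
    return sum(1 for v in values if s in v) / len(values)


# Hypothetical attribute values, for illustration only.
names = ["ullman", "pullman", "widom", "garcia-molina"]
freq = substring_frequencies(names, max_len=3)
```

A real summary structure would retain only a subset of these string-frequency pairs so that it fits in the allocated space.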
To combine the selectivities of the substring predicates, existing techniques mainly rely either on an independence assumption or on a Markov assumption. Under the independence assumption, the selectivity of a string predicate R.A like % si % is independent of that associated with sj, for all j ≠ i. Under the Markov assumption, the selectivity of a string predicate R.A like % si % depends only on that of R.A like % si−1 %.
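The two assumptions can be written out as follows. The notation is an assumption introduced here for illustration: σ(s) denotes the selectivity of R.A like % s %, the query string s is parsed into s1, . . . , sk, and Pr(si | si−1) denotes the conditional selectivity of si given si−1, however a particular technique chooses to estimate it.

```latex
% Independence assumption: the substring selectivities simply multiply.
\sigma(s) \approx \prod_{i=1}^{k} \sigma(s_i)

% Markov assumption: each substring is conditioned only on its predecessor.
\sigma(s) \approx \sigma(s_1) \prod_{i=2}^{k} \Pr(s_i \mid s_{i-1})
```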
The paper Krishnan et al., Estimating alphanumeric selectivity in the presence of wildcards, Proc. 1996 ACM SIGMOD Intl. Conf. on Management of Data, pp. 282–293, 1996 (herein “Krishnan paper”) discloses one approach to estimating selectivity. The Krishnan paper discloses the use of suffix trees for summarizing string values in a column. For a given relational attribute, a suffix tree is built to maintain frequencies of all suffixes of attribute values. The suffix tree is pruned so that it fits in the allocated amount of space. The pruned suffix tree retains only the most frequent substrings of attribute values. For estimating the frequency of a query string s, the Krishnan paper discloses dividing the query string s into disjoint substrings s1, . . . , sk such that each substring si occurs in the suffix tree. The Krishnan paper assumes that whether an attribute value contains si as a substring is independent of whether it contains some other substring sj. The estimated selectivity of the query string is the product of the selectivities of the s1, . . . , sk substrings. The Krishnan paper also considers weighted combinations of estimates of suffixes, where the weight of an estimate is proportional to a suffix's length.
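The disjoint-parse-and-multiply idea can be sketched as below. This is a simplified illustration rather than the Krishnan paper's algorithm: a plain dictionary stands in for the pruned suffix tree, the greedy longest-match parsing rule and the `default_count` fallback for unknown pieces are assumptions, and the counts are made up.

```python
def greedy_parse(s, summary):
    """Split s into disjoint pieces, each the longest prefix of the
    remaining text that is present in `summary`."""
    pieces = []
    i = 0
    while i < len(s):
        for j in range(len(s), i, -1):  # try the longest candidate first
            if s[i:j] in summary:
                pieces.append(s[i:j])
                i = j
                break
        else:
            # No prefix is known: emit one character as an unknown piece.
            pieces.append(s[i])
            i += 1
    return pieces


def independence_estimate(s, summary, n_rows, default_count=1):
    """Multiply the selectivities of the disjoint pieces of s."""
    est = 1.0
    for piece in greedy_parse(s, summary):
        est *= summary.get(piece, default_count) / n_rows
    return est


# Hypothetical counts of attribute values containing each substring.
summary = {"ull": 20, "man": 50}
pieces = greedy_parse("ullman", summary)
estimate = independence_estimate("ullman", summary, n_rows=1000)
```

With these made-up counts, the query string is parsed into "ull" and "man", and the estimate is the product of 20/1000 and 50/1000.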
The paper Jagadish et al., Substring selectivity estimation, Proc. of the 18th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 249–260, 1999 (herein “Jagadish substring selectivity estimation paper”) discloses relaxing the independence assumption relied upon in the Krishnan paper. The Jagadish substring selectivity estimation paper relies on the Markovian “short memory” assumption. According to the Markovian assumption, the probability of an attribute value v containing a substring si+1 depends only on whether v contains the substring si, and not on the earlier substrings. Furthermore, the Jagadish substring selectivity estimation paper allows adjacent substrings to overlap.
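A minimal sketch of a Markov-style estimate with overlapping substrings follows. It is an illustrative variant, not the parsing scheme of the Jagadish paper: it assumes fixed-length pieces of q characters that overlap their predecessor by q−1 characters, and it approximates each conditional probability as count(piece) / count(overlap). The function name, the `counts` dictionary, and the numbers are hypothetical.

```python
def markov_estimate(s, counts, n_rows, q):
    """Estimate the selectivity of `A like % s %` from counts of
    q-character and (q-1)-character substrings.

    Raises KeyError if a required substring is missing from `counts`.
    """
    if q < 2:
        raise ValueError("q must be at least 2")
    if len(s) <= q:
        return counts[s] / n_rows  # short strings are looked up directly
    est = counts[s[:q]] / n_rows   # selectivity of the first piece
    for i in range(1, len(s) - q + 1):
        piece = s[i:i + q]
        overlap = piece[:-1]       # characters shared with the previous piece
        est *= counts[piece] / counts[overlap]
    return est


# Hypothetical counts of attribute values containing each substring.
counts = {"ab": 100, "bc": 50, "b": 200}
markov = markov_estimate("abc", counts, n_rows=1000, q=2)
independent = (counts["ab"] / 1000) * (counts["bc"] / 1000)
```

On this toy data the Markov-style estimate (about 0.025) is larger than the independence product (about 0.005), because the shared character "b" is not counted twice.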
The paper Jagadish et al., Multi-dimensional substring selectivity estimation, Proc. of the 25th Intl. Conf. on Very Large Data Bases, pp. 387–398, 1999 discloses adapting the methods disclosed in the Krishnan paper and the Jagadish substring selectivity estimation paper to multi-attribute string predicate estimation by constructing one suffix tree per attribute. The paper Chen et al., Selectivity estimation for Boolean queries, Proc. of the 19th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 216–225, 2000 (herein “Chen paper”) discloses estimating selectivities of Boolean queries involving string predicates potentially over multiple attributes. The Chen paper also enhances the pruned suffix trees by maintaining a summary vector with each node. The summary vector of a node represents a “signature” of all tuples that contain the node's associated string as a substring. These summary vectors can be used to combine selectivity estimates of individual terms in a Boolean query predicate.
The paper Aboulnaga et al., Estimating the Selectivity of XML path expressions for internet scale applications, Proc. of the 27th International Conference on Very Large Data Bases, pp. 591–600, 2001 (herein “the Aboulnaga paper”) discloses using Markov tables over XML tag sequences as the summary structure for the problem of estimating the selectivity of simple XML path expressions consisting of XML tags. A Markov table of XML tags for an XML data set records the selectivity of all possible sequences of tags of length not exceeding a pre-specified constant q. The value of the constant q determines the amount of space required to store the Markov table. The Aboulnaga paper also proposes techniques for pruning the Markov tables so that they do not require more than some given amount of space. The paper Lim et al., An on-line self tuning Markov histogram for XML path selectivity estimation, Proc. of the 28th International Conference on Very Large Data Bases, pp. 442–453, 2002 discloses improving the pruning of the Markov tables by retaining the selectivity of substrings that are frequently used in a representative workload.
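How the constant q bounds the contents of a Markov table can be sketched as follows. This is an illustration under stated assumptions, not the Aboulnaga paper's implementation: the XML document is modeled as nested (tag, children) tuples, the table stores raw occurrence counts rather than normalized selectivities, no pruning is applied, and the function name and sample document are hypothetical.

```python
from collections import Counter


def build_markov_table(tree, q):
    """Count every downward tag sequence of length 1..q in the tree.

    `tree` is a (tag, children) tuple; children is a list of such tuples.
    """
    table = Counter()

    def visit(node, ancestors):
        tag, children = node
        path = (ancestors + (tag,))[-q:]  # keep at most the last q tags
        for k in range(1, len(path) + 1):
            table[path[-k:]] += 1         # sequence of length k ending here
        for child in children:
            visit(child, path)

    visit(tree, ())
    return table


# Hypothetical document: A has children B, B, D; the B elements have C children.
doc = ("A", [("B", [("C", []), ("C", [])]),
             ("B", [("C", [])]),
             ("D", [])])
table_q2 = build_markov_table(doc, q=2)
table_q3 = build_markov_table(doc, q=3)
```

Raising q from 2 to 3 adds the length-3 sequence A/B/C to the table, which illustrates how a larger q increases the space the table requires.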
There is a need for a selectivity estimation technique that overcomes the underestimation problem associated with existing selectivity estimation techniques.