XML is an example of a database of strings associated with attributes. In native database management systems (“DBMSs”) for XML documents, for example, the documents are stored as trees, and queries are processed using either indexes or tree traversals. Nodes in an XML tree are often located using paths.
A path is a sequence of tag names that specifies a navigational trajectory to a set of nodes in the XML tree. A rooted path is one that begins at the root node of the XML tree. For efficient processing, complex path expressions in XML queries are often preprocessed into a set of candidate (path, pred) pairs, where “path” is a linear rooted path and “pred” is a string predicate on the leaf value reachable via path. Consequently, an XML query (for example, one written in XQuery) can be mapped to several retrieval operations using (path, pred) pairs. These retrieval operations form the set of basic query processing operators. Accurate estimation of the selectivity of such (path, pred) pairs is therefore crucial for choosing an optimal execution plan in cost-based query optimization. Previous work on XML selectivity estimation has focused mainly on the tag-labeled paths (tree structure) of the XML data. Examples of this previous work include A. Aboulnaga, et al., “Estimating the selectivity of XML path expressions for internet scale applications,” VLDB 2001, pp. 591-600 (2001), and N. Polyzotis, et al., “Statistical synopses for graph-structured XML databases,” SIGMOD 2002, pp. 358-369 (2002).
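By way of illustration only, the notion of a linear rooted path and the leaf value reachable via that path can be sketched in Python as follows. The sample document, the function name rooted_path_strings, and the slash-separated path notation are assumptions made for this sketch and do not appear in the cited work; element attributes and mixed content are ignored for simplicity.

```python
import xml.etree.ElementTree as ET


def rooted_path_strings(xml_text):
    """Yield (rooted_path, leaf_string) for every leaf element that has text.

    Hypothetical helper for illustration: the rooted path is written as
    the tag names from the root to the leaf, joined by "/".
    """
    root = ET.fromstring(xml_text)
    stack = [(root, "/" + root.tag)]
    while stack:
        node, path = stack.pop()
        children = list(node)
        if not children:
            text = (node.text or "").strip()
            if text:
                yield path, text
        # Push children in reverse so they are popped in document order.
        for child in reversed(children):
            stack.append((child, path + "/" + child.tag))


# Made-up sample data, not taken from any real data set.
DOC = """<dblp>
  <book><title>Database Systems</title><year>2001</year></book>
  <book><title>XML Databases</title><year>2002</year></book>
  <article><title>XML Databases</title></article>
</dblp>"""

pairs = list(rooted_path_strings(DOC))
distinct_paths = sorted({path for path, _ in pairs})
```

In this sketch the same string ("XML Databases") is reachable via two different rooted paths, which is why a predicate must be evaluated together with its path rather than on the string alone.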
However, for most real-world attribute-string databases, such as XML databases, the number of distinct string values at the leaf nodes is orders of magnitude larger than the number of distinct rooted tag paths. A significant challenge therefore exists in accurately estimating the selectivity of string predicates on the leaf values reachable via a given path.
Using XML as an example, the XML string selectivity estimation problem is defined as follows: given a (path, pred) query, where pred is a string predicate, estimate the number of nodes in the XML data that are reachable by path and whose associated string values satisfy the string predicate pred. Examples of queries with different string predicates include exact match queries and substring queries. An exact match query is specified by a (path, string) pair and retrieves all nodes reachable via path whose string value matches the query string exactly. A substring query is specified by a (path, substring) pair and retrieves all nodes reachable via path whose string value contains the query substring.
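The quantity to be estimated can be made concrete with a brute-force count over path-string pairs. The following Python sketch computes the true selectivity of exact match and substring queries by scanning all of the data; it is not an estimation method, only the reference value that an estimator would approximate without a full scan. The names true_selectivity, exact_match, and substring_match, as well as the sample data, are assumptions made for this illustration.

```python
from typing import Callable, Iterable, Tuple

PathString = Tuple[str, str]  # (rooted path, leaf string value)


def true_selectivity(data: Iterable[PathString],
                     path: str,
                     pred: Callable[[str], bool]) -> int:
    """Count nodes reachable via `path` whose string value satisfies `pred`."""
    return sum(1 for p, s in data if p == path and pred(s))


def exact_match(query: str) -> Callable[[str], bool]:
    """Predicate: the string value equals the query string exactly."""
    return lambda s: s == query


def substring_match(query: str) -> Callable[[str], bool]:
    """Predicate: the string value contains the query substring."""
    return lambda s: query in s


# Made-up sample data, one entry per leaf node.
DATA = [
    ("/dblp/book/title", "Database Systems"),
    ("/dblp/book/year", "2001"),
    ("/dblp/book/title", "XML Databases"),
    ("/dblp/book/year", "2002"),
    ("/dblp/article/title", "XML Databases"),
]

exact_count = true_selectivity(DATA, "/dblp/book/title", exact_match("XML Databases"))
substring_count = true_selectivity(DATA, "/dblp/book/title", substring_match("Database"))
```

With this sample data, the exact match query counts one node and the substring query counts two, because both book titles contain "Database". The article title is excluded in both cases because it is reachable via a different path.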
The present invention provides a novel method for collecting and maintaining statistics to support selectivity estimation for both exact (full) string match queries and substring match queries over any database of documents arranged as data strings associated with any number of attributes. In the case of XML, the data consists of path-string pairs, where each string is associated with exactly one attribute, namely a path. In other cases, each string can be associated with any number (including zero) of attributes.