1. Field of the Invention
The present invention is directed to the field of database management, and, more specifically, to using a compressed data structure to estimate the amount of data processed by a query.
2. Description of the Prior Art
Prior to executing a query, a database management system (DBMS) may determine a “plan” for executing the query in the most efficient manner. To determine the plan, the DBMS estimates the amount of data that will be processed by the query at each stage of the execution. To make such an estimate, the DBMS may use a data structure referred to as a “trie.” The trie is a model of a set of strings stored in a collection of data such as, for example, a relational data table. The trie enables the DBMS to quickly determine the number of strings in the collection of data that match a like predicate in a query.
An exemplary conventional trie is shown in FIG. 1. The exemplary trie of FIG. 1 includes the following strings: apple, apply, applying, seated, and seating. As shown, the top node 110 in trie 100, which may be referred to as the “root” node, is empty. The remaining nodes each include a single character. A square node identifies the last letter in each string. Each string is formed by tracing a path from the root node to the corresponding square node and concatenating the characters stored in the nodes along that path.
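The single-character structure described above can be sketched as follows. This is a minimal illustration only, not the structure of FIG. 1 itself: the names `CharNode`, `build_char_trie`, `count_nodes`, and `stored_strings` are hypothetical, and the `terminal` flag is assumed to play the role of the square nodes.

```python
class CharNode:
    """One node of a conventional trie: a single character per edge (hypothetical name)."""

    def __init__(self):
        self.children = {}      # maps a character to the child node holding it
        self.terminal = False   # assumed equivalent of a "square" node


def build_char_trie(strings):
    """Insert every string one character at a time, starting from an empty root."""
    root = CharNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, CharNode())
        node.terminal = True    # last letter of the string
    return root


def count_nodes(node):
    """Count this node plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


def stored_strings(node, prefix=""):
    """Rebuild each string by concatenating characters along root-to-terminal paths."""
    found = [prefix] if node.terminal else []
    for ch, child in sorted(node.children.items()):
        found.extend(stored_strings(child, prefix + ch))
    return found


root = build_char_trie(["apple", "apply", "applying", "seated", "seating"])
print(count_nodes(root))      # root plus one node per distinct character position
print(stored_strings(root))
```

For the five exemplary strings, this sketch allocates one empty root node plus eighteen single-character nodes.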
A conventional trie such as trie 100 of FIG. 1 has several drawbacks. Because each node in the trie includes only a single character, the trie may include a large number of nodes that occupy a large amount of memory. Furthermore, character-by-character matching may require a lot of time to perform, thereby delaying query execution. Another drawback is that repetitive suffixes such as “ing”, which is a suffix in both “applying” and “seating”, are identified in the trie multiple times. Such suffix repetition increases the amount of memory required to store the trie and increases the time required to perform matching. Thus, there is a need in the art for a “compressed” trie, in which multiple characters may be stored in a single node. Furthermore, it is desired that repetitive suffixes be identified and eliminated from such a compressed trie.
Accordingly, systems and methods for using a compressed trie to estimate like predicates are described. A compressed trie in accordance with the present invention has nodes including multiple character sub-strings. Such multiple character storage reduces the number of nodes in the trie, thereby reducing the amount of memory required for storing the trie and reducing the amount of time required to perform matching. Furthermore, in such a compressed trie, sub-strings are stored in a single character string. Each node references its corresponding sub-string by the sub-string's starting position and length in the character string. Multiple nodes may reference a single sub-string. Thus, referencing rather than storing sub-strings in corresponding nodes eliminates repetitive sub-string storage, thereby reducing the amount of memory required for storing the trie.
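The referencing scheme described above may be illustrated with a short sketch. The names `SubstringPool`, `Node`, and `add_child` are hypothetical, and the reuse rule (search the character string for an existing occurrence before appending) is an assumption made for illustration; the text does not specify how a shared sub-string is located.

```python
from dataclasses import dataclass, field


class SubstringPool:
    """The single character string shared by all nodes (hypothetical helper)."""

    def __init__(self):
        self.chars = ""

    def intern(self, sub):
        """Return (start, length) for sub, appending only if it is not already present.

        Assumption: any existing occurrence in the character string may be reused.
        """
        start = self.chars.find(sub)
        if start == -1:
            start = len(self.chars)
            self.chars += sub
        return start, len(sub)

    def text(self, start, length):
        """Recover a sub-string from its starting position and length."""
        return self.chars[start:start + length]


@dataclass
class Node:
    start: int = 0      # starting position of the sub-string in the pool
    length: int = 0     # length of the sub-string (0 for the empty root)
    children: list = field(default_factory=list)


pool = SubstringPool()


def add_child(parent, sub):
    """Create a node that references sub by position and length, and attach it."""
    start, length = pool.intern(sub)
    child = Node(start, length)
    parent.children.append(child)
    return child


# Hand-built compressed trie for: apple, apply, applying, seated, seating.
root = Node()
appl = add_child(root, "appl")
add_child(appl, "e")
y = add_child(appl, "y")
ing_a = add_child(y, "ing")
seat = add_child(root, "seat")
add_child(seat, "ed")
ing_s = add_child(seat, "ing")   # reuses the "ing" already in the pool

print(pool.chars)
print((ing_a.start, ing_a.length), (ing_s.start, ing_s.length))
```

In this sketch both “ing” nodes carry the same starting position and length, so the suffix occupies the character string only once, and seven multi-character nodes replace the eighteen single-character nodes of the conventional trie.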
An exemplary embodiment of the present invention enables a string to be inserted into the trie. The string is assigned to one or more nodes in the trie by dividing the string into one or more sub-strings and assigning each sub-string to a corresponding node. Each sub-string is then added to a character string, in which each sub-string is preferably identified by a starting position and a length. The starting position and length of each sub-string are then stored at its corresponding node.
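One possible reading of the insertion described above is sketched below. It is a sketch under stated assumptions rather than the claimed method: the class names `Pool`, `Node`, and `CompressedTrie` are hypothetical, a standard radix-tree split is assumed as the way a string is divided into sub-strings, and `count` is assumed to record how many inserted strings end at a node.

```python
class Pool:
    """Single character string referenced by all nodes (hypothetical helper)."""

    def __init__(self):
        self.chars = ""

    def intern(self, sub):
        # Assumption: reuse any existing occurrence before appending.
        start = self.chars.find(sub)
        if start == -1:
            start = len(self.chars)
            self.chars += sub
        return start, len(sub)


class Node:
    def __init__(self, start=0, length=0):
        self.start = start      # starting position in the character string
        self.length = length    # length of the referenced sub-string
        self.children = []
        self.count = 0          # assumed: number of inserted strings ending here


class CompressedTrie:
    def __init__(self):
        self.pool = Pool()
        self.root = Node()

    def label(self, node):
        return self.pool.chars[node.start:node.start + node.length]

    def insert(self, s):
        node = self.root
        while s:
            child = next((c for c in node.children if self.label(c)[0] == s[0]), None)
            if child is None:
                # No child shares the first character: the rest becomes one new node.
                child = Node(*self.pool.intern(s))
                node.children.append(child)
                node = child
                break
            lab = self.label(child)
            k = 0
            while k < min(len(lab), len(s)) and lab[k] == s[k]:
                k += 1
            if k < len(lab):
                # Split: the child keeps the shared prefix; a tail node takes the rest.
                # Both halves reference the same stored characters, so nothing is copied.
                tail = Node(child.start + k, len(lab) - k)
                tail.children, tail.count = child.children, child.count
                child.length, child.children, child.count = k, [tail], 0
            node = child
            s = s[k:]
        node.count += 1

    def items(self, node=None, prefix=""):
        """Yield (string, count) for every string that ends at some node."""
        node = self.root if node is None else node
        prefix += self.label(node)
        if node.count:
            yield prefix, node.count
        for child in node.children:
            yield from self.items(child, prefix)


trie = CompressedTrie()
for word in ["apple", "apply", "applying", "seated", "seating"]:
    trie.insert(word)
print(trie.pool.chars)
print(dict(trie.items()))
```

Note that in this sketch a split never adds characters to the character string: the two resulting nodes simply reference adjacent ranges of the sub-string that was already stored.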
Another exemplary embodiment of the present invention enables the trie to be used to estimate the number of rows in a data table that match a like predicate. Beginning at a root node, the nodes in the trie are examined to determine if they match the like predicate. After examination, the counts of occurrences at each matching node are accumulated to determine a non-scaled estimate. The non-scaled estimate is then scaled according to the portion of the table that is represented in the trie.
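The estimation described above may be sketched as follows. Several details are assumptions made only for illustration: sub-strings are stored inline in each node for brevity, only predicates of the form `prefix%` are handled, each node's `count` is taken to be the number of sampled rows whose string ends at that node, and the scaling is taken to be linear in the ratio of table rows to sampled rows. The names `Node`, `matched_count`, and `estimate_rows` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str = ""      # sub-string for this node (stored inline for brevity)
    count: int = 0      # assumed: sampled rows whose string ends at this node
    children: list = field(default_factory=list)


def matched_count(node, remaining):
    """Accumulate counts of nodes whose root-to-node string starts with the prefix.

    `remaining` is the part of the prefix not yet matched above this node.
    """
    n = min(len(node.text), len(remaining))
    if node.text[:n] != remaining[:n]:
        return 0                        # this subtree cannot match
    remaining = remaining[n:]
    total = node.count if remaining == "" else 0
    return total + sum(matched_count(child, remaining) for child in node.children)


def estimate_rows(root, pattern, sampled_rows, table_rows):
    """Scale the accumulated (non-scaled) count up to the full table.

    Assumption: linear scaling by table_rows / sampled_rows.
    """
    if not pattern.endswith("%") or "%" in pattern[:-1] or "_" in pattern:
        raise ValueError("this sketch handles only 'prefix%' patterns")
    non_scaled = matched_count(root, pattern[:-1])
    return non_scaled * table_rows / sampled_rows


# Hypothetical trie built from a 10-row sample of a 1000-row table.
root = Node(children=[
    Node("appl", 0, [Node("e", 3), Node("y", 2, [Node("ing", 1)])]),
    Node("seat", 0, [Node("ed", 3), Node("ing", 1)]),
])

print(estimate_rows(root, "app%", sampled_rows=10, table_rows=1000))
print(estimate_rows(root, "apply%", sampled_rows=10, table_rows=1000))
```

With the hypothetical counts shown, the predicate `app%` accumulates a non-scaled estimate of 6 sampled rows, which scales to 600 rows of the full table.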