For decades, computer programmers and scientists have been discovering new and more efficient ways to maintain databases and to search their contents. Standard techniques for conducting searches include linear searching, hashing, and binary search trees.
The linear search technique is the most basic method for conducting a search of keys or strings (the term "string" is used to describe a group of alphanumeric or binary characters). This straightforward method loops over all the existing strings, which are usually organized as an array or a linear linked list, comparing each string $K_i$ with the requested string $K$. A linear search may be represented by the following pseudo code:
    FOR i = 1 TO n DO
        IF K_i = requested string K THEN RETURN I(K_i)
    END DO
where $I(K)$ is the information record associated with string $K$. The running time of such a search, using O-notation (a function $f(n)$ is said to be $O(g(n))$ if there exist a constant factor $c$ and a number $n_0$ such that $f(n) \le c \cdot g(n)$ for all $n \ge n_0$), is:
$$T_{\text{linear}}(n) = O(n)$$
Thus, the running time of a linear search grows proportionally with the number of strings n.
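The loop described above can be sketched in Python as follows. This is only a minimal illustration: the table layout (a list of key/record pairs) and the names `linear_search` and `table` are assumptions made for the example and do not appear in the original text.

```python
from typing import Optional, Sequence, Tuple


def linear_search(table: Sequence[Tuple[str, str]], key: str) -> Optional[str]:
    """Return the record stored with `key`, or None if the key is absent."""
    for stored_key, record in table:   # visit K_1 .. K_n in order
        if stored_key == key:          # compare K_i with the requested K
            return record              # I(K_i)
    return None                        # key not present


# Hypothetical sample data, for illustration only.
table = [("apple", "fruit"), ("beet", "vegetable"), ("cod", "fish")]
print(linear_search(table, "beet"))
```

In the worst case (the key is last or missing) every one of the n entries is compared once, which is where the O(n) bound comes from.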
Because of its simplicity, a linear search has less overhead processing per character comparison than more advanced searching methods. Generally speaking, a linear search is faster than other searching methods when the set of strings to be searched is small. Therefore, and for ease and robustness of implementation, linear searches are often preferred for small scale searches, where n < 100.
A disadvantage of linear searching is the linear dependency on the number of entries, or strings, n. The linear search method becomes impractical for high performance applications as n grows. For example, searching a 10,000-entry table will be roughly 1,000 times slower than searching a 10-entry table.
Another popular technique for searching is hashing. Hashing computes the address or location of a given string directly from the string's binary representation, rather than by exhaustively searching all strings as in the linear search method. A hash search is often a two-step process, wherein the hash function H returns an index referring to a small list of strings that are searched using the linear search method. A hash search may be represented by the following algorithm in pseudo code:
    i = H(K)
    RETURN T(i)
where the hash function H computes the index of the given string, returning index i. Index i may refer to the matching string alone, or may refer to a list of strings that must then be searched linearly for the matching string.
A commonly used hash function H organizes or indexes the strings into a hash table T utilizing the following formula:
$$H(K) = (b_1 + b_2 + \cdots + b_m) \bmod M$$
where $b_1, \ldots, b_m$ are the bytes (ASCII codes) of the key string $K$ and $M$ denotes the size of hash table T. Obviously, the hash function H cannot in general be guaranteed to be unique for each string. When two or more different strings result in the same hash value ($H(K_i) = H(K_j)$ for $K_i \ne K_j$), it is called a collision case. The most common way to deal with a collision case is to maintain, for each hash table entry, a list of the strings with identical hash values. Finding the requested entry then requires searching this collision list, usually by the linear search method.
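A minimal Python sketch of the byte-sum hash with collision lists is given below. The table size of 8, the 0-based indexing (the text counts indices from 1), and the names `byte_sum_hash`, `make_table`, `insert` and `lookup` are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Bucket = List[Tuple[str, str]]  # one collision list of (key, record) pairs


def byte_sum_hash(key: str, m: int) -> int:
    """H(K): sum of the key's ASCII byte values, modulo the table size m."""
    return sum(key.encode("ascii")) % m


def make_table(m: int) -> List[Bucket]:
    """Create a hash table with m empty collision lists."""
    return [[] for _ in range(m)]


def insert(table: List[Bucket], key: str, record: str) -> None:
    """Store (key, record), replacing any existing entry for the same key."""
    bucket = table[byte_sum_hash(key, len(table))]
    for i, (stored_key, _) in enumerate(bucket):
        if stored_key == key:
            bucket[i] = (key, record)
            return
    bucket.append((key, record))


def lookup(table: List[Bucket], key: str) -> Optional[str]:
    """Step 1: compute the index. Step 2: scan that collision list linearly."""
    for stored_key, record in table[byte_sum_hash(key, len(table))]:
        if stored_key == key:
            return record
    return None


table = make_table(8)       # M = 8 is an arbitrary illustrative size
insert(table, "ab", "first")
insert(table, "ba", "second")  # same bytes as "ab", so the same index
print(lookup(table, "ab"), lookup(table, "ba"))
```

Because "ab" and "ba" contain the same bytes, they form a collision case under this hash function, and the lookup has to compare keys inside the shared list to tell them apart.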
In general, the running time of hashing depends on the average length of the collision lists, which in turn depends on the distribution of the hash function H as well as the hash table size M. Assuming hash function H has a nearly perfect distribution (i.e., the probability for any key string $K$ to be scattered to index $i = H(K)$ is equally likely for all $i = 1, \ldots, M$), it can be shown that the average running time of hashing will be
$$T_{\text{hash}}(n) = O(n/M)$$
The result is a running time that is nearly constant for sufficiently large hash table sizes M > n. Therefore, in theory, the running time could be expected to be nearly independent of n, provided that a perfect hash function (an unrealistic expectation in the majority of real-world applications) could be used.
A disadvantage of hashing is that the inherent need for resolving collision cases requires an additional, and sometimes lengthy, search to take place. Although the average search utilizing the hash technique may be quick, the actual length of time to complete a search may be considerably worse for certain string distributions. In the worst case, all the strings happen to end up in the same index, and the overall performance of the search will be no better than for a linear search. Therefore, in practice, finding an efficient hash function for a real-world application is a difficult task, and is significantly dependent on the actual probability distribution of the strings to be indexed.
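The worst case can be made concrete with a short Python sketch. It assumes a byte-sum hash (sum of the ASCII codes of the key, modulo the table size); the key set (all anagrams of "abcd") and the table size of 64 are chosen purely for illustration.

```python
from collections import Counter
from itertools import permutations


def byte_sum_hash(key: str, m: int) -> int:
    """Sum of the key's ASCII byte values, modulo the table size m."""
    return sum(key.encode("ascii")) % m


# 24 distinct keys that all consist of the same four bytes.
keys = ["".join(p) for p in permutations("abcd")]
M = 64  # table is larger than the number of keys

# Count how many keys land on each index.
bucket_sizes = Counter(byte_sum_hash(k, M) for k in keys)
longest = max(bucket_sizes.values())

print(len(keys), len(bucket_sizes), longest)  # prints: 24 1 24
```

Although the table has more slots than keys, every key lands on the same index, so a lookup degenerates into a linear search over all 24 entries. This is the kind of dependence on the actual string distribution described above.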
Another disadvantage of hashing is that it does not lend itself to wildcard searches. A wildcard search is one where one or more characters in the search string is a wildcard character (i.e., a character that can replace any other character). A wildcard search often returns multiple matching strings.
Another technique for searching is the search tree method. Before this type of search is performed, a search tree must be created to organize the data on which searches are to be performed. A variety of search tree implementations have been proposed, among which one of the most basic is the binary search tree, which is defined as follows.
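A (binary) tree T over a set of strings $K_1, \ldots, K_n$ (represented by the tree nodes) is called a search tree if, for every sub-node T, the following condition holds:
$\text{value}(T_l) < \text{value}(T_r)$ for every descendant node $T_l$ in the left subtree of T and every node $T_r$ in the right subtree of T, where value(T) is the string value associated with the tree node T.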
Thus, the basic search procedure can be (recursively) formulated in pseudo code as follows (as usual, K denotes the string which is being searched and special cases like non-existent keys are omitted for simplicity):
    PROCEDURE TREE-SEARCH(T,K)
        IF K = value(T) THEN
            RETURN Information associated with T
        ELSE
            IF K < value(T) THEN
                TREE-SEARCH(left-subtree(T),K)
            ELSE
                TREE-SEARCH(right-subtree(T),K)
            ENDIF
        ENDIF
The search tree method outlined above executes a depth-first tree traversal, resulting in a running time that grows proportionally to the depth of the tree. Consequently, given an adequately balanced search tree (i.e., one whose leaf nodes are essentially evenly distributed), the average running time is
$$T_{\text{tree}} = O(\log_2 n)$$
Thus, in theory, the average running time grows logarithmically with the number of entries or strings. This is a substantial improvement over linear searches when searching through a large number of entries.
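A recursive binary search tree lookup can be sketched in Python as follows. The `Node` class, the `tree_search` name and the three-node sample tree are assumptions for illustration. Unlike the simplified procedure in the text, this sketch also handles the non-existent key case by returning None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: str                          # value(T)
    record: str                       # information associated with T
    left: Optional["Node"] = None     # subtree with smaller keys
    right: Optional["Node"] = None    # subtree with larger keys


def tree_search(node: Optional[Node], key: str) -> Optional[str]:
    """Descend from the root, going left or right at each node."""
    if node is None:
        return None                   # ran off the tree: key is absent
    if key == node.key:
        return node.record
    if key < node.key:
        return tree_search(node.left, key)
    return tree_search(node.right, key)


# Hypothetical, hand-built balanced tree.
root = Node("m", "record-m", Node("d", "record-d"), Node("t", "record-t"))
print(tree_search(root, "t"))
```

Each call discards one of the two subtrees, so the number of comparisons is bounded by the depth of the tree rather than by the number of entries.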
A disadvantage of the tree search method is that under field conditions, the running time may vary greatly because in practice search trees are rarely balanced. The tree's balancing properties are heavily dependent upon the actual string distribution. More sophisticated methods, such as AVL-trees, described in Van Wyk, Christopher J., Data Structures and C Programs, Addison-Wesley Publishing, 1988, have been invented to minimize this problem. Such methods, however, tend also to increase the implementation overhead for tree structure administration. In practice, and depending on the actual implementation, tree-based searches rarely outperform simple linear searches unless the number of entries exceeds a break-even point of several hundred entries.
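The effect of insertion order on tree depth can be illustrated with a small Python sketch. It uses a plain, unbalanced binary search tree (not an AVL-tree), and all names (`Node`, `insert`, `depth`, `median_first`) and the 127 generated keys are assumptions made for the example.

```python
from typing import List, Optional


class Node:
    def __init__(self, key: str) -> None:
        self.key = key
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None


def insert(root: Optional[Node], key: str) -> Node:
    """Plain (non-rebalancing) iterative insertion; returns the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def depth(root: Optional[Node]) -> int:
    """Number of nodes on the longest root-to-leaf path."""
    best, stack = 0, [(root, 1)]
    while stack:
        node, d = stack.pop()
        if node is not None:
            best = max(best, d)
            stack.extend([(node.left, d + 1), (node.right, d + 1)])
    return best


def median_first(keys: List[str]) -> List[str]:
    """Reorder sorted keys so that each middle element is inserted first."""
    if not keys:
        return []
    mid = len(keys) // 2
    return [keys[mid]] + median_first(keys[:mid]) + median_first(keys[mid + 1:])


keys = [f"key{i:03d}" for i in range(127)]  # already in sorted order

sorted_tree = None
for k in keys:
    sorted_tree = insert(sorted_tree, k)

balanced_tree = None
for k in median_first(keys):
    balanced_tree = insert(balanced_tree, k)

print(depth(sorted_tree), depth(balanced_tree))  # prints: 127 7
```

The same 127 keys produce a tree of depth 127 when inserted in sorted order, which is effectively a linked list, and a tree of depth 7 when inserted median first. This is why the running time of an unbalanced search tree depends so heavily on the actual string distribution.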
The present invention overcomes the foregoing problems by providing a method for fast indexing and retrieval of alphanumeric or binary strings that supports both generic indexing and partial match queries. The invention utilizes a unique compacted index tree wherein a node may be used to step through a plurality of characters in a search string, and can have as many successor nodes as there are individual characters in the underlying string alphabet (e.g., 256 for 8-bit based characters). Furthermore, the invention uses backtracking when a plurality of subtrees needs to be searched for possible multiple matching strings during a wildcard search. The invention also takes advantage of heuristic subtree pruning, a method by which partial match searches may be accelerated by discarding whole subtrees.
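To give a rough feel for wildcard searching with backtracking over a character tree, the following Python sketch uses an ordinary, uncompacted trie with one character per node. It is not the compacted index tree of the invention and implements no heuristic subtree pruning. The wildcard symbol "?" (matching exactly one character) and the names `TrieNode`, `insert` and `wildcard_search` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

WILDCARD = "?"  # assumed symbol: matches any single character


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}  # one successor per character
        self.terminal = False                      # True if a key ends here
        self.record: Optional[str] = None


def insert(root: TrieNode, key: str, record: str) -> None:
    """Add key to the trie, creating one node per character as needed."""
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.terminal = True
    node.record = record


def wildcard_search(node: TrieNode, pattern: str, pos: int = 0,
                    prefix: str = "") -> List[Tuple[str, Optional[str]]]:
    """Return all (key, record) pairs whose key matches the pattern."""
    if pos == len(pattern):
        return [(prefix, node.record)] if node.terminal else []
    ch = pattern[pos]
    if ch == WILDCARD:
        # Try every subtree in turn; returning from a failed subtree
        # and moving on to the next one is the backtracking step.
        matches: List[Tuple[str, Optional[str]]] = []
        for c in sorted(node.children):
            matches.extend(
                wildcard_search(node.children[c], pattern, pos + 1, prefix + c))
        return matches
    child = node.children.get(ch)
    if child is None:
        return []  # no successor for this character: abandon the branch
    return wildcard_search(child, pattern, pos + 1, prefix + ch)


root = TrieNode()
for word in ["cat", "cot", "cut", "car", "dog"]:
    insert(root, word, "record-" + word)

print([key for key, _ in wildcard_search(root, "c?t")])
```

A wildcard position fans the search out into several subtrees and can therefore yield multiple matching strings, while an ordinary character follows at most one successor.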
Although the present invention is contemplated for use with strings, such a search method may be used to search any digital information stored in a database.