The present invention generally relates to the field of Structured Query Language (SQL) and in particular to a method for significantly improving the performance of executing pattern matching queries in which the special characters used in the pattern are limited to single and multi-character wild cards, rather than full regular expressions.
Structured Query Language (SQL) is a widely used language for accessing data stored in relational databases. To allow the selective retrieval of data, SQL provides predicates that can be applied to data to determine which records are retrieved. For example, the following SQL query uses a comparison predicate to select those employee records for any employee whose last name is xe2x80x9cSmithxe2x80x9d:
Select*From employee Where last_name=xe2x80x98Smithxe2x80x99
One particularly powerful predicate found in all SQL implementations is the LIKE predicate, which determines whether string-valued fields in the database conform to a particular pattern:
Select*From employee Where last_name Like xe2x80x98% sonxe2x80x99
The pattern, represented in the above query by xe2x80x98% sonxe2x80x99 describes a set consisting of all strings ending in xe2x80x9csonxe2x80x9d. The multi-character wild card xe2x80x9c%xe2x80x9d matches any sequence of characters, including the empty string. An underscore xe2x80x9c_xe2x80x9d) can be used to match a single arbitrary character. Aside from these two xe2x80x9cwild cardxe2x80x9d metacharacters, characters occurring in the pattern are matched only by the same character in the string being tested against the pattern.
The pattern language defined by the LIKE predicate is a subset of a notation known as regular expressions. In addition to single- and multi-character wild card matching, regular expressions may contain syntax specifying grouping, choice, and repetition of subpatterns. Regular expressions are commonly used in performing the lexical analysis phase of compilation of computer program source code, breaking the input stream into distinct entities called tokens that are input to a parser for the language.
The most efficient solutions to this problem employ finite automata to represent the patterns for the token types used in the language. There are two different types of automata that may be used to represent regular expressions. A deterministic finite automaton (DFA) has the defining property that for any given input symbol in any given state, there is at most one transition to a new state that can be made. In a nondeterministic finite automaton (NFA), it is possible that there are two or more possible transitions from a given state for a given input symbol. Simulation of NFAs is difficult due to the potential ambiguity of nondeterministic state transitions, so DFAs are generally used as recognizers for regular expressions. As each symbol is read, a state transition is made based on the input symbol and the current state. The most efficient transition function involves a simple table lookup, where the table contains the transition state for each symbol/state pair. This requires a significant amount of memory for single-octet character sets such as ASCII, and a prohibitive amount for multi-octet character sets such as Unicode. Sparse matrix techniques can be employed to reduce the storage requirements of the transition table, but this results in reduced efficiency for the transition function.
The restrictive nature of the pattern language used with SQL""s LIKE predicate makes it possible to derive a simpler, more efficient algorithm that does not require construction of automata and tracking of state changes for each character processed during evaluation of the predicate. While suitable for evaluation of fully general patterns based on unrestricted regular expressions, automata-based algorithms are overkill for the LIKE predicate. The present invention achieves significantly accelerated performance for evaluation of patterns such as those permitted by the LIKE predicate, which utilize only wild card operators (single-character and multi-character).
One prior art method to which the method of the present invention generally relates is described in U.S. Pat. No. 5,926,652 entitled MATCHING OF WILD CARD PATTERNS TO WILD CARD STRINGS ASSOCIATED WITH NAMED COMPUTER OBJECTS. The prior art method of matching computer wild card patterns involves comparing first and second character strings associated with named objects residing on a computer system to determine whether the first character string defines a first group of computer objects which is a logical subset of a second group of computer objects defined by the second character string. The first character string is provided by a user; such as in a command line, and the second character string is pre-defined, such as by a network administrator. The method can be performed by examining whether the second character string has any wild card character which can substitute for one and only one character, or by examining whether the second character string has any wild card character which can substitute for any number of characters, including no characters (a universal character). The second character string can further be examined to see if it contains any embedded sequence of characters having no universal wild card character.
The present invention differs from the prior art in that the prior art method is concerned with matching a pair of patterns to determine whether one is a logical subset of the other. The method of the present invention seeks instead to determine whether a text string (which is a literal value rather than another pattern) is a member of the set of strings derivable from the pattern. That is, it determines whether the string matches the pattern. Unlike the prior art method, the present invention decomposes the pattern into segments of literal text bounded by multi-character wild cards, and employs a string-matching algorithm modified to take single-character wild cards into account to search for the substrings within the target string.
Another prior art method to which the method of the present invention generally relates is detailed in U.S. Pat. No. 6,047,283 entitled FAST STRING SEARCHING AND INDEXING USING A SEARCH TREE HAVING A PLURALITY OF LINKED NODES. This prior art is a fast string indexing method that efficiently stores, searches, and removes alphanumeric or binary strings utilizing a compacted search tree. The number of levels in the search tree is minimized by having a node represent more than one character when possible. Each inner node of the tree contains a hash table array for successive hashing, which also minimizes the time required to traverse a given node. Searches may be performed for partial matches, such as wild cards at the character level. Multiple indices may be opened independently and concurrently on the same table of string entries.
The present invention differs from the prior art in that the prior art method is used as an access method on a database, and is comprised of a search tree that indexes strings so those strings may be looked up quickly in a database search. The intent is to overcome limitations of other access methods such as linear searching, hashing, and conventional search trees. Strings searched for in the search tree described in the prior art method can contain single-character wild cards, but not wild cards that match substrings of arbitrary length as in the present invention. The prior art method finds matching strings by traversing a tree structure containing representations of the strings being searched, while the current invention instead matches strings against a pattern without requiring those strings to be indexed. Another difference is that the present invention uses a string-matching algorithm to match literal substrings between wild cards, whereas the prior art method searches by traversing the nodes of its search tree.
It is therefore an object of the present invention to improve the performance of matching string patterns containing wild card characters.
Still another object of the present invention is to increase the efficiency of database transactions involving SQL queries that use the LIKE predicate for pattern matching.
Still another object of the present invention is to plug in a high-performance string search algorithm of choice in a wild card pattern matching task.
Other objects, features and advantages of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein is shown and described only the preferred embodiment of the invention, simply by way of illustration of the best mode contemplated of carrying out the invention. As will be realized, the invention is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the invention. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive, and what is intended to be protected by Letters Patent is set forth in the appended claims. The present invention will become apparent when taken in conjunction with the following description and attached drawings, wherein like characters indicate like parts, and which drawings form a part of this application.
The method of the present invention is useful in a computer system including at least one client. The program executes a method for matching a pattern string with a target string, where the pattern string can contain single- or multi-character wild cards. The method includes the steps of preprocessing the pattern string into a prefix segment, a suffix segment, and zero or more interior segments. Next, matching the prefix segment, the suffix segment, and the interior segment(s) with the target string.
The present method enables a Client-User to initiate a Structured Query Language (SQL) query to search for strings of characters which match a pre-designated pattern having zero or more wild card characters. By having a network protocol via a file server, then a database can be accessed to retrieve information on matching characters. The method determines whether a text string is a member of a set of strings derivable from the pattern. A string-matching algorithm operates to match literal substrings between wild cards.