String data is ubiquitous. For example, product catalog databases (for books, music, software and the like), electronic white and yellow page directories, and specialized information sources such as patent databases and bibliographic databases, all of which deal with string (text) data, are proliferating on the Internet. Most applications now have a prominent interface that allows string-based querying and searching. A critical requirement in this context is the ability to use a specified substring (referred to as a “query” substring) to find all of its occurrences in a particular database. Sometimes, one may be interested in a prefix (or suffix) match, where the specified substring occurs at the beginning (or, alternatively, the end) of the database string. At other times, one may simply be interested in a substring occurrence irrespective of its location.
The quality of the string information residing in various databases can be degraded for a variety of reasons, including human error (particularly when human data entry methods are used to add information to the database). Moreover, the querying agent may itself make errors in specifying the desired pattern, as would occur with a mis-spelling in a query substring, such as a name. In any event, there are many occasions where a given query pattern does not exactly match the database strings that one would presume to be a “match” but for the mis-spellings or other data entry errors.
As an example, consider a well-known database textbook by Silberschatz, Korth and Sudarshan. One public website has the last author's name mis-spelled as “Sudershan”. Therefore, someone querying this particular database to find all books authored by “Sudarshan” will never find this well-known database textbook. Such an error is not unique. For example, there is a well-known author of books on the subject of theoretical physics with the name “E. C. G. Sudarshan”. The database entries for some of his books have the last name spelled “Sudershan” and others use “Sudarshan”. In any event, a search for “books by the same author” will produce an incomplete listing.
A large body of work has been devoted to the development of efficient main memory solutions to the approximate string matching problem. For two strings of lengths n and m, available in main memory, there exists a dynamic programming algorithm to compute the edit distance of the strings in O(nm) time and space. Improvements to the basic algorithm have appeared, offering better average and worst case running times, as well as more modest space requirements. A different approach is based on the use of deterministic and non-deterministic automata. Although such approaches offer the best worst case running times, they have large space requirements and are relatively difficult to build and maintain.
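The basic dynamic programming computation mentioned above can be illustrated with a minimal sketch. It assumes unit costs for insertion, deletion and substitution (the usual Levenshtein formulation, which the text does not specify), and the function name `edit_distance` is chosen here for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance between a and b, assuming unit-cost
    insertions, deletions and substitutions.

    Uses O(n*m) time and space, where n = len(a) and m = len(b).
    """
    n, m = len(a), len(b)
    # dist[i][j] holds the edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete all i characters of a[:i]
    for j in range(m + 1):
        dist[0][j] = j  # insert all j characters of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[n][m]


print(edit_distance("Sudarshan", "Sudershan"))  # prints 1
```

Under these assumptions the two spellings from the example above differ by a single substitution, so a search that tolerates one error would retrieve both.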
To handle larger text queries in main memory, various approaches have been introduced. Several researchers have reduced the problem of “approximate” string searching to that of “exact” searching, which is well understood. The basic idea is as follows: for a string that occurs in a text with k errors, if the query string is arbitrarily cut into k+1 pieces, then at least one of the pieces will be present in the text with no errors, since k errors can affect at most k of the pieces. An additional approach to reducing the problem of approximate string matching to that of exact string matching is to use all (or some) of the overlapping substrings of length q (defined as “q-grams”). E. Sutinen et al., in the reference “On Using q-gram Locations In Approximate String Matching”, appearing in Proceedings of the ESA, 1995, discuss how to perform a search by examining samples of q-grams separated by a specific number of characters.
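The two reductions described above can be sketched as follows. This is an illustrative sketch only, not the method of any cited reference: the function names are hypothetical, the query is cut into pieces of near-equal length (the text allows an arbitrary cut), and the q-gram sampling scheme of Sutinen et al. is not reproduced.

```python
def split_into_pieces(query: str, k: int) -> list[str]:
    """Cut query into k+1 contiguous pieces of near-equal length.

    If the query is shorter than k+1 characters, some pieces are empty.
    """
    parts = k + 1
    base, extra = divmod(len(query), parts)
    pieces, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        pieces.append(query[start:start + size])
        start += size
    return pieces


def passes_piece_filter(query: str, text: str, k: int) -> bool:
    """Necessary condition for query to occur in text with at most k errors:
    at least one of the k+1 pieces must occur exactly.

    A True result only marks text as a candidate; it still has to be
    verified, for example with an edit distance computation.
    """
    return any(piece in text for piece in split_into_pieces(query, k))


def qgrams(s: str, q: int) -> list[str]:
    """All overlapping substrings of s of length q (empty if len(s) < q)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(split_into_pieces("Sudarshan", 1))                           # ['Sudar', 'shan']
print(passes_piece_filter("Sudarshan", "E. C. G. Sudershan", 1))   # True
print(qgrams("shan", 2))                                           # ['sh', 'ha', 'an']
```

The filter never rejects a true match, but it can admit false candidates, which is why an exact-search step of this kind is normally followed by verification.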
Approximate string matching against data held in secondary storage is a relatively new area. In some approaches, an index is used to store a dictionary, and a main memory algorithm is applied to the dictionary to obtain a set of words to retrieve from the strings in storage. Exact text searching is thereafter applied. These approaches are rather limited in scope, due to the static nature of the dictionary, and are not considered suitable for dynamic environments or when the domain of possible strings is unbounded. Other approaches rely on suffix trees to guide the search for approximate string matches. However, suffix trees impose very large space requirements. Moreover, they are relatively static structures, and are hard to maintain efficiently in secondary storage. Thus, suffix trees are not considered well suited for database applications.
Thus, a need remains in the art to be able to efficiently find all strings approximately containing a given query substring from a large collection of strings.