1. Field of the Invention
The present invention relates generally to the field of document storage and retrieval systems of the type used for multiple document manipulation. Specifically, the invention relates to a method and system for selectively searching and retrieving information from stored documents using a non-literal search strategy employing metric-based or "fuzzy" finite-state non-deterministic automata.
2. Description of Related Art
Electronic storage of documents has facilitated the handling of large volumes of documents, such as those handled by hospitals, universities, government institutions, and the like. Typically, the documents are entered into massive storage systems by various means, including direct typing, receipt of electronic mail, and scanning. Scanning systems often utilize optical character recognition (OCR) that converts text portions of scanned images into electronic data. Stored documents thus may contain mixtures of images, text, and annotations such as key words, and may be stored in various electronic forms. Selective retrieval of information from the stored document set poses significant problems due to the volume of information to be searched.
Existing archival and retrieval systems support a variety of search technologies. These include automatic or user defined indexing, key word annotation, automatic key word extraction, full text search, preprocessed indexing of some or all words or phrases in the text, and both literal and non-literal searches.
Typical existing systems assign an index to each document as it is entered into storage. The index may be a system-generated or a user-defined code associated with each document. The code then is stored together with the document. To retrieve a document, a user must enter the appropriate code associated with the desired document. Other systems use key words in a similar manner. There are many methods for identifying and assigning key words to a document, including direct keyboard entry by the user, interactive selection from the document text by the user, and automated extraction by a search of the document text. Once key words have been assigned to documents, the user may then use them to retrieve a document. Two problems encountered with such systems are that a user (1) can retrieve only entire documents, and (2) must know the index, code, or key words associated with a desired document.
Full text search systems permit users to access selected information from a document set by entering a search term into the system. The system then reads through the entire document set to find an exact match for the entered search term. This has the benefit of locating particular instances of strings within the document text. Full text search systems facilitate features such as proximity searching, where the search expression may contain restrictions on the relative locations of document set text strings that match certain portions of the search expression. The problem encountered with such systems is that each search involves a complete pass across the entire document set text, which makes such searches slow for very large document sets.
Preprocessed, or indexed, search systems typically create tables of words found in the document set text. These tables greatly increase the efficiency of searches over large document sets. For example, in a very simple embodiment, the search is initially performed over the tables, and then only for documents that the tables indicate contain desirable target words. The tables can be sorted and cross-indexed in various standard ways to optimize performance in specific situations.
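By way of illustration only, the word tables described above can be sketched as a simple inverted index. The function names `build_index` and `candidates` and the sample documents are assumptions introduced for this example and do not describe any particular existing system.

```python
from collections import defaultdict


def build_index(documents):
    """Build a table mapping each lowercase word to the ids of documents containing it.

    `documents` is assumed to be a dict of {document id: document text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def candidates(index, term):
    """Consult the table first; only the returned documents need a full scan."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical three-document set.
docs = {
    1: "patient admitted to hospital",
    2: "university records",
    3: "hospital billing records",
}
index = build_index(docs)
print(candidates(index, "hospital"))
```

Note that such a table performs a literal lookup: a misspelled search term finds no entry and therefore yields no candidate documents.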
However, for both full text and indexed search systems, in some instances there may be a mismatch between the search term and the term in the document set. For example, a user may enter a wrong or unintended search term, such as by making a keyboarding or other error when entering the search term. As another example, there may be an error in the original text, OCR, or manually entered key word. Literal search systems that require exact matches are incapable of handling such mismatches between entered search terms and document set text, and would be unable to retrieve a desired document in such cases.
A non-literal, or "fuzzy", search system is capable of handling mismatches. Use of such a system involves entering a text string into a computer system and then searching for a "close" match of that text string in a stored text file. For example, a user may request a search on "recieve" (spelled incorrectly), and the system may find the correctly spelled word "receive". In another example, if the stored text file is obtained from OCR of an optically scanned document, often the OCR system misrecognizes characters that are typographically similar. The letter "O" may be misrecognized as the numeral "0", or the letter pair "rn" may be misrecognized as the single letter "m". In these instances, it would be desirable to retrieve text that is typographically close to the input text string.
Known fuzzy search techniques are not well adapted to the task of finding documents containing words "close" to search terms. For example, a technique described in R. Baeza-Yates and G. Gonnet, "A New Approach to Text Searching", COMMUNICATIONS OF THE ACM 35, 10 (October 1992), 74-82, finds matches between a target word and a search term where the target word contains mismatched characters, but does not describe a technique to successfully handle missing characters, extra characters, or exchanged adjacent characters. A second technique, described in S. Wu and U. Manber, "Fast Text Searching Allowing Errors", COMMUNICATIONS OF THE ACM 35, 10 (October 1992), 83-91, supports only the use of small integer costs associated with mismatched characters, missing characters, or extra characters, thereby severely restricting the ability to fine-tune these costs, as is required when adaptive fine-tuning of the costs is desirable. In addition, their technique supports exchanged adjacent characters only as a combination of a missing and an extra character, so that the cost for exchanged adjacent characters is found only as the sum of the costs for a missing character and an extra character. To perform a fuzzy search, the Wu and Manber technique involves performing a search first for matches with no errors, then with one error, and so forth until sufficient matches are found.
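The four kinds of error discussed above (mismatched, missing, extra, and exchanged adjacent characters) can be made concrete with a conventional dynamic-programming distance in which each kind of error carries its own real-valued cost. The following is a minimal sketch of that standard technique, given for illustration only. It is not the algorithm of either cited paper, nor the method of the present invention, and the function name `weighted_distance` and its parameter names are assumptions.

```python
def weighted_distance(term, target, sub_cost=1.0, del_cost=1.0,
                      ins_cost=1.0, swap_cost=1.0):
    """Cost of turning `term` into `target` under four independent error costs.

    sub_cost:  a character of `term` mismatches a character of `target`
    del_cost:  `term` has an extra character absent from `target`
    ins_cost:  `term` is missing a character present in `target`
    swap_cost: two adjacent characters are exchanged
    """
    m, n = len(term), len(target)
    # d[i][j] holds the cheapest cost of turning term[:i] into target[:j].
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = term[i - 1] == target[j - 1]
            best = min(
                d[i - 1][j - 1] + (0.0 if same else sub_cost),
                d[i - 1][j] + del_cost,
                d[i][j - 1] + ins_cost,
            )
            # An exchanged adjacent pair is priced on its own, not as the
            # sum of a missing character and an extra character.
            if (i > 1 and j > 1 and term[i - 1] == target[j - 2]
                    and term[i - 2] == target[j - 1]):
                best = min(best, d[i - 2][j - 2] + swap_cost)
            d[i][j] = best
    return d[m][n]


print(weighted_distance("recieve", "receive", swap_cost=0.5))
print(weighted_distance("O", "0", sub_cost=0.25))
```

Because the costs are ordinary floating-point parameters, an exchanged pair or a typographically similar substitution can be made cheaper than other errors, which is the kind of fine-tuning that small integer costs restrict.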
A third technique, also developed by U. Manber and S. Wu, is described in a paper, "Approximate String Matching with Arbitrary Costs for Text and Hypertext", dated February, 1990, and included in August, 1992 in the IAPR Workshop on Structural and Syntactic Pattern Recognition, Bern, Switzerland. This technique handles missing and extra characters. The authors note "one drawback of the algorithm is that it cannot handle substitutions; that is, we assume that the cost of replacing one character by another is the same as the cost of deleting the first character and inserting the second." A similar problem exists with regard to exchanged adjacent characters.
A fourth technique, described in U.S. Pat. No. 4,985,863 by Fujisawa et al., 1991, uses finite deterministic automata to search only literally for exact matches, but encodes into the OCR document text alternative identities of characters for which OCR had little certitude. This reference provides no support for missing characters, extra characters, or exchanged adjacent characters, and provides no general support for mismatched characters.
While each of these techniques may be suitable for specific limited uses, they are inconvenient for general use in finding a text string based on a search term when the number and type of errors in the search term is unknown. This limitation becomes especially acute as the number of distinct words in the document set grows very large.
Finite state automata have known uses in computer systems to parse a series of symbols to determine whether they match a specified pattern, where the symbols being analyzed are members of a finite symbol set, such as ASCII character codes. An automaton starts operation from an initial state or an initial set of states, and then sequentially processes an incoming stream of symbols. As each incoming symbol is processed, the automaton undergoes a change of state or states, depending on the previous state or states of the automaton and the identity of the incoming symbol. If and when the automaton reaches a terminal state just as the last of the incoming symbols is processed, the incoming stream of symbols is found to match a particular pattern that the automaton was constructed to identify. Otherwise, the stream is found not to match any of the patterns that the automaton was constructed to identify.
Automata may be either deterministic or non-deterministic. In a deterministic automaton, at each point in time, the automaton has a single current state, and there is a particular symbol which is going to be examined next. In the simplest cases, the result of processing that next symbol is that the automaton is put into a single successor current state, which may be the same state but in any event is completely determined by the predecessor state and the input symbol. This process continues until all the symbols have been processed, a terminal state has been reached, or an incoming character is received for which there is no valid transition.
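The operation of a deterministic automaton in the simplest case can be sketched as a table-driven loop. The example below is illustrative only: the function name `run_dfa`, the integer state numbering, and the sample pattern ("ab" followed by any number of "c" characters) are assumptions. A match is reported only if a terminal state is current when the last symbol has been processed.

```python
def run_dfa(transitions, start, terminals, symbols):
    """Run a deterministic automaton over `symbols`.

    `transitions` maps (state, symbol) to exactly one successor state.
    """
    state = start
    for symbol in symbols:
        key = (state, symbol)
        if key not in transitions:
            # No valid transition for this symbol: the stream cannot match.
            return False
        # The successor is completely determined by the state and the symbol.
        state = transitions[key]
    return state in terminals


# Hypothetical automaton: state 0 is initial, state 2 is terminal.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "c"): 2,  # a successor may be the same state
}

print(run_dfa(TRANSITIONS, 0, {2}, "abcc"))
print(run_dfa(TRANSITIONS, 0, {2}, "abx"))
```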
Depending on the design of the deterministic automaton and the succession of states and input symbols, there may arise cases where there is more than one viable next state. Since only one state may be current at one time, the automaton is copied as many times as there are viable next states, and each copy follows a different path through the sequence of states and next symbols. This tree of state sequences can have very large fanout, leading to great inefficiencies in processing. Even with backtracking, the process is fundamentally inefficient. The various sequences of successor states are exhaustively searched, one at a time, using backtracking whenever a particular path of states does not ultimately lead to a terminal state. As the tree of state sequences that needs examination grows, the amount of time required to perform such searching increases.
Deterministic automata are usable for such searching for small sets of known patterns, but are ill-suited for general use.
In a non-deterministic automaton, multiple current states are permitted, and incoming symbols may result in a change from each current state to any of several successor states. When the end of the incoming symbol stream is reached, a search is made to determine whether any of the current states of the automaton is a terminal state. If so, the incoming stream is found to match at least one of the patterns, although there may be no way to tell which particular pattern was matched.
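The behavior described above can be sketched by tracking a set of current states and advancing every member of that set on each incoming symbol, so that no copying of the automaton or backtracking is required. The example is illustrative only and is an ordinary literal-matching automaton, not the metric-based automaton of the invention. The function name `run_nfa`, the state numbering, and the sample pattern (strings over "a" and "b" that end in "ab") are assumptions.

```python
def run_nfa(transitions, starts, terminals, symbols):
    """Run a non-deterministic automaton over `symbols`.

    `transitions` maps (state, symbol) to a set of possible successor states.
    """
    current = set(starts)
    for symbol in symbols:
        # Advance every current state at once on the incoming symbol.
        current = {
            successor
            for state in current
            for successor in transitions.get((state, symbol), ())
        }
        if not current:
            # No state survived this symbol, so no pattern can match.
            return False
    # At the end of the stream, check whether any current state is terminal.
    return bool(current & set(terminals))


# Hypothetical automaton: state 0 is initial, state 2 is terminal.
NFA_TRANSITIONS = {
    (0, "a"): {0, 1},  # on "a", stay in 0 or guess that the final "ab" begins
    (0, "b"): {0},
    (1, "b"): {2},
}

print(run_nfa(NFA_TRANSITIONS, {0}, {2}, "aab"))
print(run_nfa(NFA_TRANSITIONS, {0}, {2}, "aba"))
```

The work per symbol is bounded by the number of states rather than by the number of distinct paths through them.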
There remains a need for an efficient general method and system for selectively retrieving information from a document set based on a potentially incorrect search term, and there remains an opportunity to apply finite-state non-deterministic automata technology to non-literal searching.