1. Field of the Invention
The present invention relates generally to the field of document storage and retrieval systems of the type used for multiple document manipulation. Specifically, the invention relates to a method and system for selectively searching and retrieving information from stored documents using a non-literal search strategy.
2. Brief Description of Background Art
Electronic storage of documents has facilitated the handling of large volumes of documents, such as those handled by hospitals, universities, government institutions, and the like. Typically, the documents are entered into mass storage systems by use of a scanner system that converts text into electronic data. Documents primarily containing text can readily be scanned and stored in various electronic forms in this manner. Selective retrieval of information from the stored document set, however, poses significant problems due to the volume of information to be searched.
Typical existing systems assign an index to each document as it is entered into storage. The index may be a system-generated or a user-defined code associated with each document. The code then is stored together with the document. To retrieve a document, a user must enter the appropriate code associated with the desired document. Other systems use key words extracted from the document which the user may then use to retrieve a document. The problem encountered with such systems is that a user may retrieve only entire documents, and must know the index, code, or key words associated with a desired document.
Other systems permit users to access selected information from a document set by entering a search term into the system. The system then reads through the entire document set to find an exact match for the entered search term. However, in some instances there may be a mismatch between the search term and the term in the document set. For example, a user may enter a wrong or unintended search term, such as by making a keyboarding or other error when entering the search term. As another example, there may be an error in the original text, in the OCR output, or in a manually entered key word. Existing systems that search for exact matches are incapable of handling such errors, whether in the search term or in the stored text, and would be unable to retrieve a desired document.
A non-literal, or "fuzzy", search involves entering a text string into a computer system and then searching for a "close" match of that text string in a stored text file. For example, a user may request a search on "recieve" (spelled incorrectly), and the system may find the correctly spelled word "receive". In another example, if the stored text file is obtained from optical character recognition (OCR) of an optically scanned document, often the OCR system misrecognizes characters that are typographically similar. The letter "O" may be misrecognized as the numeral "0", or the letter pair "rn" may be misrecognized as the single letter "m". In these instances, it would be desirable to retrieve text that is typographically close to the input text string.
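The notion of a "close" match described above can be illustrated with a short sketch. The text does not specify a closeness measure; the example below assumes plain Levenshtein edit distance (the number of single-character insertions, deletions, and substitutions) and a hypothetical threshold `max_distance`. The function names `edit_distance` and `fuzzy_find` are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (assumed closeness measure)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or keep a match
            ))
        prev = curr
    return prev[-1]


def fuzzy_find(query: str, words: list[str], max_distance: int = 2) -> list[str]:
    """Return the words within max_distance of query, closest first."""
    scored = [(edit_distance(query.lower(), w.lower()), w) for w in words]
    return [w for d, w in sorted(scored) if d <= max_distance]


# The misspelling "recieve" is two substitutions away from "receive".
print(edit_distance("recieve", "receive"))
# OCR-style confusion: "rn" read as "m" costs one substitution plus one deletion.
print(edit_distance("modern", "modem"))
print(fuzzy_find("recieve", ["receive", "record", "document"]))
```

Under this measure every character substitution costs the same. A measure that treats typographically similar characters, such as "O" and "0", as closer than unrelated characters would need weighted costs, which this sketch does not attempt.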
In prior art systems, once a user types in a search term, one or more "matches" are found in the target document set and presented to the user for selection. The "best" match term, as ultimately determined by which of the match terms is selected by the user, may be buried among a list of possible matches or may be at the top of the list. Typically, the order of displayed retrieved terms is based on criteria that are not user-dependent. However, if the same user repeatedly uses a system for retrieving documents and consistently makes the same keyboarding or other errors in entering a search request, or if consistent errors occur in stored text, such as OCR errors, it would be advantageous to have an adaptive prediction of the "best" match, based on past selections, automatically appear at or near the top of the selection list. Existing systems do not incorporate such an adaptive feature.
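The adaptive ordering described above might be sketched as follows. This is an illustration under stated assumptions, not the method of the invention: it assumes that past selections are kept as simple counts per pair of search term and chosen match, and that candidates are reordered by those counts. The class name `SelectionHistory` and its methods are hypothetical.

```python
from collections import Counter


class SelectionHistory:
    """Counts which match a user picked for each search term (assumed scheme)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, query: str, chosen: str) -> None:
        """Remember that the user selected `chosen` after searching `query`."""
        self._counts[(query, chosen)] += 1

    def rank(self, query: str, candidates: list[str]) -> list[str]:
        """Order candidates so the most frequently chosen ones come first.

        sorted() is stable, so candidates never chosen before keep the
        order in which the underlying search returned them.
        """
        return sorted(candidates, key=lambda c: -self._counts.get((query, c), 0))


history = SelectionHistory()
# The same user twice picked "receive" after typing "recieve".
history.record("recieve", "receive")
history.record("recieve", "receive")

# On the next search, the previously chosen term moves to the top of the list.
print(history.rank("recieve", ["recipe", "relieve", "receive"]))
```

Keeping the counts per search term means a correction learned for one recurring error does not affect the ordering for unrelated search terms.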
Thus, there remains a need for a method and system for selectively retrieving information from a document set based on an adaptive non-literal search strategy.