This invention relates to methods, apparatus and computer products for computer database searching and, more particularly, methods, apparatus and computer products for searching documents created using optical character recognition techniques.
Much of the information upon which business and government rely is, and has been, stored on paper. With the advent of readily accessible wide area networks, high-speed optical scanners, and cheap mass storage, there has been an attempt in recent years to make paper information machine-accessible.
Machine-accessible information has many advantages over paper. Electronic data storage is far less expensive than filing cabinets in storage rooms, especially once rent is considered. Retrieval times are measured in seconds or tenths of seconds rather than minutes, hours, or even days, particularly for information in large archives. Information replication is trivial, and multiple people can access a single document simultaneously. Unfortunately, the task of converting the mass of existing paper information into machine-accessible form is daunting.
One method scans each document using an optical scanner and automatically processes each document as it is scanned. An optical scanner creates an electronic image of a document. Optical character recognition (OCR) software processes the electronic image and creates an electronic text file representing the document. "Indexing" software reads each text file and creates an index for all of the documents. A search program can then use the index to locate documents that contain a specified word, or combination of words. The process of indexing and searching documents is referred to as full-text indexing and retrieval.
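The indexing and retrieval steps described above can be sketched roughly as follows. This is a minimal illustration, not the indexing software the text refers to: the names `build_index` and `search`, the tokenizing rule, and the two sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, *words):
    """Return the ids of documents that contain every one of the given words."""
    sets = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*sets) if sets else set()


# Hypothetical OCR output: document 2 contains a misread of "Internet".
documents = {
    1: "The Internet connects wide area networks.",
    2: "The intemet connects optical scanners.",
}
index = build_index(documents)
```

With this toy data, `search(index, "connects")` finds both documents, while `search(index, "internet")` finds only document 1, because the misread word in document 2 was indexed as a different word.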
Full-text indexing and retrieval has two powerful assets: it is fully automatic (and thus relatively inexpensive), and it is based directly upon the actual contents of the document scanned. High-end retrieval systems may include context sensitivity, which permits the location of documents that contain related words, in situations where a user specifies the subject of a document but not its exact phrasing. World Wide Web search engines use full-text retrieval engines to search millions of electronic documents.
Search engines sometimes fail to locate documents that have been created using scanners and OCR software. This is due to the existence of numerous errors in large databases made up of scanned documents. A large database may include more than a million documents and ten million pages. To search for a document, a user must specify a combination of words, perhaps three or more, that either make a document unique, or at least restrict the list of search results to a manageable size. If a potential target document includes errors in the keywords used for the search, the search engine will not locate the document. OCR programs often produce several errors per page. An example of such an error would be a letter, e.g., an upper case "I", misrepresented as a similar letter, e.g., a lower case "l" (el).
One solution to the problem is a "fuzzy search." Fuzzy searching is based on the concept that words containing errors are structurally similar to the true version of the word. For example, "internet" and "intemet" are structurally similar. The first word can be changed into the second by deleting one letter and substituting an "m" for the other. Fuzzy search routines count the changes necessary to change one word into another. If few enough changes are required, a match is reported. This is computationally expensive because, during a search, every unique word in the database is individually compared to the key word to determine whether there is a match. Because OCR errors frequently produce "unique words," the database containing the full-text index of a large archive can have more than a million unique words to compare to each key word. Even on a fast server, such a search takes time.
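The change counting performed by a fuzzy search routine can be illustrated with the standard edit-distance (Levenshtein) calculation. The sketch below is one common way to count changes and is assumed here for illustration only; the function names `edit_distance` and `fuzzy_search` and the `max_changes` parameter are hypothetical.

```python
def edit_distance(a, b):
    """Count the single-character insertions, deletions, and substitutions
    needed to change string a into string b (Levenshtein distance)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_search(key, vocabulary, max_changes=2):
    """Compare the key word against every word in the vocabulary and
    report those reachable within max_changes changes."""
    return [word for word in vocabulary if edit_distance(key, word) <= max_changes]


vocabulary = ["internet", "intemet", "undernet", "international"]
matches = fuzzy_search("internet", vocabulary)
```

Here `matches` contains "internet", "intemet", and "undernet". The loop in `fuzzy_search` visits every vocabulary word for every key word, which is the cost described above, and the match on "undernet" shows how change counting alone admits words that OCR would be unlikely to produce.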
In addition to the amount of time it takes, fuzzy searching can result in a large volume of "hits." In a large database, many searches return thousands of matches. "Internet" is similar to "intemet," but so is "intem," "undernet", and even "international". A search for "boat" might match "coat," even though an OCR program is very unlikely to confuse a "b" for a "c."
It is desirable to have a mechanism that allows a search engine to accurately locate electronic documents that have been created using OCR software. Preferably, such a mechanism will recognize errors that are typically produced by OCR software and account for errors having the highest probability of occurrence. Additionally, a preferable mechanism will minimize the amount of processing that occurs when a search is requested by a user, in order to reduce the time of each search.
In accordance with this invention, a method and computer product for processing a search request in order to compensate for characters and character strings improperly interpreted during optical character recognition (OCR) scanning are provided. After an alphanumeric search request is received, the mechanism of the invention determines variant words associated with the received alphanumeric search request according to a predefined table of possible OCR substitutions, the OCR substitutions' probability of occurrence, and a predefined threshold of probability of occurrences. A database with OCR scanned documents is then searched for the variant words.
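A minimal sketch of variant-word generation from a substitution table and a probability threshold might look like the following. The table contents, the probability values, the function name `variants`, and the simplification that an unchanged character carries probability 1.0 are all assumptions made for illustration and are not taken from the specification.

```python
# Hypothetical substitution table: source string -> [(OCR output, probability)].
SUBSTITUTIONS = {
    "rn": [("m", 0.10)],
    "I": [("l", 0.20), ("1", 0.05)],
    "l": [("I", 0.20), ("1", 0.05)],
}


def variants(word, threshold, table=SUBSTITUTIONS):
    """Return {variant: probability} for every variant of word whose
    combined substitution probability is at least threshold."""
    results = {}

    def extend(pos, built, probability):
        if probability < threshold:
            return  # too unlikely: abandon this partial variant
        if pos == len(word):
            results[built] = max(probability, results.get(built, 0.0))
            return
        # Keep the character as scanned (assumed probability 1.0).
        extend(pos + 1, built + word[pos], probability)
        # Apply every table entry whose source string starts at this position.
        for source, replacements in table.items():
            if word.startswith(source, pos):
                for replacement, p in replacements:
                    extend(pos + len(source), built + replacement, probability * p)

    extend(0, "", 1.0)
    return results


likely = variants("internet", threshold=0.05)
```

With these illustrative numbers, `likely` holds the original word and the single variant "intemet". A search would then look up each key in `likely` in the document index, so the per-search work is proportional to the number of variants rather than to the number of unique words in the database.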
In accordance with other aspects of the invention, variant words are determined by determining word segments that represent OCR interpretations of portions of the search request. A cumulative probability for each word segment is determined and, if the cumulative probability for a word segment is below a predetermined threshold, the word segment is rejected as a variant word.
In accordance with further aspects of the invention, a tree data structure is created, having branch nodes and substitution nodes. Each branch node represents a possible delineation of a character during OCR processing. Each substitution node represents a possible OCR substitution for the character corresponding to the parent branch node. The substitution nodes along a path from the root to a leaf node form a variant word. The cumulative probability for a substitution node is determined by multiplying the probability of occurrence for the node by the cumulative probability of occurrence for the node's grandparent substitution node.
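One way to picture the alternating tree of branch nodes and substitution nodes is sketched below. The class names, the `grow` and `collect` helpers, the sample table, and the assumption that a correctly read character has probability 1.0 are hypothetical choices for this example; the specification does not prescribe them.

```python
from dataclasses import dataclass, field


@dataclass
class SubstitutionNode:
    text: str           # what OCR may have produced for the parent segment
    probability: float  # probability of this substitution alone
    cumulative: float   # probability * grandparent's cumulative probability
    end: int            # position in the search word after the segment
    children: list = field(default_factory=list)  # BranchNode objects


@dataclass
class BranchNode:
    segment: str        # one possible delineation of the next character(s)
    children: list = field(default_factory=list)  # SubstitutionNode objects


def grow(word, pos, parent_cumulative, table, threshold):
    """Build the branch nodes for every segment of word that starts at pos."""
    longest = max((len(source) for source in table), default=1)
    branches = []
    for length in range(1, max(1, longest) + 1):
        if pos + length > len(word):
            break
        segment = word[pos:pos + length]
        options = list(table.get(segment, []))
        if length == 1:
            options.insert(0, (segment, 1.0))  # read correctly (assumed 1.0)
        branch = BranchNode(segment)
        for text, p in options:
            cumulative = p * parent_cumulative
            if cumulative < threshold:
                continue  # prune: reject this word segment
            node = SubstitutionNode(text, p, cumulative, pos + length)
            node.children = grow(word, node.end, cumulative, table, threshold)
            branch.children.append(node)
        if branch.children:
            branches.append(branch)
    return branches


def collect(branches, word_length, prefix=""):
    """Yield (variant, cumulative probability) for each root-to-leaf path."""
    for branch in branches:
        for node in branch.children:
            built = prefix + node.text
            if node.end == word_length:
                yield built, node.cumulative
            else:
                yield from collect(node.children, word_length, built)


table = {"rn": [("m", 0.10)], "l": [("I", 0.20)]}  # illustrative values only
roots = grow("internet", 0, 1.0, table, threshold=0.05)
found = dict(collect(roots, len("internet")))
```

In this sketch the branch node for the segment "rn" has a substitution child "m" whose cumulative probability is 0.10 times the cumulative probability of the substitution node two levels up. Because pruning happens while the tree is being built, subtrees below an unlikely substitution are never created.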
As will be readily appreciated from the foregoing summary, the invention provides a new and improved method, apparatus and computer product for word searching of electronic documents produced using optical character recognition. The invention reduces the number of documents that are missed during a search due to OCR errors when the documents are originally translated into electronic form. The invention also reduces the amount of time required to perform a search by minimizing the amount of processing that is performed after the search request is received. Finally, because the variant words constructed in this manner are rarely legitimate words in the natural language of the database, the number of false "hits" is greatly reduced.