The present invention relates generally to searching for query keywords in large databases of automatically recognized text and, more specifically, to a method for searching using approximate matching techniques that compensate for errors in text generated by an optical character recognition (OCR) system or a speech recognition (SR) system. The invention indexes the databases in a way that facilitates and speeds up the retrieval process.
Large text retrieval systems are often built by extracting text from documents using OCR, or from spoken words using a speech recognition system, and inserting the extracted text into a database. OCR and SR devices are prone to errors, and hence the database may contain erroneous words. This makes it difficult to retrieve documents that contain query words given by a user.
Today, it is common practice to store a large number of paper documents in a database. These documents need to be retrieved later by searching for words that appear in them. One way to achieve this is by extracting words from the digitized documents using OCR technology. These words are then stored in the database and are used for searching and retrieval.
In addition, speech recognition systems are becoming more widely used. With a commercially available speech recognition system, a user may dictate a new document for the database or convert an existing document from text to machine-readable form by simply reading the document into a microphone connected to a personal computer. While these systems are generally accurate with sufficient training, it is well known that some errors do occur. Typically, the speech recognition system allows these errors to be corrected manually by the user. A user may not, however, detect all of the errors, so the resulting database may include words that are not in the original text document.
Because neither OCR technology nor speech recognition technology is 100% error free, the scanned database may contain words that were misspelled in the conversion process. The conventional method of using direct searching techniques may not locate all of the appropriate documents in the database that contain a given query word because the corresponding word in the database is misspelled in some of the documents.
Another problem with the conventional method of using direct searching techniques is that the size of the database may be very large, and hence the search process may be very slow.
The conventional solution is insufficient to find words in the database that have been recognized incorrectly and are therefore misspelled. The conventional solution is also too slow when searching a large database. There is a need for a more efficient search algorithm.
To meet this and other needs, and in view of its purposes, the present invention provides a simple and effective method for searching for a query word in a hierarchical data structure. The data structure has branch nodes and leaf nodes; each branch node represents a respective portion of one or more words and each leaf node represents a word. The data structure is searched for each query word by selecting the first letter of the query word and also selecting a root node in the hierarchical data structure as the current node. All possible child nodes of the current node are identified. For each identified child node, a respective estimated probability value is calculated for matching the respective components of the query word with the components associated with the nodes in the path taken through the hierarchical data structure. The identified child nodes are then added to a list of candidate nodes. The candidate node with the highest probability value is selected as the current node and is then deleted from the list of candidate nodes. If a leaf node has been reached, then a determination is made whether to store the word in a list of best matches. Processing repeats for each portion of the query word. When all portions of the query word have been matched, the matched words with their respective probability values are stored in the list of best matches.
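By way of illustration only, the search summarized above may be sketched as a best-first traversal of a letter trie. The sketch is a simplified reading of the summary and not the claimed method itself: it assumes one letter per node, considers only letter-for-letter substitutions (so only words of the same length as the query can match), and marks the node that ends a word rather than using a separate leaf node. The names `Node`, `build_trie`, `char_prob`, and `best_matches`, and the probability values `p_match`, `p_confuse`, and `min_prob`, are hypothetical and chosen only for this example.

```python
import heapq
from itertools import count


class Node:
    """One node of the hierarchical data structure (a letter trie)."""

    def __init__(self):
        self.children = {}  # letter -> child Node
        self.word = None    # set when the path to this node spells a stored word


def build_trie(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, Node())
        node.word = w
    return root


def char_prob(query_ch, node_ch, p_match=0.9, p_confuse=0.02):
    # Assumed error model: a letter is recognized correctly with p_match,
    # and confused with any given other letter with p_confuse.
    return p_match if query_ch == node_ch else p_confuse


def best_matches(root, query, max_results=3, min_prob=1e-4):
    tie = count()  # tiebreaker so the heap never compares Node objects
    # Candidate list: (negated path probability, tiebreak, letters matched, node).
    candidates = [(-1.0, next(tie), 0, root)]
    matches = []
    while candidates and len(matches) < max_results:
        # Select the candidate with the highest probability and remove it.
        neg_p, _, depth, node = heapq.heappop(candidates)
        prob = -neg_p
        if depth == len(query):
            # All portions of the query are matched; keep the word, if any.
            if node.word is not None:
                matches.append((node.word, prob))
            continue
        # Identify all child nodes, score each path, and add it to the list.
        for ch, child in node.children.items():
            p = prob * char_prob(query[depth], ch)
            if p >= min_prob:  # drop paths that are too unlikely
                heapq.heappush(candidates, (-p, next(tie), depth + 1, child))
    return matches


if __name__ == "__main__":
    trie = build_trie(["modem", "model", "modern", "mode"])
    for word, prob in best_matches(trie, "nodel"):
        print(word, round(prob, 6))
```

Because each step multiplies the path probability by a value no greater than one, the candidate removed from the list always has the highest remaining probability, so matches are produced in descending order of probability and the search can stop as soon as enough matches have been collected.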
It is to be understood that both the foregoing general description and the following detailed description are exemplary, but are not restrictive, of the invention.