The value of computerized storage of handwritten documents would be greatly enhanced if they could be searched and retrieved in ways analogous to the methods used for text documents. If precise transcripts of handwritten documents exist, then information retrieval (IR) techniques can be applied. However, such transcripts are typically too costly to generate by hand, and machine recognition methods for automating the process of transcript generation are far from perfect. Thus, such transcripts are usually either incomplete or corrupted by incorrect transcriptions, or both.
Even though transcripts have these types of problems, IR is still used on them, primarily through two techniques commonly called “text-to-text” matching and “ink-to-ink” matching. With “text-to-text” matching, or simply text matching, each handwritten document is converted to text and a text query is compared to the text of the handwritten document to determine if there are any matches. Generally, most handwriting machine transcription systems generate a list of alternative words, with corresponding word scores, for each handwritten word (also called an “ink word” herein) in the document. The word score indicates the likelihood that the associated text word is the correct transcription for the corresponding ink word. The word in the list with the highest word score is selected as the word that is subsequently used for text matching.
One problem with the first technique for IR is that the query cannot be handwritten and must, instead, be typewritten or converted from handwriting to text, with the concomitant transcription errors. A second problem with this technique occurs because of the errors in transcription. An error in transcription can prevent a document from being retrieved when the document should be retrieved. For example, if a person writes the word “cat” as part of a document, the handwritten word “cat” may be converted to the following list of alternative text words: (1) “cut” with a word score of 100; (2) “cot” with a word score of 95; (3) “cat” with a word score of 94; and (4) “lot” with a word score of 10. When this document is transcribed and stored, the word “cut” has the highest word score, and will be selected as the most probable transcription. The word “cut” will be the only stored word. If a user types in the text query “cat,” this query may not find this document because this instance of the handwritten word “cat” is incorrectly transcribed. This is true even though the recognition list (or “stack”) contains the true transcription of the written word “cat.” Moreover, if the writer is consistent, it is likely that any handwritten instance of “cat” will be similarly erroneously transcribed.
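The “cat” example above can be made concrete with a minimal sketch. The data layout and the function names (`top_choice`, `matches_top_choice`, `query_in_stack`) are illustrative assumptions, not part of any particular recognition system; only the words and word scores are taken from the example.

```python
# Recognition list ("stack") for one ink word, using the word scores
# from the "cat" example. The list-of-tuples layout is an assumption.
stack = [("cut", 100), ("cot", 95), ("cat", 94), ("lot", 10)]


def top_choice(stack):
    """Return the alternative with the highest word score."""
    return max(stack, key=lambda pair: pair[1])[0]


def matches_top_choice(query, stack):
    """Text matching as described above: only the top-scoring word is stored."""
    return query == top_choice(stack)


def query_in_stack(query, stack):
    """Check whether the query appears anywhere in the recognition list.

    Shown only to illustrate that the correct word is present in the
    stack but discarded when just the top choice is kept.
    """
    return any(word == query for word, _score in stack)


stored_word = top_choice(stack)
found_by_text_matching = matches_top_choice("cat", stack)
present_in_stack = query_in_stack("cat", stack)
```

Here `stored_word` is “cut”, so the text query “cat” does not match the stored transcription, even though “cat” is present in the stack with a word score only slightly below the top score.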
When transcription is imperfect, high word redundancy in the target documents can compensate for the errors. However, this may not work if document word redundancy is low, as is common in short documents, or if recognition accuracy is low, as is common for some handwritten documents.
Some have addressed the problem of transcription errors on retrieval in the context of speech, which can be analogous to retrieval of handwritten documents. To reduce transcription errors during retrieval, one of these approaches relies on query expansion, while a second employs a variety of string distance methods, and a third uses global information about probable phoneme confusions in the form of an average confusion matrix for all data observed. These techniques are described in the following respective documents, the disclosures of which are incorporated herein by reference: Jourlin et al., “Improving retrieval on imperfect speech transcription,” Proc. of the 22nd Annual Int'l Ass'n of Computing Machinery (ACM) Special Interest Group on IR (SIGIR) Conf. on Research and Development in IR, 283-284 (August, 1999); Zobel et al., “Phonetic String Matching: Lessons from Information Retrieval,” Proc. of the 19th Ann. Int'l ACM SIGIR Conf. on Research and Development in IR, 166-172 (August, 1996); and Srinivasan et al., “Phonetic confusion matrix based spoken document retrieval,” Proc. of the 23rd Ann. Int'l ACM SIGIR Conf. on Research and Development in IR, 81-87 (July, 2000). While these approaches limit the effect of transcription errors, they still do not allow for handwritten queries.
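As a rough illustration of the string distance idea mentioned above, the following sketch uses Levenshtein edit distance as a stand-in. The particular distance measures used in the cited work are not described here, so the choice of edit distance, the function names, and the `max_distance` threshold are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def approximate_matches(query, transcript_words, max_distance=1):
    """Return transcript words within max_distance edits of the query.

    max_distance is a hypothetical tuning parameter.
    """
    return [w for w in transcript_words
            if edit_distance(query, w) <= max_distance]


# A transcript in which "cat" was mis-transcribed as "cut".
transcript = ["the", "cut", "sat", "lot"]
hits = approximate_matches("cat", transcript)
```

With this transcript, the query “cat” recovers the mis-transcribed “cut” but also matches the unrelated word “sat”, which shows the trade-off: loosening the match tolerates transcription errors at the cost of more false matches. The query itself must still be text.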
The second technique for IR on handwritten documents is matching a handwritten query to handwritten words in a handwritten document (often called “ink-to-ink” matching). A class of successful approaches uses template matching between query ink and document ink. This is explained in more detail in each of the following references, the disclosures of which are incorporated herein by reference: Aref et al., “The Handwritten Trie: Indexing Electronic Ink,” Proc. of the 1995 ACM Special Interest Group on Management of Data (SIGMOD) Int'l Conf. on Management of Data, 151-162 (May, 1995); El-Nasan et al., “Ink-Link,” Proc. of the 15th Int'l Conf. on Pattern Recognition, vol. 2, 573-576 (Sept., 2000); Lopresti et al., “On the Searchability of Electronic Ink,” Proc. of the 6th Int'l Workshop on the Frontiers of Handwriting Recognition (August, 1998); and Lopresti et al., “Crossdomain Searching Using Handwritten Queries,” Proc. of the 7th Int'l Workshop on the Frontiers of Handwriting Recognition (September, 2000). However, this method can be very slow if the number of documents to be searched is large and the match method is very complex. Additionally, it does not allow for text queries, and its accuracy suffers if the writing style of the query differs from that of the document.
Another approach successfully used subunits of handwriting to handle inaccuracies in machine transcription. This approach attempts to reduce the complexity of the recognition process at the expense of allowing certain handwritten words to become ambiguous. This approach is discussed in Cooper, “How to Read Less and Know More: Approximate OCR for Thai,” Proc. of the 20th Ann. Int'l ACM SIGIR Conf. on Research and Development in IR, 216-225 (July, 1997). This approach was found to work well in domains in which words were long and easily distinguishable and more poorly in domains with a lot of similar words. Again, this approach does not allow text queries.
Currently, therefore, retrieval techniques exist that allow a user to enter text or written queries, but not both, to search handwritten documents. Also, these techniques do not work satisfactorily when the transcription is imperfect.