The exemplary embodiment relates generally to recognition of handwritten words in document images without having to detect or identify the individual characters making up the words or the full text. It relates particularly to a system and method for weighting fonts for training a probabilistic model using samples of synthesized training word images, and finds application in document classification, processing, analysis, sorting, detection, handwritten word spotting, and related arts.
Text of electronically encoded documents tends to be found in either of two distinct formats, namely bitmap format and character code format. In the former, the text is defined in terms of an array of pixels corresponding to the visual appearance of the page. A binary image is one in which a given pixel is either ON (typically black) or OFF (typically white). A pixel can be represented by one bit in a larger data structure. A grayscale image is one where each pixel can assume one of a number of shades of gray ranging from white to black. An N-bit pixel can represent 2^N shades of gray. In a bitmap image, every pixel on the image has equal significance, and virtually any type of image (text, line graphics, and pictorial) can be represented this way. In character code format, the text is represented as a string of character codes, the most common being the ASCII codes. A character is typically represented by 8 bits.
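By way of illustration only, the two bitmap representations described above can be sketched in a few lines of Python. The function names `pack_row` and `gray_levels` are hypothetical, and the most-significant-bit-first packing order is an assumption made for the example; nothing here is prescribed by the exemplary embodiment.

```python
def pack_row(pixels):
    """Pack a row of binary pixels (1 = ON, 0 = OFF) into bytes.

    Each pixel occupies one bit of a larger structure (a byte).
    Most-significant-bit-first ordering is an assumption of this sketch.
    """
    out = bytearray((len(pixels) + 7) // 8)  # round up to whole bytes
    for i, p in enumerate(pixels):
        if p:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def gray_levels(n_bits):
    """Number of distinct shades an N-bit grayscale pixel can represent."""
    return 2 ** n_bits


# Nine binary pixels need two bytes; the last seven bits are padding.
packed = pack_row([1, 0, 0, 0, 0, 0, 0, 1, 1])
levels = gray_levels(8)
```

In this sketch a row of nine pixels is stored in two bytes, and an 8-bit pixel yields 256 gray levels.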
There are many applications where it is desirable for text to be extracted from a document or a portion thereof which is in bitmap format. For example, a document may be available only in a printed version. In the case of a mailroom, for example, documents, such as letters, often arrive in unstructured format, and for ease of processing, are classified into a number of pre-defined categories. Manual classification is a time consuming process, often requiring a reviewer to read a sufficient portion of the document to form a conclusion as to how it should be categorized. Methods have been developed for automating this process. In the case of typed text, for example, the standard solution includes performing optical character recognition (OCR) on each letter and extracting a representation of the document, e.g., as a bag-of-words (BoW) in which a histogram of word frequencies is generated. Classification of the letter can then be performed, based on the BoW histogram.
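As a minimal sketch of the bag-of-words representation mentioned above, the following Python fragment builds a word-frequency histogram from text that is assumed to have already been produced by an OCR engine. The function name `bag_of_words`, the simple tokenization rule, and the sample text are all assumptions made for illustration; a practical system would likely use a more careful tokenizer and a fixed vocabulary.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a histogram mapping each lowercase word to its frequency."""
    # Illustrative tokenization: runs of letters and apostrophes only.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


# Made-up OCR output, used only to exercise the sketch.
ocr_text = "Please cancel my contract. The contract ends in May."
hist = bag_of_words(ocr_text)
```

A classifier could then take such a histogram, restricted to a chosen vocabulary, as its input feature vector.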
However, a significant portion of the letter flow in a mailroom is typically handwritten. To handle handwritten text, one solution is to replace the OCR engine with a Handwriting Recognition (HWR) engine. However, this approach has at least two significant shortcomings: (i) the error rate of HWR engines is much higher than that of OCR engines and (ii) the processing time, i.e., the time it takes to obtain the full transcription of a page, is also very high (several seconds per page). When large numbers of documents are to be processed, as in the case of a mailroom, this is not a viable alternative for the handwritten letters.
“Word-spotting” methods have been developed to address the challenge of classifying handwritten documents. Such methods operate by detecting a specific keyword in a handwritten document without the need to perform a full transcription. For example, an organization dealing with contracts may wish to identify documents which include keywords such as “termination” or “cancellation” so that such documents can receive prompt attention. Other organizations may wish to characterize documents according to their subject matter for processing by different groups within the organization.
In current word spotting methods, handwritten samples of the keyword are extracted manually from sample documents and used to train a model which is then able to identify the keyword, with relatively good accuracy, when it appears in the document text. One system, based on hidden Markov models (HMMs), represents words as a concatenation of single-state character HMMs. This system employs segmentation of the characters prior to feature extraction. Another system uses multiple-state HMMs to model characters without requiring segmentation of words into characters.
Manual selection of handwritten samples can be time consuming. Accordingly, it has been proposed to learn statistical models to spot keywords in handwritten documents using, as training samples, word images synthesized automatically from computer fonts. However, there are many computer fonts available and selection of an appropriate set of fonts on which to train a keyword model may be time consuming. For example, assume that there is a pre-defined list of K keywords and that a handwritten corpus is labeled with respect to these keywords so that it is possible to measure the retrieval accuracy (e.g., in terms of precision and recall) with respect to the K keywords. Assume that a set of F fonts is available from which the best set of fonts is to be identified, based on accuracy. The font selection may involve testing the accuracy with models trained on different combinations of fonts. For example, models may be trained which combine the n best fonts (the fonts which, if used individually, score the highest), where n=1, 2, 3 . . . F. The retrieval accuracy is then determined for each model and for each keyword. An average score may be computed over all the keywords. The combination of fonts which produces the highest average of the retrieval accuracies is then selected. Such a heuristic is computationally intensive. For example, if K=10 keywords and F=25 fonts and a fairly small corpus of approximately 500 pages is used, it may take on the order of a day to select the optimal combination of fonts (on a single CPU of a 2.8 GHz AMD Opteron™ machine). Additionally, this method requires the hand-labeling of the corpus (or at least a subset of it) according to the set of K pre-defined keywords. This adds to the time consuming nature of the process.
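The font selection heuristic described above may be sketched as follows. This is an illustrative outline only: the function `select_fonts` and the callback `evaluate(font_subset, keyword)`, which is assumed to train a keyword model on the given fonts and return its retrieval accuracy on the labeled corpus, are hypothetical names. The toy scoring function at the end is fabricated purely to make the sketch runnable and does not reflect measured results.

```python
def select_fonts(fonts, keywords, evaluate):
    """Pick the 'n best fonts' combination with the highest mean accuracy.

    evaluate(font_subset, keyword) is a hypothetical callback returning
    the retrieval accuracy of a model trained on font_subset.
    """
    def mean_score(subset):
        # Average the retrieval accuracy over all K keywords.
        return sum(evaluate(subset, kw) for kw in keywords) / len(keywords)

    # Rank fonts by the score each one achieves when used individually.
    ranked = sorted(fonts, key=lambda f: mean_score([f]), reverse=True)

    best_subset, best_score = [], float("-inf")
    for n in range(1, len(ranked) + 1):  # n = 1, 2, ..., F
        subset = ranked[:n]
        score = mean_score(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy stand-in for training and evaluation (fabricated for illustration).
QUALITY = {"A": 0.9, "B": 0.85, "C": 0.2}


def toy_evaluate(subset, keyword):
    base = sum(QUALITY[f] for f in subset) / len(subset)
    return base + (0.1 if len(subset) >= 2 else 0.0)


best, score = select_fonts(["C", "A", "B"], ["termination", "cancellation"],
                           toy_evaluate)
```

Even in this simplified form, each call to the evaluation callback stands for training a model and scoring it against the corpus, and the loop makes on the order of 2·F·K such calls, which suggests why the heuristic is costly when F and K grow.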
Further, the fonts selected by this method may be influenced by the chosen keywords. Thus, if a user wishes to search for different keywords, the method may need to be repeated with the new keywords. While the number of keywords K could be increased, this would increase the computational cost and the amount of hand labeling required.
Additionally, the appropriate set of fonts may vary, depending on the type of handwritten documents on which the trained system is to be applied. For example, writing styles vary between countries and may also vary according to the age group of the writer, their occupation (e.g., doctor vs. school teacher), etc.
The exemplary embodiment provides a method for selection of weights for a set of computer fonts which overcomes the above-mentioned problems, and others.