The exemplary embodiment relates generally to recognition of handwritten words in document images without having to detect or identify the individual characters making up the words or the full text. It relates particularly to a document categorizer for handwritten documents. The categorizer is trained on document statistics generated by identifying discriminative words in training documents, using models for these words which may employ synthesized word images as training samples. The exemplary embodiment finds application in document classification, processing, analysis, sorting, detection, word spotting, and related arts.
Text of electronically encoded documents tends to be found in either of two distinct formats, namely bitmap format and character code format. In the former, the text is defined in terms of an array of pixels corresponding to the visual appearance of the page. A binary image is one in which a given pixel is either ON (typically black) or OFF (typically white). A pixel can be represented by one bit in a larger data structure. A grayscale image is one where each pixel can assume one of a number of shades of gray ranging from white to black. An N-bit pixel can represent 2^N shades of gray. In a bitmap image, every pixel on the image has equal significance, and virtually any type of image (text, line graphics, and pictorial) can be represented this way. In character code format, the text is represented as a string of character codes, the most common being the ASCII codes. A character is typically represented by 8 bits.
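By way of illustration only, the two formats can be contrasted in a short Python sketch. The function names, the 3x5 bitmap, and the bit-packing order (leftmost pixel in the most significant bit) are assumptions made for this example and are not part of any described system.

```python
def gray_levels(bits_per_pixel):
    """Number of distinct shades an N-bit pixel can represent (2^N)."""
    return 2 ** bits_per_pixel


# Hypothetical 3x5 binary bitmap of the letter "T": 1 = ON (black), 0 = OFF (white).
BITMAP_T = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]


def pack_row(row):
    """Pack a row of binary pixels into one integer, one bit per pixel.

    Assumption: the leftmost pixel is stored in the most significant bit.
    """
    value = 0
    for pixel in row:
        value = (value << 1) | pixel
    return value


def char_codes(text):
    """Character code format: one integer code per character (ASCII for basic Latin text)."""
    return [ord(c) for c in text]


packed_rows = [pack_row(row) for row in BITMAP_T]  # bitmap format: 15 pixels
codes = char_codes("T")                            # character code format: one code
```

In the bitmap, the letter exists only as pixel values, whereas in character code format the same letter is a single code, which is why text must be recognized before it can be extracted from a bitmap.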
There are many applications where it is desirable for text to be extracted from a document or a portion thereof which is in bitmap format. For example, a document may be available only in a printed version. In the case of a mailroom, documents, such as letters, often arrive in an unstructured format and, for ease of processing, are classified into a number of pre-defined categories. Manual classification is a time-consuming process, often requiring a reviewer to read a sufficient portion of the document to form a conclusion as to how it should be categorized. Methods have been developed for automating this process. In the case of typed text, for example, the standard solution includes performing optical character recognition (OCR) on each letter and extracting a representation of the document, e.g., as a bag-of-words (BoW) in which a histogram of word frequencies is generated. Classification of the letter can then be performed, based on the BoW histogram.
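The bag-of-words approach for typed text can be sketched as follows, assuming the OCR output is already available as a plain string. The tokenization rule, the cosine-similarity comparison against one reference histogram per category, and all function names are assumptions chosen for illustration; the classifier actually used in such systems is not specified here.

```python
from collections import Counter
import math
import re


def bow_histogram(text):
    """Build a bag-of-words histogram (word -> frequency) from OCR output text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two word-frequency histograms."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, category_histograms):
    """Assign the category whose reference histogram is most similar.

    Assumption: each category is summarized by a single reference histogram.
    """
    hist = bow_histogram(text)
    return max(category_histograms,
               key=lambda c: cosine(hist, category_histograms[c]))


# Hypothetical reference histograms built from one sample letter per category.
categories = {
    "cancellation": bow_histogram("please cancel my contract cancel the subscription"),
    "complaint": bow_histogram("I am unhappy with the service and wish to complain"),
}
label = classify("I want to cancel my contract", categories)
```

Note that the word order of the letter is discarded; only the frequency of each word contributes to the category decision.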
However, a significant portion of the letter flow in a mailroom is typically handwritten. To handle handwritten text, one solution would be to replace the OCR engine with a Handwriting Recognition (HWR) engine. However, this approach has at least two significant shortcomings: (i) the error rate of HWR engines is much higher than that of OCR engines and (ii) the processing time, i.e., the time it takes to obtain the full transcription of a page, is also very high (several seconds per page). When large numbers of documents are to be processed, as in the case of a mailroom, this is not a viable alternative for the handwritten letters.
“Word-spotting” methods have been developed to address the challenge of handwritten documents. Such methods operate by detecting a specific keyword in a handwritten document without the need to perform a full transcription. For example, an organization dealing with contracts may wish to identify documents which include keywords such as “termination” or “cancellation” so that such documents can receive prompt attention. Other organizations may wish to characterize documents according to their subject matter for processing by different groups within the organization.
In word spotting methods, handwritten samples of the keyword are extracted manually from sample documents and used to train a model which is then able to identify the keyword, with relatively good accuracy, when it appears in the document text. One system, based on hidden Markov models (HMMs), represents words as a concatenation of single-state character HMMs. This system employs segmentation of the characters prior to feature extraction. Another system uses multiple-state HMMs to model characters without requiring segmentation of words into characters.
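The idea of representing a word as a concatenation of single-state character HMMs can be sketched as a left-to-right transition matrix. This is a structural illustration only: the self-loop probability of 0.6, the added final exit state, and the function name are assumptions, and the emission models and training on handwritten samples are omitted.

```python
def word_hmm_transitions(word, self_loop=0.6):
    """Build a left-to-right word HMM by chaining one state per character.

    Each character state either stays in place (self-loop, modeling a
    character that spans several feature frames) or advances to the next
    character. A final absorbing exit state is appended (an assumption).
    """
    n = len(word)
    trans = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        trans[i][i] = self_loop            # remain in the current character
        trans[i][i + 1] = 1.0 - self_loop  # move on to the next character
    trans[n][n] = 1.0                      # exit state absorbs
    return list(word), trans


states, transitions = word_hmm_transitions("cat")
```

With a multiple-state character model, each character would contribute several chained states instead of one, while the word-level concatenation would remain the same.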
One drawback which limits the usefulness of word spotting methods is that the keyword(s) for a given category need to be chosen carefully by a human operator. For some categories, a single word may be sufficient to ensure that a large proportion of the documents is identified. With other categories, finding a single word is more difficult. The problem is compounded as the number of categories increases, since some keywords may be common to two or more categories. As a result, widespread deployment of word spotting techniques is difficult.