The exemplary embodiment relates generally to recognition of handwritten words in document images without having to detect or identify the individual characters making up the words or the full text. The exemplary embodiment finds application in document processing, analysis, sorting, detection, word spotting, and related arts.
Text of electronically encoded documents tends to be found in either of two distinct formats, namely bitmap format and character code format. In the former, the text is defined in terms of an array of pixels corresponding to the visual appearance of the page. A binary image is one in which a given pixel is either ON (typically black) or OFF (typically white). A pixel can be represented by one bit in a larger data structure. A grayscale image is one where each pixel can assume one of a number of shades of gray ranging from white to black. An N-bit pixel can represent 2^N shades of gray. In a bitmap image, every pixel on the image has equal significance, and virtually any type of image (text, line graphics, and pictorial) can be represented this way. In character code format, the text is represented as a string of character codes, the most common being the ASCII codes. A character is typically represented by 8 bits.
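By way of illustration only, the two formats described above can be sketched in a few lines of Python. The function names `pack_binary_row` and `gray_levels` and the most-significant-bit-first packing order are assumptions made for this sketch and are not taken from any particular system.

```python
def pack_binary_row(pixels):
    """Pack a row of ON (1) / OFF (0) pixels into bytes, one bit per pixel."""
    packed = bytearray((len(pixels) + 7) // 8)
    for i, pixel in enumerate(pixels):
        if pixel:
            # Assumed convention: the first pixel goes in the most significant bit.
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


def gray_levels(bits_per_pixel):
    """Number of distinct shades an N-bit pixel can represent (2 to the power N)."""
    return 2 ** bits_per_pixel


# Bitmap format: ten binary pixels occupy two bytes (the last six bits are padding).
row = [1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
packed_row = pack_binary_row(row)

# Character code format: each character of the text is stored as one ASCII code.
ascii_codes = list("word".encode("ascii"))
```

In the bitmap form, nothing in the packed bytes indicates which pixels belong to which character, whereas the character code form stores one code per character directly.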
There are many applications where it is desirable for character strings to be extracted from a document, or a portion thereof, which is in bitmap format. For example, a document may be available only in a printed version. In the domain of automated document processing, a common task involves the categorization of documents. Many of the documents to be categorized are received in paper form, either because of their “legal” significance, as a backlog of old documents to be archived, or as general-purpose correspondence, and they need to be classified. Various techniques exist for classifying documents, whether based on the visual appearance of the documents, on their textual content, or on templates. Each of these techniques has its specific advantages and drawbacks.
There are a number of applications where the identification of whole words rather than individual characters or recognition of the full text is sufficient. For example, in some applications, it may be desirable to identify whether documents, such as incoming mail, include one or more specific words. These documents may then be processed differently from the rest of the mail. For example, an organization dealing with contracts may wish to identify documents which include keywords such as “termination” or “cancellation” so that such documents can receive prompt attention. Other organizations may wish to characterize documents according to their subject matter for processing by different groups within the organization.
It has been shown that identification of whole words is more robust than identification of individual characters for degraded images containing broken and touching characters. One system, based on hidden Markov models (HMMs), represents words as a concatenation of single-state character HMMs. This system employs segmentation of the characters prior to feature extraction. Another system uses multiple-state HMMs to model characters without requiring segmentation of words into characters.
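The notion of representing a word as a concatenation of single-state character HMMs can be illustrated with the following minimal sketch. It is not the implementation of either system mentioned above: the function name `build_word_hmm`, the `"<end>"` state, and the self-loop probability are hypothetical, and the emission models that a real system would train for each character are omitted.

```python
def build_word_hmm(word, self_loop=0.6):
    """Chain one single-state model per character into a left-to-right word model.

    Returns (states, trans), where trans[i][j] is the probability of moving
    from state i to state j. The self_loop value is an assumed placeholder,
    not a trained parameter.
    """
    states = list(word) + ["<end>"]
    n = len(states)
    trans = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        trans[i][i] = self_loop            # remain within the current character
        trans[i][i + 1] = 1.0 - self_loop  # advance to the next character
    trans[n - 1][n - 1] = 1.0              # the end state is absorbing
    return states, trans


states, trans = build_word_hmm("cat", self_loop=0.5)
```

Because the word model is assembled from per-character models, a model for a new word can be composed in this way without retraining, provided that a model already exists for each of its characters.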
When such word spotting techniques are used for handwritten documents, a codebook is generated for the words of interest. This generally involves collecting a large number of handwritten samples for each word of interest, to be used in training of the system. As a result, such systems are often restricted to the detection of a limited set of keywords.
A method which is able to identify handwritten words in a document image quickly without the need for assembling a large collection of training samples of the words of interest is thus desirable for a variety of applications.