The exemplary embodiment relates generally to recognition of objects, such as text objects in document images, and more specifically to a technique for detecting character strings (keywords) in a document image without having to detect or identify the individual characters making up the strings or the full text. The exemplary embodiment finds application in document processing, analysis, sorting, detection, word spotting, and related arts.
Text of electronically encoded documents tends to be found in either of two distinct formats, namely bitmap format and character code format. In the former, the text is defined in terms of an array of pixels corresponding to the visual appearance of the page. A binary image is one in which a given pixel is either ON (typically black) or OFF (typically white). A pixel can be represented by one bit in a larger data structure. A grayscale image is one where each pixel can assume one of a number of shades of gray ranging from white to black. An N-bit pixel can represent 2^N shades of gray. In a bitmap image, every pixel on the image has equal significance, and virtually any type of image (text, line graphics, and pictorial) can be represented this way. In character code format, the text is represented as a string of character codes, the most common being the ASCII codes. A character is typically represented by 8 bits.
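By way of illustration only, the two representations described above can be sketched as follows. The function names, the choice of eight pixels per byte, and the most-significant-bit-first ordering are assumptions made for this example and are not taken from any particular file format.

```python
def gray_levels(bits_per_pixel: int) -> int:
    """Number of distinct shades an N-bit pixel can represent (2**N)."""
    return 2 ** bits_per_pixel


def pack_binary_row(pixels: list[int]) -> bytes:
    """Pack a row of binary pixels (1 = ON, 0 = OFF) into bytes.

    Assumption for this sketch: 8 pixels per byte, first pixel in the
    most significant bit, final byte padded with OFF pixels.
    """
    packed = bytearray()
    for start in range(0, len(pixels), 8):
        chunk = pixels[start:start + 8]
        chunk = chunk + [0] * (8 - len(chunk))  # pad a short final chunk
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | (bit & 1)
        packed.append(byte)
    return bytes(packed)


def ascii_codes(text: str) -> list[int]:
    """Character code format: one ASCII code per character."""
    return list(text.encode("ascii"))


if __name__ == "__main__":
    print(gray_levels(8))                             # shades for an 8-bit pixel
    print(pack_binary_row([1, 0, 0, 0, 0, 0, 0, 1]))  # one packed byte
    print(ascii_codes("Hi"))                          # character codes
```

The bitmap row carries no notion of which characters are present, whereas the character code list carries no notion of visual appearance, which is why a conversion or detection step is needed to move from the former to the latter.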
There are many applications where it is desirable for character strings to be extracted from a document, or a portion thereof, which is in bitmap format. For example, a document may be available only in a printed version. In the domain of automated document processing, a common task involves the categorization of documents. Many of the documents to be categorized are received in paper form, either because of their “legal” significance, as a backlog of old documents to be archived, or as general-purpose correspondence, and they need to be classified. Various techniques exist for classifying documents, whether based on the visual appearance of the documents, on their textual content, or on templates. All these techniques have their specific advantages and drawbacks.
By performing optical character recognition (OCR), a document in bitmap format, such as a scanned physical document, can be converted into a character code format, such as an ASCII text format, an XML format including text, a format compatible with a selected word processor, or another symbolic representation. The OCR-converted document can then be searched for certain keywords or other textual features to, for example, classify documents or identify documents pertaining to a particular subject. OCR has numerous advantages, but is computationally intensive. In many applications, it is not practical to apply OCR to every received document.
There are a number of applications where the identification of whole words rather than individual characters or recognition of the full text is sufficient. For example, in some applications, it may be desirable to identify documents, such as incoming mail, which include any one of a set of triggering words. These documents may then be processed differently from the rest of the mail. For example, an organization dealing with contracts may wish to identify documents which include keywords such as “termination” or “cancellation” so that such documents can receive prompt attention. Other organizations may wish to characterize documents according to their subject matter for processing by different groups within the organization.
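The routing decision described above can be sketched as follows, by way of illustration only. The sketch assumes that a hypothetical upstream step has already produced the list of words found in each document; it does not show how those words are detected in the image. The names `TRIGGER_WORDS` and `route_document`, and the queue labels "priority" and "standard", are assumptions made for this example.

```python
# Hypothetical trigger set, taken from the contract example in the text.
TRIGGER_WORDS = {"termination", "cancellation"}


def route_document(spotted_words, trigger_words=TRIGGER_WORDS):
    """Return (queue, matched_words) for one document.

    spotted_words: words already detected in the document by some
    upstream step (assumed here, not implemented).
    A document containing any trigger word goes to the "priority" queue;
    all other documents go to the "standard" queue.
    """
    found = {word.lower() for word in spotted_words}
    wanted = {word.lower() for word in trigger_words}
    hits = sorted(found & wanted)
    queue = "priority" if hits else "standard"
    return queue, hits


if __name__ == "__main__":
    print(route_document(["Notice", "of", "Termination"]))
    print(route_document(["Invoice", "attached"]))
```

Only the presence or absence of a small set of words drives the decision, which is why full-text recognition is not needed for this kind of task.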
It has been shown that identification of whole words is more robust than recognition of individual characters for degraded images containing broken and touching characters. One system, based on hidden Markov models (HMMs), represents words as a concatenation of single-state character HMMs. This system requires segmentation of the characters prior to feature extraction. Another system uses multiple-state HMMs to model characters without requiring segmentation of words into characters. However, segmentation of words into sub-character segments based on stroke and arc analysis is required prior to feature extraction. In both of these HMM-based systems, the segmentation can introduce errors at an early stage in processing.
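To make the notion of a word model formed by concatenating single-state character HMMs concrete, the following minimal sketch builds the state transition matrix of such a word model. It is an illustration under stated assumptions and not the implementation of either system mentioned above: a left-to-right topology is assumed, the function name `word_transition_matrix` is hypothetical, and the self-loop probabilities are made-up values. Emission models and feature extraction are omitted.

```python
def word_transition_matrix(word, self_loop):
    """Build the transition matrix of a left-to-right word HMM.

    Each character contributes one state. State i either stays in place
    with probability self_loop[char] or advances to the next state with
    the remaining probability. An extra final column represents the exit
    from the word model.

    word: the keyword to model.
    self_loop: mapping from character to its (assumed) self-loop probability.
    Returns a list of len(word) rows, each with len(word) + 1 columns.
    """
    n = len(word)
    matrix = [[0.0] * (n + 1) for _ in range(n)]
    for i, char in enumerate(word):
        stay = self_loop[char]
        matrix[i][i] = stay            # remain in the current character state
        matrix[i][i + 1] = 1.0 - stay  # move on to the next character (or exit)
    return matrix


if __name__ == "__main__":
    # Hypothetical self-loop probabilities, for illustration only.
    for row in word_transition_matrix("ab", {"a": 0.5, "b": 0.25}):
        print(row)
```

In such a chain each state stands for one character, so the observations assigned to each state correspond to one character's extent in the image. That correspondence is why character boundaries matter to the first system described above.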
A method which is able to identify whole words in a document image quickly and with a high degree of accuracy is thus desirable for a variety of applications.