The present invention relates generally to text recognition and more specifically to a technique for recognizing character strings (keywords) without having to detect or identify the individual characters making up the strings.
Text of electronically encoded documents tends to be found in either of two distinct formats, namely bitmap format and character code format. In the former, the text is defined in terms of an array of pixels corresponding to the visual appearance of the page. A binary image is one where a given pixel is either ON (typically black) or OFF (typically white). A pixel can be represented by one bit in a larger data structure. A grayscale image is one where each pixel can assume one of a number of shades of gray ranging from white to black. An N-bit pixel can represent 2^N shades of gray. In a bitmap image, every pixel on the image has equal significance, and virtually any type of image (text, line graphics, and pictorial) can be represented this way. In character code format, the text is represented as a string of character codes, the most common being the ASCII codes. A character is typically represented by 8 bits.
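The two formats can be contrasted with a minimal Python sketch. The function names and the most-significant-bit-first packing order are assumptions chosen for illustration only; the text above does not prescribe any particular layout.

```python
def pack_binary_row(pixels):
    """Pack a row of binary pixels (1 = ON, 0 = OFF) into bytes, one bit per pixel.

    Assumption: the leftmost pixel goes in the most significant bit of each byte,
    and the last byte is padded with OFF pixels.
    """
    packed = bytearray((len(pixels) + 7) // 8)
    for i, pixel in enumerate(pixels):
        if pixel:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


def gray_levels(n_bits):
    """Number of distinct shades an N-bit grayscale pixel can represent (2^N)."""
    return 2 ** n_bits


def to_character_codes(text):
    """Represent text in character code format as a list of ASCII codes."""
    return list(text.encode("ascii"))
```

For example, nine binary pixels occupy two bytes in this packing, an 8-bit pixel yields 256 gray levels, and the word "OCR" becomes the three codes 79, 67, 82.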
There are many applications where a document must be converted from bitmap format to character code format. For example, a body of text may be available only in a printed version, yet need to be input to a word processing program for editing. The choice is typically between manually inputting the text, character by character at a keyboard, and scanning the document and using optical character recognition (OCR) techniques to convert the bitmap image into a character code file. Proofing of the resultant document is usually required.
OCR is a well-developed and continually developing technology, but has inherent weaknesses. When the electronic document has been derived by scanning a paper document, there is an inevitable loss of image quality. If the scanned image is a second- or third-generation photocopy, the problem is exacerbated. A particular problem in this regard is the tendency of characters in text to blur or merge. Since OCR is based on the assumption that a character is an independent set of connected pixels, character identification fails when characters have merged. The OCR process carries a significant cost in terms of time and processing effort, since each bitmap of a character must be distinguished from its neighbors, its appearance analyzed, and a decision made identifying it as a distinct character in a predetermined set of characters.
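The effect of merged characters on the connected-pixel assumption can be illustrated with a small Python sketch. It is a simplified illustration, not a description of any particular OCR system: the function name, the use of 4-connectivity, and the toy images (with '#' for ON and '.' for OFF) are all assumptions.

```python
from collections import deque


def count_components(rows):
    """Count 4-connected groups of ON pixels ('#') in a small binary image."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if rows[y][x] != "#" or seen[y][x]:
                continue
            # Found an unvisited ON pixel: flood-fill its whole component.
            count += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and rows[ny][nx] == "#" and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


# Two hypothetical glyphs separated by a column of OFF pixels.
CLEAN = ["##.##",
         "##.##",
         "##.##"]

# The same glyphs after blurring bridges the gap in the middle row.
MERGED = ["##.##",
          "#####",
          "##.##"]
```

Here `count_components(CLEAN)` finds two components, while `count_components(MERGED)` finds only one, so a recognizer that treats each connected component as one character would see a single unknown shape in place of two characters.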
There are a number of applications, however, that require only the identification of whole words rather than individual characters. It has been shown that identification of whole words is more robust for degraded images containing broken and touching characters (See Ho, Hull, and Srihari). One system, based on hidden Markov models (HMMs), represents words as a concatenation of single-state character HMMs (See He, Chen, and Kundu). This system requires segmentation of the characters prior to feature extraction. Another system uses multiple-state HMMs to model characters without requiring segmentation of words into characters (See Bose and Kuo). However, segmentation of words into sub-character segments based on stroke and arc analysis is required prior to feature extraction. In both these HMM-based systems, segmentation can introduce errors at an early stage in processing.