Text in electronically encoded documents (electronic documents) tends to be found in either of two formats, each distinct from the other. In a first format, the text may be in a bitmap format, in which text is defined only in terms of an array of image data or pixels, essentially indistinguishable from adjacent images which are similarly represented. In this format, text is generally incapable of being subjected to processing by a computer based on textual content alone and must be segmented into image units for processing as described, for example, in the copending application for "Methods and Apparatus for Automatic Modification of Semantically Significant Portions of a Document Without Document Image Decoding," Huttenlocher et al., Ser. No. 07/795,174, filed Nov. 19, 1991. In a second format, hereinafter referred to as a character code format, the text is represented as a string of character codes (e.g. ASCII code). In the character code format, the image or bitmap of the text is not available.
Conversion from bitmap to character code format using an optical character recognition (OCR) process carries a significant cost in terms of time and processing effort. Each bitmap of a character must be distinguished from its neighbors, its appearance analyzed, and in a decision making process, identified as a distinct character in a predetermined set of characters. As examples of OCR techniques, U.S. Pat. No. 4,864,628 to Scott discloses a method for reading data which circumnavigates a character image. U.S. Pat. No. 4,326,190 to Borland et al. teaches a character feature detection system for reading alphanumeric characters. In addition, U.S. Pat. No. 4,956,869 to Miyatake et al. suggests a more efficient method for tracing contour lines to prepare contour coordinates of a figure within an image consisting of a plurality of lines.
When the electronic document has been derived by scanning an original, however, image quality and noise in its reproduction contribute to uncertainty in the actual appearance of the bitmap. A degraded bitmap appearance may be caused by an original document of poor quality, by scanning error, or by similar factors affecting the digitized representation of the image. Therefore, the decision process employed in identifying a character has an inherent uncertainty about it. A particular problem in this regard is the tendency of characters in text to blur, or merge. Most character identifying processes commence with an assumption that a character is an independent set of connected pixels. When this assumption fails, due to the quality of the input image, character identification also fails.
The following patents illustrate particularly relevant approaches to improving character detection. U.S. Pat. No. 4,926,490 to Mano discloses a method and apparatus for recognizing skewed characters on a document. A rectangle is created around each character image, oriented with the detection orientation rather than the image orientation, and position data for each rectangle is stored in a table. The rectangle is created by detecting a character's outline. U.S. Pat. No. 4,558,461 to Schlang discloses a text line bounding system wherein skewed text is adjusted by analyzing vertical patches of a document. After the skew has been determined, each text line is bounded by determining a top, bottom, left, and right boundary of the text line. U.S. Pat. No. 3,295,105 to Gray et al. discloses a scan controller for normalizing a character in a character recognition apparatus wherein a character is analyzed by determining certain character characteristics including top, bottom, right and left character boundaries. U.S. Pat. No. 4,918,740 to Ross discloses a processing means for use in an optical character recognition system wherein sub-line information is used to analyze a character and identify it. U.S. Pat. No. 4,949,392 to Barski et al. discloses a document recognition system which recognizes an unknown document form by comparison against a library of templates, thus allowing for the intelligent association of text characters in certain locations of the unknown document to aid in the recognition thereof. U.S. Pat. No. 5,142,589 to Lougheed et al. discloses a system for repairing digital images of broken characters which first dilates the character strokes to fill small gaps therein and then erodes the image to conform to the original strokes, thereby producing recognizable characters before separation into individual digits for recognition. U.S. Pat. No. 5,214,719 to Budd et al. teaches a character recognition system and method for teaching and recognizing characters. The method obtains an image, identifies a character, samples the character, and then does a vector correlation of the sample points to stored points of known characters to recognize the character.
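The dilate-then-erode repair attributed above to Lougheed et al. corresponds in general terms to what is commonly called a morphological closing. The following is a minimal sketch of a generic closing on a binary image, offered for illustration only and not as the procedure of the cited patent. The 3x3 neighborhood, the border handling (pixels outside the image are ignored), and the function names dilate, erode, and close_gaps are assumptions made for this example.

```python
from typing import List

Grid = List[List[int]]  # rows of 0 (background) and 1 (ink)


def _neighborhood_op(img: Grid, use_max: bool) -> Grid:
    """Apply a 3x3 max (dilation) or min (erosion) filter.

    Assumption: pixels outside the image are ignored rather than padded.
    """
    height = len(img)
    width = len(img[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            values = [
                img[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            out[y][x] = max(values) if use_max else min(values)
    return out


def dilate(img: Grid) -> Grid:
    """Thicken strokes so that small gaps are filled."""
    return _neighborhood_op(img, use_max=True)


def erode(img: Grid) -> Grid:
    """Thin strokes back toward their original extent."""
    return _neighborhood_op(img, use_max=False)


def close_gaps(img: Grid) -> Grid:
    """Dilate, then erode: a generic morphological closing."""
    return erode(dilate(img))


if __name__ == "__main__":
    # A horizontal stroke with a one-pixel break in the middle.
    broken = [
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 1, 0, 1, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0],
    ]
    for row in close_gaps(broken):
        print("".join("#" if v else "." for v in row))
```

In this sketch a break of one pixel is bridged while the stroke keeps its original length and thickness; wider breaks would need a larger neighborhood, which is a tuning choice outside the scope of the example.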
OCR methods have sought to segment images in various fashions. For example, U.S. Pat. No. 4,558,461 to Schlang suggests a text line bounding system for nonmechanically adjusting for skewed text in scanned text. The skew angle of the text is first established, following which the text lines are statistically bounded. The actual text data is then rotated according to the orientation established for conventional processing. U.S. Pat. No. 4,809,344 to Peppers et al. teaches preprocessing for character recognition so as to obtain the data necessary for character recognition. Page segmentation is performed by simultaneously extracting a plurality of features, so that separation between lines, separation between characters, and separation between the lines and the characters are performed simultaneously. The calculation time for normalizing the separated individual characters can thereby be reduced, so that the preprocessing required for character recognition is performed systematically at high speed.
OCR methods have sought to improve reliability by use of dictionary word verification methods, such as described in U.S. Pat. No. 4,010,445 to Hoshino. However, the underlying problem of accurate character detection of each character in a character string remains. The article "F6365 Japanese Document Reader" Fujitsu Sci. Tech. J., 26, 3, pp. 224-233 (October 1990) shows a character reader using the steps of block extraction, skew adjustment, block division, adjacent character segmentation, line extractions, and character recognition by pattern matching, with dictionary checking, and comparison.
It might be desirable to identify a set of characters forming a word or character string as such, as shown, for example, in U.S. Pat. No. 2,905,927 to Reed, in which, for a text string, a set of three scans across the text, parallel to its reading orientation, is employed, each scan deriving information about transitions from black to white across the scan. U.S. Pat. No. 4,155,072 to Kawa suggests a similar arrangement, operable to produce a set of values representative of the leading and trailing edges of the character.
In addition to OCR systems operating on printed or typed textual images, numerous references deal with recognition of handwritten text which has been converted into an electronic representation. U.S. Pat. No. 4,731,857 to Tappert shows processing a word with the segmentation and recognition steps combined into an overall scheme. U.S. Pat. No. 4,764,972 to Yoshida et al. suggests a recognition system for recognizing a plurality of handwritten characters. U.S. Pat. No. 4,933,977 to Ohnishi et al. discloses a method for identifying a plurality of handwritten connected figures, including identifying and prioritizing branches of the connected figures. Finally, U.S. Pat. No. 5,216,725 to McCubbrey teaches a computer system for mail sorting of hand-addressed envelopes that first calculates an interstroke distance for character strokes within a digitized address and then, using the interstroke distance, groups the strokes into words for further processing.
The choice of entire words as the basic unit of recognition has also been considered in signature recognition, where no attempt is made to maintain characters as having separate identities, and is suggested by U.S. Pat. No. 3,133,266 to Frishkopf, which still relies on subsequent feature identification methods for identifying characteristics of the image of the character. Signature recognition has also used comparison techniques between samples and known signatures, as shown in U.S. Pat. No. 4,495,644 to Parks et al. and U.S. Pat. No. 4,701,960 to Scott, which suggest that features plotted on x-y coordinates during the signature process can be stored and used for signature verification.
Alternative modes of expressing character recognition are known. U.S. Pat. No. 4,949,281 to Hillenbrand et al. teaches the use of polynomials for generating and reproducing graphic objects, where the objects are predetermined in the form of reference contours in contour coordinates.
Certain signal processing techniques for comparing known signals to unknown signals are available if the word can be expressed in a relatively simple manner. U.S. Pat. No. 4,400,828 to Pirz et al. discloses a spoken word recognizer wherein an input word is recognized from a set of reference words by generating signals representative of the correspondence of an input word and the set of reference words and selecting a closest match. U.S. Pat. No. 4,977,603 to Irie et al. teaches an arrangement for pattern recognition utilizing the multiple similarity method, capable of taking structural features of a pattern to be recognized into account, so that sufficiently accurate pattern recognition can be achieved even when the pattern may involve complicated and diverse variations. "An Efficiently Computable Metric for Comparing Polygon Shapes," by Arkin, Chew, Huttenlocher, Kedem and Mitchell, Proceedings of First Annual ACM-SIAM Symposium on Discrete Algorithms, January 1990 (pp. 129-137) suggests that metrics can be established for shape matching.
The present invention seeks to avoid the problems inherent in OCR techniques, while potentially utilizing the fundamental characteristics of words and text strings. Word-to-word spacing tends to be larger than character-to-character spacing and therefore allows improved isolation and identification of tokens composed of character strings, as compared to identification of individual characters within the tokens. OCR methods, however, tend to require several correct decisions about aspects of a character preparatory to a correct identification, including identification of portions of the character as ascenders, descenders, curves, etc., all of which are fallible. The present invention, on the other hand, facilitates more reliable identification and recognition of sets of connected components (referred to herein as tokens), such as words, symbols or strings of characters. In one embodiment, the present invention employs word boundaries to initially determine characteristics of the text or symbol lines within the image. Subsequently, comparison of the tokens isolated within the boundaries to one another or to known tokens in a token image dictionary may be completed. Hence, classifications of the token are not made until the comparisons occur, thereby eliminating the impact of invalid partial classifications which may cause subsequent erroneous comparisons and decisions.
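The observation that word-to-word spacing tends to be larger than character-to-character spacing can be illustrated with a minimal sketch. It is not the segmentation procedure of the invention; it assumes that a single text line has already been reduced to a per-column count of dark pixels, and that a fixed gap threshold is available. The names split_into_tokens, column_ink, and min_word_gap are hypothetical and introduced only for this example.

```python
from typing import List, Tuple


def split_into_tokens(column_ink: List[int], min_word_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) column spans, end exclusive, of tokens in one text line.

    column_ink[i] is the number of dark pixels in column i. Runs of blank
    columns shorter than min_word_gap are treated as gaps between characters
    and stay inside a token; runs at least that long separate tokens.
    """
    tokens: List[Tuple[int, int]] = []
    start = None      # first inked column of the token being built
    last_ink = None   # most recent inked column seen so far
    for x, ink in enumerate(column_ink):
        if ink <= 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_word_gap:
            # The blank run is wide enough to count as a word gap.
            tokens.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        tokens.append((start, last_ink + 1))
    return tokens


if __name__ == "__main__":
    # Two "words": narrow gaps inside each, one wide gap between them.
    line = [0, 3, 2, 0, 4, 0, 0, 0, 0, 5, 1, 0, 2, 0]
    print(split_into_tokens(line, min_word_gap=3))
```

Note that no decision about individual characters is made in this sketch; blurred or merged characters inside a token do not affect where the token boundaries fall, provided the wider word gaps survive.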
In examining potential uses of computer processed text, it has been determined that, at least in certain cases, deriving each letter of the word is not necessary to meet the processing requirements. Thus, for example, in a key word search of a text image, rather than converting, via OCR techniques, each letter of each word, and subsequently determining from the possibly flawed character coding whether one or more key words are present, a computer might instead generate and compare the shapes of tokens within the text image with the shape of a token representing the key word, and evaluate whether the key word is present by token shape comparison. The output of such a system would most likely present an indication of the presence of the key words to an accuracy acceptable to a user. Furthermore, it is believed that the novel method described herein will have processing speed advantages over methods designed for character recognition. Moreover, the present invention may also have applications in image editing systems and is, therefore, not intended to be limited to the embodiment described.
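The key word search by token shape comparison described above can be sketched as follows. This is an illustration under stated assumptions and not the comparison method of the invention: each token shape is assumed to be summarized as a one-dimensional profile (for example, a contour height per column), profiles are resampled to a common length by linear interpolation, and the mean absolute difference serves as the distance. The names resample, shape_distance, and find_keyword, the sample count, and the threshold are all hypothetical.

```python
from typing import List, Sequence


def resample(profile: Sequence[float], n: int) -> List[float]:
    """Linearly interpolate a 1-D shape profile to exactly n samples."""
    if not profile:
        raise ValueError("profile must not be empty")
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(profile) == 1 or n == 1:
        return [float(profile[0])] * n
    scale = (len(profile) - 1) / (n - 1)
    out = []
    for i in range(n):
        pos = i * scale
        lo = min(int(pos), len(profile) - 1)
        hi = min(lo + 1, len(profile) - 1)
        frac = pos - lo
        out.append(profile[lo] * (1.0 - frac) + profile[hi] * frac)
    return out


def shape_distance(a: Sequence[float], b: Sequence[float], samples: int = 32) -> float:
    """Mean absolute difference between two profiles after resampling."""
    ra, rb = resample(a, samples), resample(b, samples)
    return sum(abs(x - y) for x, y in zip(ra, rb)) / samples


def find_keyword(token_profiles: Sequence[Sequence[float]],
                 keyword_profile: Sequence[float],
                 max_distance: float) -> List[int]:
    """Indices of tokens whose shape lies within max_distance of the key word."""
    return [i for i, profile in enumerate(token_profiles)
            if shape_distance(profile, keyword_profile) <= max_distance]


if __name__ == "__main__":
    keyword = [1, 3, 3, 1, 2]
    tokens = [[1, 3, 3, 1, 2], [5, 5, 5, 5, 5], [1, 3, 3, 1, 2.1]]
    print(find_keyword(tokens, keyword, max_distance=0.2))
```

The third token in the usage example differs slightly from the key word profile yet is still reported, which mirrors the idea that a small local defect changes the overall shape distance only slightly.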
The probability of an incorrect determination of a letter by OCR methods may be relatively low; however, the probabilities are multiplicatively cumulative over an entire word, in accordance with the product rule. Hence, using OCR to convert words into character code strings prior to searching for or recognizing the words may result in considerable error. The present invention utilizes token level, or in a text recognition embodiment word level, segmentation of the image data to enable subsequent recognition in a manner similar to that which humans use while reading or skimming a text passage. Moreover, the described token shape recognition process has several advantages. First, the bitmap image data is not irretrievably lost, and a reasonable representation of the bitmap remains so that a user may examine a reconstructed bitmap for character, symbol, glyph or word determination, if desired. Second, by utilizing connected components (tokens), each symbolic element (e.g., character) has the context of the entire token (e.g., word) to assist in the token's comparison to other token shapes. For example, the presence of a poorly formed letter in a word token only minimally affects the total identifiability of the word shape, by only slightly decreasing the probability of a match between two compared tokens representing the word. In addition, when considered in comparison with the performance of OCR methods, which are more likely to result in mistakes for words having more characters, the present invention generally enables a more robust word recognition capability.
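The multiplicative accumulation of per-character error noted above can be made concrete with a short calculation. The sketch assumes, purely for illustration, that each character is recognized independently and with the same accuracy; neither assumption nor any figure used below is taken from a measured OCR system, and the function name word_accuracy is hypothetical.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character of a word is recognized correctly.

    Assumes independent errors and a uniform per-character accuracy, so the
    product rule reduces to char_accuracy raised to the word length.
    """
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must lie between 0 and 1")
    if word_length < 0:
        raise ValueError("word_length must not be negative")
    return char_accuracy ** word_length


if __name__ == "__main__":
    # Longer words are less likely to be converted without any error.
    for length in (1, 5, 10):
        print(length, round(word_accuracy(0.99, length), 4))
```

With an assumed per-character accuracy of 0.99, for instance, a ten-character word is converted without error with a probability of roughly 0.90, which illustrates why longer words are more likely to contain at least one mistake.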
OCR methods convert from a bitmap to a representative character code, thereby losing the informational content of the bitmap. In general, the process is not reversible to obtain the original bitmap from the character code. However, identification of word tokens based on shape, as described in accordance with one aspect of the present invention, retains bitmap information further into the recognition process, thereby enabling reconstruction of the bitmap.