Text in electronically encoded documents (electronic documents) tends to be found in either of two formats, each distinct from the other. In a first format, the text may be in a bitmap format, in which text is defined only in terms of an array of image data or pixels, essentially indistinguishable from adjacent images which are similarly represented. In this format, text is generally incapable of being subjected to processing by a computer based on textual content alone. In a second format, hereinafter referred to as a character code format, the text is represented as a string of character codes (e.g. ASCII code). In the character code format, the image or bitmap of the text is not available.
Conversion from bitmap to character code format using an optical character recognition (OCR) process carries a significant cost in terms of time and processing effort. Each bitmap of a character must be distinguished from its neighbors, its appearance analyzed, and, in a decision-making process, identified as a distinct character in a predetermined set of characters. As examples of OCR techniques, U.S. Pat. No. 4,864,628 to Scott discloses a method for reading data which circumnavigates a character image. U.S. Pat. No. 4,326,190 to Borland et al. teaches a character feature detection system for reading alphanumeric characters. In addition, U.S. Pat. No. 4,956,869 to Miyatake et al. suggests a more efficient method for tracing contour lines to prepare contour coordinates of a figure within an image consisting of a plurality of lines.
When the electronic document has been derived by scanning an original, however, image quality and noise in its reproduction contribute to uncertainty in the actual appearance of the bitmap. A degraded bitmap appearance may be caused by an original document of poor quality, by scanning error, or by similar factors affecting the digitized representation of the image. Therefore, the decision process employed in identifying a character has an inherent uncertainty about it. A particular problem in this regard is the tendency of characters in text to blur, or merge. Most character identifying processes commence with an assumption that a character is an independent set of connected pixels. When this assumption fails, due to the quality of the input image, character identification also fails.
A variety of attempts have been made to improve character detection. U.S. Pat. No. 4,926,490 to Mano discloses a method and apparatus for recognizing skewed characters on a document. A rectangle is created around each character image, oriented with the detection orientation rather than the image orientation, and position data for each rectangle is stored in a table. The rectangle is created by detecting a character's outline. U.S. Pat. No. 4,558,461 to Schlang discloses a text line bounding system wherein skewed text is adjusted by analyzing vertical patches of a document. After the skew has been determined, each text line is bounded by determining a top, bottom, left, and right boundary of the text line. U.S. Pat. No. 3,295,105 to Gray et al. discloses a scan controller for normalizing a character in a character recognition apparatus wherein a character is analyzed by determining certain character characteristics including top, bottom, right and left character boundaries. U.S. Pat. No. 4,918,740 to Ross discloses a processing means for use in an optical character recognition system wherein sub-line information is used to analyze a character and identify it. U.S. Pat. No. 4,949,392 to Barski et al. discloses a document recognition system which recognizes an unknown document form by comparison against a library of templates, thus allowing for the intelligent association of text characters in certain locations of the unknown document to aid in the recognition thereof. U.S. Pat. No. 5,142,589 to Lougheed et al. discloses a system for repairing digital images of broken characters which first dilates the character strokes to fill small gaps therein and then erodes the image to conform to the original strokes, thereby producing recognizable characters before separation into individual digits for recognition. U.S. Pat. No. 5,214,719 to Budd et al. teaches a character recognition system and method for teaching and recognizing characters. The method obtains an image, identifies a character, samples the character, and then does a vector correlation of the sample points to stored points of known characters to recognize the character.
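The dilate-then-erode repair of broken characters described above for Lougheed et al. resembles what image processing texts call morphological closing. The following is a minimal sketch of that general technique, not the patented implementation: the 3x3 window, the nested-list binary image (1 = ink), the border handling, and the function names `dilate`, `erode`, and `close_gaps` are all assumptions made for illustration.

```python
def _window(img, y, x):
    """Return the in-bounds pixels of the 3x3 window centred on (y, x)."""
    h, w = len(img), len(img[0])
    return [img[ny][nx]
            for ny in range(max(0, y - 1), min(h, y + 2))
            for nx in range(max(0, x - 1), min(w, x + 2))]


def dilate(img):
    """Set a pixel if any pixel in its 3x3 window is ink (thickens strokes)."""
    return [[int(any(_window(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def erode(img):
    """Keep a pixel only if every in-bounds pixel in its window is ink."""
    return [[int(all(_window(img, y, x))) for x in range(len(img[0]))]
            for y in range(len(img))]


def close_gaps(img):
    """Dilate, then erode: small gaps fill in, stroke extent is restored."""
    return erode(dilate(img))


# A horizontal stroke with a one-pixel break at column 4 (hypothetical data).
broken = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
repaired = close_gaps(broken)
```

In this small example the break in the stroke is filled while the stroke keeps its original length and thickness. Gaps wider than the window would not be bridged, and ink touching the image border is treated leniently by this sketch.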
OCR methods have sought to segment images in various fashions. For example, U.S. Pat. No. 4,558,461 to Schlang suggests a text line bounding system for nonmechanically adjusting for skewed text in scanned text. The skew angle of the text is established, following which the text lines are statistically bounded. The actual text data is then rotated according to the orientation established for conventional processing. U.S. Pat. No. 4,809,344 to Peppers et al. teaches preprocessing for character recognition so as to obtain the data necessary for character recognition. Page segmentation is performed by simultaneously extracting a plurality of features, so that separation between lines, separation between characters, and separation between the lines and the characters are performed simultaneously. The calculation time for normalizing the separated individual characters can thereby be reduced, so that the preprocessing required for character recognition is performed systematically at high speed.
OCR methods have sought to improve reliability by use of dictionary word verification methods, such as that described in U.S. Pat. No. 4,010,445 to Hoshino. However, the underlying problem of accurate detection of each character in a character string remains. The article "F6365 Japanese Document Reader," Fujitsu Sci. Tech. J., 26, 3, pp. 224-233 (October 1990), shows a character reader using the steps of block extraction, skew adjustment, block division, adjacent character segmentation, line extractions, and character recognition by pattern matching, with dictionary checking, and comparison.
It might be desirable to identify a set of characters forming a word or character string as such, as shown, for example, in U.S. Pat. No. 2,905,927 to Reed, in which, for a text string, a set of three scans across the text, parallel to its reading orientation, is employed, each scan deriving information about transitions from black to white across the scan. U.S. Pat. No. 4,155,072 to Kawa suggests a similar arrangement, operable to produce a set of values representative of the leading and trailing edges of the character.
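The idea of scanning across a text string and recording black-to-white transitions can be illustrated with a short sketch. It is an assumption-laden illustration rather than a reconstruction of the Reed or Kawa arrangements: the source does not say where the three scans are placed, so the row fractions below, the binary image layout (1 = black), and the names `black_to_white_transitions` and `three_scan_signature` are hypothetical.

```python
def black_to_white_transitions(row):
    """Return the column indices at which a black pixel is followed by white."""
    return [x for x in range(1, len(row)) if row[x - 1] == 1 and row[x] == 0]


def three_scan_signature(img, fractions=(0.25, 0.5, 0.75)):
    """Scan three rows, parallel to the reading direction, for transitions.

    The scan heights are an assumption; they are given as fractions of the
    image height and clamped to the last row.
    """
    h = len(img)
    rows = [min(h - 1, int(f * h)) for f in fractions]
    return [black_to_white_transitions(img[r]) for r in rows]


# Hypothetical 4-row bitmap of a text fragment (1 = black, 0 = white).
img = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
signature = three_scan_signature(img)
```

Each scan yields a list of transition positions, so the word is summarized by three short lists rather than by its individual characters.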
In addition to OCR systems operating on printed or typed textual images, numerous references deal with recognition of handwritten text which has been converted into an electronic representation. U.S. Pat. No. 4,731,857 to Tappert shows processing a word with the segmentation and recognition steps combined into an overall scheme. U.S. Pat. No. 4,764,972 to Yoshida et al. suggests a recognition system for recognizing a plurality of handwritten characters. U.S. Pat. No. 4,933,977 to Ohnishi et al. discloses a method for identifying a plurality of handwritten connected figures, including identifying and prioritizing branches of the connected figures. Finally, U.S. Pat. No. 5,216,725 to McCubbrey teaches a computer system for mail sorting of hand-addressed envelopes that first calculates an interstroke distance for character strokes within a digitized address and then, using the interstroke distance, groups the strokes into words for further processing.
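Grouping strokes into words by interstroke distance, as described for McCubbrey, can be sketched as follows. The source does not state how the distance is turned into a grouping rule, so the rule used here (start a new word when a gap exceeds a multiple of the median gap) is an assumption, as are the representation of a stroke by its horizontal extent `(left, right)` and the names `group_strokes` and `factor`.

```python
from statistics import median


def group_strokes(strokes, factor=2.0):
    """Group strokes, given as (left, right) x-extents, into words.

    Assumed rule: a gap wider than `factor` times the median gap between
    neighbouring strokes starts a new word.
    """
    strokes = sorted(strokes)
    if not strokes:
        return []
    # Horizontal gap between each stroke and the next; overlaps count as 0.
    gaps = [max(0, nxt[0] - cur[1]) for cur, nxt in zip(strokes, strokes[1:])]
    threshold = factor * median(gaps) if gaps else 0
    words = [[strokes[0]]]
    for gap, stroke in zip(gaps, strokes[1:]):
        if gap > threshold:
            words.append([stroke])    # wide gap: begin a new word
        else:
            words[-1].append(stroke)  # narrow gap: same word
    return words


# Hypothetical stroke extents: three close strokes, a wide gap, two more.
words = group_strokes([(0, 2), (3, 5), (6, 8), (15, 17), (18, 20)])
```

With these extents the gaps are 1, 1, 7, and 1, so the median gap is 1 and only the gap of 7 exceeds the threshold, giving two words.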
The choice of entire words as the basic unit of recognition has also been considered in signature recognition, where no attempt is made to maintain characters as having separate identities. It is suggested by U.S. Pat. No. 3,133,266 to Frishkopf, which still relies on subsequent feature identification methods for identifying characteristics of the image of the character. Signature recognition has also used comparison techniques between samples and known signatures, as shown in U.S. Pat. No. 4,495,644 to Parks et al. and U.S. Pat. No. 4,701,960 to Scott, which suggest that features plotted on x-y coordinates during the signature process can be stored and used for signature verification.
Alternative modes of expressing character recognition are known. U.S. Pat. No. 4,949,281 to Hillenbrand et al. teaches the use of polynomials for generating and reproducing graphic objects, where the objects are predetermined in the form of reference contours in contour coordinates.
Certain signal processing techniques for comparing known signals to unknown signals are available if the word can be expressed in a relatively simple manner. U.S. Pat. No. 4,400,828 to Pirz et al. discloses a spoken word recognizer wherein an input word is recognized from a set of reference words by generating signals representative of the correspondence of an input word and the set of reference words and selecting a closest match. U.S. Pat. No. 4,977,603 to Irie et al. teaches an arrangement for pattern recognition utilizing the multiple similarity method, capable of taking structural features of a pattern to be recognized into account, so that sufficiently accurate pattern recognition can be achieved even when the pattern may involve complicated and diverse variations. "An Efficiently Computable Metric for Comparing Polygon Shapes," by Arkin, Chew, Huttenlocher, Kedem and Mitchell, Proceedings of the First Annual ACM-SIAM Symposium on Discrete Algorithms, January 1990 (pp. 129-137), suggests that metrics can be established for shape matching.
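Selecting a closest match from a set of reference words, once each word has been reduced to a simple signal, can be sketched generically. This is not the method of any reference cited above: the use of equal-length numeric signals, the sum-of-squared-differences measure, and the names `squared_distance` and `closest_match` are assumptions chosen for illustration.

```python
def squared_distance(a, b):
    """Sum of squared differences between two equal-length signals."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    return sum((p - q) ** 2 for p, q in zip(a, b))


def closest_match(signal, references):
    """Return the label of the reference signal nearest to `signal`.

    `references` maps a word label to its (assumed) 1-D signal.
    """
    return min(references,
               key=lambda label: squared_distance(signal, references[label]))


# Hypothetical reference signals and an unknown input signal.
references = {"cat": [1, 3, 2, 1], "dog": [2, 2, 3, 3]}
best = closest_match([1, 3, 2, 2], references)
```

Here the input differs from the "cat" signal by a squared distance of 1 and from the "dog" signal by 4, so "cat" is selected. Any other measure of correspondence could be substituted for `squared_distance` without changing the selection step.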