Text in electronically encoded documents (electronic documents) tends to be found in either of two formats, each distinct from the other. In a first format, the text may be in a bitmap format, in which text is defined only in terms of an array of image data or pixels, essentially indistinguishable from adjacent images which are similarly represented. In this format, text is generally incapable of being subjected to processing by a computer based on textual content alone. In a second format, hereinafter referred to as a character code format, the text is represented as a string of character codes (e.g. ASCII code). In the character code format, the image or bitmap of the text is not available.
Conversion from bitmap to character code format using an optical character recognition (OCR) process carries a significant cost in terms of time and processing effort. Each bitmap of a character must be distinguished from its neighbors, its appearance analyzed, and in a decision-making process, identified as a distinct character in a predetermined set of characters. For example, U.S. Pat. No. 4,864,628 to Scott discloses a method for reading data which circumnavigates a character image. Data representative of the periphery of the character is read to produce a set of character parameters which are then used to compare the character against a set of reference parameters and identify the character. U.S. Pat. No. 4,326,190 to Borland et al. teaches a character feature detection system for reading alphanumeric characters. A digitized binary image is used, and character images are traced from boundary point to boundary point, wherein the transitions are defined by one of eight equally divergent vectors. Character features are subsequently extracted from the vector data to form a feature set. The feature set is then analyzed to form a set of secondary features which are used to identify the character. U.S. Pat. No. 4,813,078 to Fujiwara et al. discloses a character recognition apparatus employing a similar process, where picture change points are identified and accumulated according to direction and background density, and are used to enable more accurate identification of characters which are generally erroneously recognized. Furthermore, U.S. Pat. No. 4,833,721 to Okutomi et al. teaches a similar system, operating on character outlines, which may be employed as a man/machine interface for an electronic apparatus.
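By way of illustration only, the idea of describing a traced boundary by one of eight equally divergent vectors, as attributed above to Borland et al., may be sketched as a conventional eight-direction chain code. The direction numbering, the function name chain_code and the sample boundary are assumptions made for this sketch and are not taken from the cited reference.

```python
# Eight unit steps spaced 45 degrees apart. The numbering 0..7 is an
# assumption for this sketch, not the numbering used by any cited reference.
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1),
              (-1, 0), (-1, -1), (0, -1), (1, -1)]


def chain_code(boundary):
    """Encode successive boundary points as eight-direction codes.

    boundary is a list of (x, y) points in which each point is an
    8-neighbor of the previous one. A ValueError is raised otherwise.
    """
    codes = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:]):
        step = (x1 - x0, y1 - y0)
        codes.append(DIRECTIONS.index(step))
    return codes


# A unit square traced from the origin and back again.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
codes = chain_code(square)
```

A feature set of the kind described would then be derived from such a code sequence; that later stage is not sketched here.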
Additional references which describe alternative methods and apparatus for identification of characters within a digitized image are: U.S. Pat. No. 3,755,780 to Sammon et al., which teaches a method for recognizing characters by the number, position and shape of alternating contour convexities as viewed from two sides of the character; U.S. Pat. No. 3,899,771 to Saraga et al., which teaches the use of linear traverse employing shifted edge lines for character recognition; U.S. Pat. No. 4,817,166 to Gonzales et al., which teaches the application of character recognition techniques in an apparatus for reading a license plate which includes a character alignment section and a correction section; and U.S. Pat. No. 4,566,128 to Araki, which discloses a method for compressing character image data using a divided character image to recognize and classify contours, enabling the compressed storage of the character image as a group of closed-loop line segments. In addition, U.S. Pat. No. 4,956,869 to Miyatake et al. suggests a more efficient method for tracing contour lines to prepare contour coordinates of a figure within an image consisting of a plurality of lines.
When the electronic document has been derived by scanning an original, however, image quality and noise in its reproduction contribute to uncertainty in the actual appearance of the bitmap. A degraded bitmap appearance may be caused by an original document of poor quality, by scanning error, or by similar factors affecting the digitized representation of the image. Therefore, the decision process employed in identifying a character has an inherent uncertainty about it. A particular problem in this regard is the tendency of characters in text to blur, or merge. Most character identifying processes commence with an assumption that a character is an independent set of connected pixels. When this assumption fails, due to the quality of the input image, character identification also fails. A variety of attempts have been made to improve character detection. U.S. Pat. No. 4,926,490 to Mano discloses a method and apparatus for recognizing characters on a document wherein characters of a skewed document are recognized. A rectangle is created around each character image, oriented with the detection orientation rather than the image orientation, and position data for each rectangle is stored in a table. The rectangle is created by detecting a character's outline. U.S. Pat. No. 4,558,461 to Schlang discloses a text line bounding system wherein skewed text is adjusted by analyzing vertical patches of a document. After the skew has been determined, each text line is bounded by determining a top, bottom, left, and right boundary of the text line. U.S. Pat. No. 3,295,105 to Gray et al. discloses a scan controller for normalizing a character in a character recognition apparatus wherein a character is analyzed by determining certain character characteristics including top, bottom, right and left character boundaries. U.S. Pat. No. 4,918,740 to Ross discloses a processing means for use in an optical character recognition system wherein sub-line information is used to analyze a character and identify it. U.S. Pat. No. 4,558,461 to Schlang suggests a text line bounding system for nonmechanically adjusting for skewed text in scanned text. The skew angle of the text is established, following which the text lines are statistically bounded. The actual text data is then rotated according to the orientation established for conventional processing. U.S. Pat. No. 4,809,344 to Peppers et al. teaches preprocessing of character recognition so as to obtain data necessary for character recognition. Page segmentation is performed by simultaneously extracting a plurality of features, so that separation between lines, separation between characters, and separation between the lines and the characters are performed simultaneously. The calculation time for normalizing the separated individual characters can thereby be reduced, so that the preprocessing required for character recognition is performed systematically at high speed.
OCR methods have sought to improve reliability by use of dictionary word verification methods, such as described in U.S. Pat. No. 4,010,445 to Hoshino. However, the underlying problem of accurate character detection of each character in a character string remains. The article "F6365 Japanese Document Reader" Fujitsu Sci. Tech. J., 26, 3, pp. 224-233 (October 1990) shows a character reader using the steps of block extraction, skew adjustment, block division, adjacent character segmentation, line extractions, and character recognition by pattern matching, with dictionary checking, and comparison.
It might be desirable to identify a set of characters forming a word or character string as such, as shown, for example, in U.S. Pat. No. 2,905,927 to Reed, in which, for a text string, a set of three scans across the text, parallel to its reading orientation, is employed, each scan deriving information about transitions from black to white across the scan. When values derived from the three scans are reviewed, the information derived from the combination of three scans forms a unique identifier for a word that may then be compared to preset values for identification purposes. Two problems are noted with this method: first, that the image information or bitmap is lost in the conversion, and second, that the process is rather gross in nature and depends heavily upon the uniform nature of the character in the image scanned. Loss of the image bitmap is a characteristic of the conversion of a bitmap containing textual information to representative character codes. U.S. Pat. No. 4,155,072 to Kawa suggests a similar arrangement, operable to produce a set of values representative of the leading and trailing edges of the character. From this information a quadratic correlation function is used for comparison to standard character patterns.
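The three-scan arrangement described above for Reed may be illustrated, in a much simplified form, by counting black-to-white transitions along three rows of a word bitmap and treating the three counts as an identifier. The reference is not quoted as specifying how the scan values are combined, so the tuple of counts, the function names and the toy bitmap below are assumptions made solely for illustration.

```python
def count_transitions(row):
    """Count black-to-white transitions along one scan line.

    Pixels are 1 for black and 0 for white (an assumed convention).
    """
    return sum(1 for a, b in zip(row, row[1:]) if a == 1 and b == 0)


def word_signature(bitmap, scan_rows):
    """Return one transition count per scan line, as a tuple."""
    return tuple(count_transitions(bitmap[r]) for r in scan_rows)


# Toy 5 x 8 bitmap standing in for the image of a word.
bitmap = [
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
]

# Three scans parallel to the reading direction: top, middle, bottom.
signature = word_signature(bitmap, scan_rows=(0, 2, 4))

# Hypothetical table of preset values against which a signature is compared.
preset = {(2, 3, 1): "example-word"}
identified = preset.get(signature)
```

The sketch also makes the first noted drawback concrete: once the bitmap is reduced to three counts, the image itself cannot be recovered from the signature.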
In addition to an OCR system operating on printed or typed textual images, numerous references deal with recognition of handwritten text which has been converted into an electronic representation. U.S. Pat. No. 4,731,857 to Tappert shows processing a word with the segmentation and recognition steps combined into an overall scheme. This is accomplished by a three step procedure. First, potential or trial segmentation points are derived. Second, all combinations of the segments that could reasonably be a character are sent to a character recognizer to obtain ranked choices and corresponding scores. Finally, the recognition results are sorted and combined so that the character sequences having the best cumulative scores are obtained as the best word choices. U.S. Pat. No. 4,764,972 to Yoshida et al. suggests a recognition system for recognizing a plurality of handwritten characters. A first memory is used to store isolated characters, and a second memory is used to store information, including interstroke character information, for connecting isolated characters. Finally, U.S. Pat. No. 4,933,977 to Ohnishi et al. discloses a method for identifying a plurality of handwritten connected figures, including identifying and prioritizing branches of the connected figures. Branches having the lowest priority within a recognition block are erased until a recognizable figure is obtained. From the recognition block extends a second block which is analyzed in the same fashion until a second figure is recognized.
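The three step procedure described above for Tappert can be sketched, under stated assumptions, as a search over trial segmentation points in which each run of adjacent segments is offered to a recognizer and the partial results with the best cumulative scores are retained. The recognizer here is a hypothetical lookup table, and the names best_words, max_span and top_n, the integer scores and the pruning rule are all assumptions of this sketch rather than details of the cited reference.

```python
def best_words(num_segments, recognize, max_span=3, top_n=3):
    """Combine segmentation and recognition into ranked word choices.

    num_segments: number of elementary segments lying between the trial
        segmentation points.
    recognize(i, j): hypothetical recognizer returning a list of
        (character, score) choices for segments i .. j-1 taken together.
    Returns up to top_n (cumulative_score, text) pairs, best first.
    """
    # paths[k] holds the best partial results covering the first k segments.
    paths = {0: [(0, "")]}
    for end in range(1, num_segments + 1):
        candidates = []
        for start in range(max(0, end - max_span), end):
            for score_so_far, text in paths[start]:
                for char, score in recognize(start, end):
                    candidates.append((score_so_far + score, text + char))
        # Sort by cumulative score and keep only the best few choices.
        candidates.sort(key=lambda c: c[0], reverse=True)
        paths[end] = candidates[:top_n]
    return paths[num_segments]


# Toy recognizer: the first two segments read either as "c" then "l",
# or together as "d". The scores are arbitrary illustrative values.
TABLE = {
    (0, 1): [("c", 6)],
    (1, 2): [("l", 5)],
    (0, 2): [("d", 14)],
    (2, 3): [("o", 8)],
}


def toy_recognize(i, j):
    return TABLE.get((i, j), [])


choices = best_words(3, toy_recognize)
```

A plain sum of scores tends to favor readings with more characters; any normalization of the cumulative score is left out of this sketch.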
The choice of entire words as the basic unit of recognition has also been considered in signature recognition, where no attempt is made to maintain characters as having separate identities, and is suggested by U.S. Pat. No. 3,133,266 to Frishkopf, which still relies on subsequent feature identification methods for identifying characteristics of the image of the character. Signature recognition has also used comparison techniques between samples and known signatures, as shown in U.S. Pat. No. 4,495,644 to Parks et al. and U.S. Pat. No. 4,701,960 to Scott, which suggest that features plotted on x-y coordinates during the signature process can be stored and used for signature verification.
U.S. Pat. No. 4,499,499 to Brickman et al. suggests a method of image compression in which the bitmap representation of a word is compared to a bitmap representation dictionary through superposition of the detected word over the stored word to derive a difference value which is compared to a reference value indicating a degree of certainty of a match. Neither OCR methods, which seek to encode a bitmap into characters processable as information by computer, nor bitmap methods for manipulation of images have proven completely satisfactory for all purposes of text manipulation or processing.
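The superposition comparison described above for Brickman et al. may be pictured, as a minimal sketch only, by overlaying two bitmaps of equal size, counting the pixels at which they disagree, and accepting the closest dictionary entry when that count does not exceed a reference value. The pixel-count difference measure, the function names and the handling of unequal sizes are assumptions of this sketch and are not drawn from the cited reference.

```python
def same_shape(a, b):
    """True when two bitmaps have identical row and column counts."""
    return len(a) == len(b) and all(len(r) == len(s) for r, s in zip(a, b))


def difference_value(word, reference):
    """Count pixels that disagree when two equal-size bitmaps are overlaid."""
    return sum(p != q
               for word_row, ref_row in zip(word, reference)
               for p, q in zip(word_row, ref_row))


def best_match(word, dictionary, max_difference):
    """Return the key of the closest stored bitmap, or None.

    Entries of a different size are skipped (an assumed simplification).
    A match is reported only if its difference value does not exceed
    max_difference, which plays the role of the reference value.
    """
    best_key, best_diff = None, None
    for key, reference in dictionary.items():
        if not same_shape(word, reference):
            continue
        diff = difference_value(word, reference)
        if best_diff is None or diff < best_diff:
            best_key, best_diff = key, diff
    if best_diff is not None and best_diff <= max_difference:
        return best_key
    return None


detected = [[1, 0],
            [1, 1]]
stored = {
    "a": [[1, 0], [1, 0]],
    "b": [[0, 1], [0, 0]],
}
match = best_match(detected, stored, max_difference=1)
```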
In U.S. patent application Ser. No. 07/459,026, filed Dec. 29, 1989, now U.S. Pat. No. 5,167,016, entitled "Changing Characters in an Image", by Bagley et al., a method is shown for changing characters in text appearing in an image. The character to be changed is identified and, if the changed version of the image includes a character not in the text prior to the change, a shape comparing process is used to identify a word containing the newly required character, copy the character, and insert it into its new position. In U.S. patent application Ser. No. 07/459,022, filed Dec. 29, 1989, abandoned, entitled "Editing Text in an Image", by Bagley et al., a method is shown for identifying and changing characters in text appearing in an image.
Alternative modes of expressing character recognition are known, such as U.S. Pat. No. 3,755,780 to Sammon et al., which discloses a method of recognizing characters wherein a shape of the character is represented by the number, position and shape of the character's contours. The number and position of the contour allow each character to be sorted according to these values. U.S. Pat. No. 4,903,312 to Sato discloses a character recognition system with variable subdivisions of a character region wherein a character is read to form a binary image. The binary image is then assigned a plurality of directionality codes which define a contour of the binary image. The binary image is then divided into a number of subregions, each of which has an equal number of directionality codes. A histogram of the directionality codes is calculated for each subregion. The histogram of the binary image is then compared with a number of known character contour histograms. Also, U.S. Pat. No. 4,949,281 to Hillenbrand et al. teaches the use of polynomials for generating and reproducing graphic objects, where the objects are predetermined in the form of reference contours in contour coordinates. Individual characters are represented as a linear field of outside contours which may be filtered, smoothed, and corner recognized before being broken into curve segments. Subsequently, the character is stored as a series of contour segments, each segment having starting points, base points and associated reference contours.
Certain signal processing techniques for comparing known signals to unknown signals are available if the word can be expressed in a relatively simple manner. U.S. Pat. No. 4,400,828 to Pirz et al. discloses a spoken word recognizer wherein an input word is recognized from a set of reference words by generating signals representative of the correspondence of an input word and the set of reference words and selecting a closest match. The word recognizer is used with a speech analysis system. A normalization and linear time warp device is disclosed. The input word and the set of reference words are processed electrically to determine correspondence. U.S. Pat. No. 4,977,603 to Irie et al. teaches an arrangement for pattern recognition utilizing the multiple similarity method, capable of taking structural features of a pattern to be recognized into account, so that sufficiently accurate pattern recognition can be achieved even when the pattern may involve complicated and diverse variations. The method includes the steps of: counting a number of occurrences, within each one of the localized regions which subdivide a pattern to be recognized, of local patterns indicating possible arrangements of picture elements; deriving a vector quantity indicating the distribution of black picture elements which constitute the pattern, from the numbers of occurrences of the local patterns; calculating multiple similarity, defined in terms of the square of the inner product of the vector quantity and one of the prescribed standard vectors representing standard patterns; and recognizing the pattern by identifying the pattern with the one of the standard patterns whose corresponding standard vector gives the maximum value for the multiple similarity. "An Efficiently Computable Metric for Comparing Polygon Shapes," by Arkin, Chew, Huttenlocher, Kedem and Mitchell, Proceedings of the First Annual ACM-SIAM Symposium on Discrete Mathematics, January 1990 (pp. 129-137) suggests that metrics can be established for shape matching.
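The final two steps recited above for Irie et al., namely calculating a similarity from the square of an inner product and selecting the standard pattern giving the maximum value, may be sketched as follows. The text above states only that the similarity is defined in terms of the squared inner product; the normalization by the squared length of the feature vector, the summation over several standard vectors per pattern, and all names and numeric values below are assumptions made for this sketch.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def multiple_similarity(feature, standard_vectors):
    """Sum of squared inner products, normalized by the feature's length.

    standard_vectors is assumed to hold unit-length vectors for one
    standard pattern. Normalizing by |feature|^2 is an assumption here.
    """
    norm_sq = dot(feature, feature)
    if norm_sq == 0:
        return 0.0
    return sum(dot(feature, s) ** 2 for s in standard_vectors) / norm_sq


def recognize(feature, standards):
    """Return the name of the standard pattern with maximum similarity."""
    return max(standards,
               key=lambda name: multiple_similarity(feature, standards[name]))


# Hypothetical standard patterns, each with a single unit-length vector.
standards = {
    "I": [(0.0, 1.0, 0.0)],
    "L": [(0.0, 0.6, 0.8)],
}

# Hypothetical vector quantity derived from local-pattern counts.
label = recognize((0.1, 0.5, 0.6), standards)
```

The earlier steps, in which local patterns are counted within localized regions to form the vector quantity, are not sketched here.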
All of the references cited herein and above are incorporated by reference for their teachings.