The present invention is in the field of data processing, and more particularly, relates to optical character recognition.
Optical character recognition (OCR) systems have been developed to permit entry of textual material on a printed page into a data processing system. Such systems typically require the input text to be composed of symbols having a specially designed type font, where the various symbols-to-be-recognized are positioned on a page in accordance with a well-defined set of rules. The symbols-to-be-entered are defined by regions on the page having an optical characteristic in a first range against a background having an optical characteristic in a second range. For example, the optical characteristic may be reflectivity, and a symbol defined by "black" regions against a "white" background.
For a typical OCR system, a page of text-to-be-entered is initially fed into an optical scanning device where contiguous elemental areas (i.e. picture elements, or pixels) in the text are successively scanned in a raster pattern. A video scan data signal is generated which is representative of the reflectivity of the succession of scanned pixels. The OCR system then processes this digital scan data signal to identify, or recognize, the various characters. This recognition processing generally requires a first step of image segmentation, or identification of data fields containing a single character from the data representative of the line of symbols. Conventionally, the OCR systems rely on "white" space between the characters (horizontally) and beneath the lines of symbols (vertically) for effecting character isolation prior to recognition. Then the isolated character is selectively processed to detect various shape features. A number of optical character feature extraction and recognition techniques are known in the art, such as that disclosed in U.S. Pat. No. 3,930,231. An exemplary system which performs optical character recognition is the Model "Typereader 2" System, manufactured by Hendrix Electronics, Inc., Manchester, N.H.
The "blank scan", or "white swath" (WSW), technique is often used for image segmentation. Generally, vertical regions (or swaths), or two or more adjacent regions, of white space are identified to denote an intercharacter boundary region. This WSW technique is usually adequate for all sans-serif type styles, provided there is sufficient horizontal resolution in the image sensing device. By way of example, the Hendrix Typereader 2 provides intercharacter identification, or segmentation, based on detection of a single white vertical swath (approximately eight mils wide, sampled at 4.7 mil intervals). This is generally adequate since 10 pitch OCR characters typed on 12-pitch have at least a minimum width region of white space between characters. Also, for most purposes, the WSW technique is adequate for any 12-pitch type style (e.g. Courier 12) when typed on 10-pitch.
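By way of illustration only, the white swath test may be sketched in Python as follows. The sketch assumes the scanned line is held as a list of pixel rows with 1 for black and 0 for white. The names `to_bits` and `white_swath_segments` and the `min_blank` parameter are hypothetical and are not taken from any of the systems cited above.

```python
def to_bits(rows):
    """Convert rows of '#'/'.' text into rows of 1 (black) / 0 (white)."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def white_swath_segments(image, min_blank=1):
    """Return (start, end) column spans of characters, end exclusive.

    A boundary is declared once `min_blank` adjacent all-white swaths
    (columns) have been seen after a character.
    """
    width = len(image[0])
    is_blank = [all(row[x] == 0 for row in image) for x in range(width)]
    segments = []
    start = None       # first column of the character being collected
    last_black = None  # most recent column containing a black pixel
    blank_run = 0      # number of adjacent blank swaths just seen
    for x in range(width):
        if is_blank[x]:
            blank_run += 1
            if start is not None and blank_run >= min_blank:
                segments.append((start, last_black + 1))
                start = None
        else:
            if start is None:
                start = x
            last_black = x
            blank_run = 0
    if start is not None:  # character running to the end of the line
        segments.append((start, last_black + 1))
    return segments


line = to_bits(["##.#..",
                "#..##."])
print(white_swath_segments(line))               # [(0, 2), (3, 5)]
print(white_swath_segments(line, min_blank=2))  # [(0, 5)]
```

With `min_blank=2` the single white swath between the two images no longer counts as a boundary, which shows how a stricter swath requirement merges closely spaced characters.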
However, conventional serif type styles (e.g. Courier 12 (at 12-pitch), Courier 72, Prestige Pica, Delegate), have certain characters which occupy full pitch bands. Furthermore, while typewritten character images are nominally centered in the pitch band, in practice, well-aligned typewriters are the exception rather than the rule. The generally encountered mis-alignment results in occasional, and sometimes more than occasional, touching between less than full width characters as well. In such cases, the blank scan, or WSW, segmentation technique is generally inadequate, and more sophisticated segmentation techniques are necessary for machine recognition of characters in such text.
A number of segmentation techniques have been developed in the prior art for particular use with complex segmentation problems. These techniques may be referred to by the terms "forced pitch", "blank scan", "edge detection", "stream following", "recognition feedback", and "post-processing".
The "forced pitch" approach was one of the earliest approaches to segmentation of text having touching characters and utilized a "flywheel" approach, where characters were segmented based on a known pitch (e.g. every tenth of an inch). This technique may be adapted for use with proportional pitch spacing, for example, with IBM Executive typewriters, where the basic pitch varies between two and five increments, depending on the character. For example, see J. Rabinow, "The Present State of the Art and Reading Machine", Pattern Recognition, L. Kanal, ed., Thompson Book Company, Washington, D.C. 1968. While this flywheel approach is generally adequate on well adjusted typewriters, it is ineffective where there is character crowding (such as may be due to typewriter misalignment) or where the print quality or sensor optics is sufficiently degraded so that blending of adjacent images occurs. This approach is commonly used in conjunction with one or more of the other segmentation techniques to provide a "last resort" segmentation decision in the event that the other techniques are inconclusive.
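A minimal sketch of the fixed-pitch case, assuming the left edge of the first character and the pitch (expressed in scan columns) are already known. The function name and parameters are hypothetical. As an illustrative figure only, a tenth-of-an-inch pitch sampled at 4.7 mil intervals would correspond to roughly 21.3 columns.

```python
def forced_pitch_cuts(first_edge, line_end, pitch_cols):
    """Return cut columns spaced `pitch_cols` apart, starting one pitch
    after `first_edge` and stopping before `line_end`."""
    cuts = []
    k = 1
    while first_edge + k * pitch_cols < line_end:
        # Multiply rather than accumulate so rounding error does not drift.
        cuts.append(round(first_edge + k * pitch_cols))
        k += 1
    return cuts


print(forced_pitch_cuts(0, 100, 25.0))  # [25, 50, 75]
```

Because the cuts depend only on the assumed pitch and starting edge, any crowding or misalignment in the actual text is ignored, which is the weakness noted above.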
The blank scan technique, like the forced pitch technique, is usually used in conjunction with other procedures. For example, in the blank scan technique, segmentation may be permitted only within a predetermined region, for example, from the midpoint of the pitch band to the end of the pitch band. In U.S. Pat. No. 3,526,876, the WSW approach extends the definition of a blank scan to include a scan containing only one black bit as well. In that patent, segmentation may occur if three such successive scans occur anywhere in the pitch band, or if one such scan is detected in the last quarter of the pitch band (from the start of the character edge detect). A "Serpentine White" technique is a refinement to the basic blank scan technique, whereby non-touching characters can be segmented. This Serpentine White technique requires the detection of a continuous white path (or, "snake") between character images. This snake may be entirely vertical, or may zig-zag from top to bottom of the text stream. This Serpentine technique is effective for separate, and even overlapping, characters. However, it is ineffective with touching characters.
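The Serpentine White test may be sketched as a search for a connected white path from the top row to the bottom row. This is a sketch under stated assumptions rather than the method of any cited reference: the image is a list of rows with 1 for black and 0 for white, the path may step up, down, left, or right (4-connected), and the search is confined to a column window. The function name `has_white_snake` is hypothetical.

```python
from collections import deque


def has_white_snake(image, col_lo, col_hi):
    """Return True if a 4-connected white path joins the top row to the
    bottom row while staying inside columns col_lo..col_hi-1."""
    height = len(image)
    # Seed the search with every white pixel of the top row in the window.
    queue = deque((0, x) for x in range(col_lo, col_hi) if image[0][x] == 0)
    seen = set(queue)
    while queue:
        y, x = queue.popleft()
        if y == height - 1:
            return True  # reached the bottom of the text stream
        for ny, nx in ((y + 1, x), (y, x - 1), (y, x + 1), (y - 1, x)):
            if (0 <= ny < height and col_lo <= nx < col_hi
                    and image[ny][nx] == 0 and (ny, nx) not in seen):
                seen.add((ny, nx))
                queue.append((ny, nx))
    return False


def to_bits(rows):
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


overlapping = to_bits(["##..##",
                       "###.##",
                       "##..##",
                       "##.###"])   # no blank column, but a zig-zag gap
touching = to_bits(["##..##",
                    "######",
                    "##..##",
                    "##.###"])      # the images join in the second row
print(has_white_snake(overlapping, 0, 6))  # True
print(has_white_snake(touching, 0, 6))     # False
```

The first image has no all-white swath, so a plain blank scan would fail, yet the snake is found. The second image shows the stated limitation: once the characters touch, no white path exists.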
The "edge detection" technique requires detection of features relating to leading edge and trailing edge character information. While not applicable in general, this technique may be "tuned" to a particular font, for example, as suggested by Baumgartner, Beuttner, et al., "Left Side Detection Segmentation", IBM Technical Disclosure Bulletin, Vol. 17, No. 2, July, 1974.
A number of relatively easily implementable functions are generally used to indicate leading and trailing edge properties. For example, these functions may typically use arguments of the type:
$(B)_i = \sum_j B_i(j)$ : sum of black bits in swath $i$
$(BB)_{i,i+1} = \sum_j B_i(j)\,B_{i+1}(j)$ : sum of adjacent black bits per swath pair
$(BW)_{i,i+1} = \sum_j B_i(j)\,W_{i+1}(j)$ : sum of adjacent black-white pairs per swath pair
$(WB)_{i,i+1} = \sum_j W_i(j)\,B_{i+1}(j)$ : sum of adjacent white-black pairs per swath pair
Segmentation occurs at some extremum (minimum or maximum) in the sequence of values over a regime. For example, the $(B)_i$ function may be used simply to identify the swath having the fewest black pixels as the segmentation point. Variations on this theme include the use of alternate columns to form $(BB)_{i,i+2}$ and $(BW)_{i,i+2}$, and the exclusion of certain rows from the computation (e.g. the top and bottom three rows).
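The four swath functions listed above may be computed directly from a binary image, as in the following sketch. It assumes that B_i(j) is 1 when the pixel in row j of swath i is black and 0 otherwise, and that W_i(j) = 1 - B_i(j); neither definition is stated explicitly in the text. The function names are hypothetical.

```python
def swath_features(image):
    """Return the lists B, BB, BW, WB for a binary image (rows of 0/1).

    B has one value per swath (column); BB, BW and WB have one value per
    adjacent swath pair (i, i+1).
    """
    height, width = len(image), len(image[0])
    cols = [[image[j][i] for j in range(height)] for i in range(width)]
    B = [sum(col) for col in cols]
    BB, BW, WB = [], [], []
    for left, right in zip(cols, cols[1:]):
        BB.append(sum(p * q for p, q in zip(left, right)))
        BW.append(sum(p * (1 - q) for p, q in zip(left, right)))
        WB.append(sum((1 - p) * q for p, q in zip(left, right)))
    return B, BB, BW, WB


def min_black_swath(B, lo, hi):
    """Return the first swath in lo..hi-1 having the fewest black pixels."""
    return min(range(lo, hi), key=lambda i: B[i])


image = [[1, 1, 0, 1, 1],
         [1, 1, 1, 1, 1],
         [1, 0, 0, 1, 1]]
B, BB, BW, WB = swath_features(image)
print(B)                         # [3, 2, 1, 3, 3]
print(BB)                        # [2, 1, 1, 3]
print(min_black_swath(B, 0, 5))  # 2
```

Here the minimum of B picks swath 2, the thin bridge between the two images, as the candidate segmentation point.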
Another class of functions for the edge detection technique is based on contour or profile/height information. For example, top and bottom contours (profiles from above and below) may be used so that the extrema are selected as potential segmentation points.
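One possible reading of the contour approach is sketched below. The profile depth conventions and the rule used to pick extrema (local maxima of depth, i.e. notches in the outline) are assumptions made for illustration, and the function names are hypothetical.

```python
def contours(image):
    """Return (top, bottom) profiles for a binary image (rows of 0/1).

    top[x] is the number of white pixels above the first black pixel in
    column x; bottom[x] is the number below the last black pixel. An
    all-white column gets the full image height in both profiles.
    """
    height, width = len(image), len(image[0])
    top, bottom = [], []
    for x in range(width):
        black_rows = [y for y in range(height) if image[y][x]]
        if black_rows:
            top.append(black_rows[0])
            bottom.append(height - 1 - black_rows[-1])
        else:
            top.append(height)
            bottom.append(height)
    return top, bottom


def profile_notches(profile):
    """Return interior columns that are deeper than the left neighbour
    and at least as deep as the right neighbour (local depth maxima)."""
    return [x for x in range(1, len(profile) - 1)
            if profile[x] > profile[x - 1] and profile[x] >= profile[x + 1]]


image = [[1, 0, 0, 0, 1],
         [1, 1, 0, 1, 1],
         [1, 1, 1, 1, 1]]
top, bottom = contours(image)
print(top)                   # [0, 1, 2, 1, 0]
print(profile_notches(top))  # [2]
```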
The "stream following" technique is based on detection of the "ends" of black horizontal regions or "streams", as disclosed in U.S. Pat. No. 4,083,034. In another related approach, the number of streams per swath is tracked, with the resulting sequences then compared against a set of stored patterns for a particular font, as disclosed in Hoffman and McCullogh, "Segmentation Methods for Recognition of Machine Printed Characters", IBM Journal of Research and Development, Vol. 15, March, 1971.
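The stream count per swath may be illustrated as follows. The sketch assumes that the number of streams crossing a swath equals the number of separate vertical runs of black pixels in that column; this interpretation, and the function name, are assumptions and are not drawn from the cited references. Comparison against stored font patterns is not shown.

```python
def streams_per_swath(image):
    """Return, for each column of a binary image (rows of 0/1), the
    number of separate runs of black pixels in that column."""
    height, width = len(image), len(image[0])
    counts = []
    for x in range(width):
        runs = 0
        previous = 0
        for y in range(height):
            if image[y][x] and not previous:
                runs += 1  # a new black run starts at this row
            previous = image[y][x]
        counts.append(runs)
    return counts


image = [[1, 0, 1],
         [0, 1, 1],
         [1, 0, 1]]
print(streams_per_swath(image))  # [2, 1, 1]
```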
The "recognition feedback" technique, as described in U.S. Pat. No. 4,003,023, utilizes a recognition logic network which in effect samples recognition results at uniform intervals, to provide a code sequence consisting of the character codes and reject codes. Groupings of like codes indicate properly aligned characters within a window. For example, if the resulting string were EEEH**NMMM, the probable character sequence might be EM.
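The grouping step in this example may be sketched as below. The sketch assumes that `*` denotes a reject code and that a run of at least `min_run` identical codes is taken as a properly aligned character; the threshold, its default value, and the function name are hypothetical.

```python
from itertools import groupby


def decode_feedback(codes, min_run=2, reject="*"):
    """Keep one character for each run of at least `min_run` identical,
    non-reject codes in the sampled recognition string."""
    return "".join(
        code
        for code, run in groupby(codes)
        if code != reject and len(list(run)) >= min_run
    )


print(decode_feedback("EEEH**NMMM"))  # EM
```

The isolated H and N are treated as spurious results obtained while the window straddled two characters. A sketch this simple cannot distinguish a doubled letter from one wide grouping, so a practical system would need additional spacing information.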
The post-processing technique disclosed by Rosenbaum & Hilliard, "Multifont OCR Post Processing System", IBM Journal of Research and Development, Vol. 19, No. 4, July, 1975, is designed to deal with:
"horizontal splitting"--division of abnormally wide characters into two pieces (e.g. m → rn)
"catenation"--combination of two characters into one (e.g. rn → m)
"crowding"--excess overlap of two characters, causing truncation of one of the images of the pair.
The basic approach in these cases is to take advantage of a dictionary to correct mis-spelled words.
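A generic dictionary lookup of this kind may be sketched as follows. This is not the procedure of the cited paper, whose details are not given here; the confusion pairs, the word list, and the function name are all hypothetical and chosen only to mirror the m/rn example above.

```python
def correct_word(word, dictionary, confusions=(("rn", "m"), ("m", "rn"))):
    """Return `word` if it is in the dictionary; otherwise try a single
    substitution from `confusions` and return the first candidate found
    in the dictionary. Fall back to the unchanged word."""
    if word in dictionary:
        return word
    for seen, intended in confusions:
        idx = word.find(seen)
        while idx != -1:
            candidate = word[:idx] + intended + word[idx + len(seen):]
            if candidate in dictionary:
                return candidate
            idx = word.find(seen, idx + 1)
    return word


words = {"modern", "comer"}
print(correct_word("modem", words))   # modern  (catenation: rn read as m)
print(correct_word("corner", words))  # comer   (splitting: m read as rn)
```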
While each of the above techniques known in the prior art does provide some measure of effective segmentation for characters in optical character recognition systems, none of these techniques provides a full and effective approach. One of the principal problems of these prior art approaches is the general characteristic that overlapping characters are processed while, in effect, "throwing away" overlapped portions of the pattern.
It is an object of the present invention to provide an improved segmentation processor and method for use in optical character recognition systems.