The present invention is in the field of data processing, and more particularly, relates to optical character recognition.
Optical character recognition (OCR) systems have been developed to permit entry of textual material on a printed page into a data processing system. Such systems typically require the input text to be composed of symbols having a specially designed type font, where the various symbols-to-be-recognized are positioned on a page in accordance with a well-defined set of rules. The symbols-to-be-entered are defined by regions on the page having an optical characteristic in a first range against a background having an optical characteristic in a second range. For example, the optical characteristic may be reflectivity, with a symbol being defined by "black" regions against a "white" background.
For a typical OCR system, a page of text-to-be-entered is initially fed into an optical scanning device where contiguous elemental areas (i.e., picture elements, or pixels) in the text are successively scanned in a raster pattern. A video scan data signal is generated which is representative of the reflectivity of the succession of scanned pixels. The OCR system then processes this scan data signal to identify, or recognize, the various characters. This recognition processing generally requires a first step of image segmentation, that is, identification of a data field containing a single character from the line of symbols. Conventionally, OCR systems rely on "white" space between the characters (horizontally) and between lines of symbols (vertically) for effecting character isolation prior to recognition. The isolated character is then selectively processed to detect various shape features. A number of optical character feature extraction and recognition techniques are known in the art, such as that disclosed in U.S. Pat. No. 3,930,231. An exemplary system which performs optical character recognition is the Model "Typereader 2", manufactured by Hendrix Electronics, Inc., Manchester, N.H.
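By way of illustration only, the white-space-based horizontal isolation described above may be sketched as follows. The sketch assumes that a line of symbols has already been thresholded into a binary pixel array (1 for "black", 0 for "white"); the function name segment_columns and the list-of-rows representation are hypothetical and are not taken from any referenced system.

```python
def segment_columns(image):
    """Return (start, end) column spans of characters in a binary line image.

    image is a list of rows; each row is a list of 0 ("white") or
    1 ("black") pixels. A span ends wherever an all-white column occurs.
    """
    if not image:
        return []
    width = len(image[0])
    # A column contains ink if any pixel in that column is black.
    inked = [any(row[x] for row in image) for x in range(width)]

    spans = []
    start = None
    for x, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = x                      # entering a character
        elif not has_ink and start is not None:
            spans.append((start, x))       # white column closes the character
            start = None
    if start is not None:
        spans.append((start, width))       # character runs to the right edge
    return spans


line = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 1],
]
print(segment_columns(line))
```

An underline spanning several characters leaves no all-white column between them, so a sketch of this kind would return a single span for the whole underlined run.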
Generally, the prior art OCR systems do not permit the use of underline characters together with a text character, i.e. a text character having a horizontal line positioned beneath the text character. One reason for this limitation is that the normal mode of inputting textual material for the OCR system is by typewriter, and in view of the nominal tolerances of the typewriters, variations in the relative positions of the characters in the input text permit touching of an underline character with a text character. Such touching may occur between a text character and its associated underline character, or between a text character and the underline character associated with a text character in the preceding line of symbols in the text. The composite symbol resulting from the touching text character and underline character is generally not recognizable to the OCR system.
The prior art OCR systems that do process input textual material having underline characters require sufficient "white" space between lines of symbols so that scan data representative of horizontal "black" regions (of an underline character) bounded by "white" from below can be detected and stripped prior to (or ignored during) the character recognition process. While such systems are effective provided that no touching occurs, the utility of OCR systems with underline capability is severely limited in practice because of the tolerance limitations of the input typewriters. Furthermore, the prior art OCR systems providing underline detection require relatively large storage capability to accommodate the data representative of the text, which may be processed off-line.
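By way of illustration only, the stripping of horizontal "black" regions bounded by "white" from below may be sketched as follows. This is a simplified sketch under stated assumptions, not the method of any particular prior art system: the image is assumed to be a binary pixel array (1 for "black", 0 for "white"), the function name strip_underlines and the parameter min_run (a minimum run length in pixels) are hypothetical, and only the bottom pixel row of an underline is cleared in a single pass.

```python
def strip_underlines(image, min_run=4):
    """Clear long horizontal black runs that have only white directly below.

    image is a list of rows of 0 ("white") / 1 ("black") pixels.
    min_run is an assumed minimum run length, in pixels, for a run to be
    treated as part of an underline. The bottom edge of the image is
    treated as white. A new image is returned; the input is not modified.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    out = [row[:] for row in image]

    for y in range(height):
        x = 0
        while x < width:
            if image[y][x] != 1:
                x += 1
                continue
            start = x
            while x < width and image[y][x] == 1:
                x += 1                      # advance to the end of the black run
            if x - start < min_run:
                continue                    # too short to be an underline
            below_is_white = (y == height - 1) or all(
                image[y + 1][c] == 0 for c in range(start, x)
            )
            if below_is_white:
                for c in range(start, x):
                    out[y][c] = 0           # strip the run
    return out


clean = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],   # underline with white below
    [0, 0, 0, 0, 0, 0],
]
touching = [row[:] for row in clean]
touching[4][2] = 1        # a symbol in the next line touches the underline

print(strip_underlines(clean)[3])
print(strip_underlines(touching)[3])
```

In the second case the single touching pixel defeats the "white from below" test and the underline is left in place, which corresponds to the limitation noted above.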
Accordingly, it is an object of the present invention to provide a system and method for processing data representative of an image containing at least one line of symbols which may include underline characters.
It is another object to provide an underline processor and method which detects underline characters which may touch associated text characters from above and below.
A further object is to provide a "real time" system and method for identifying underline characters in a line of symbols.
Yet another object is to provide a system and method for detection of underline characters in textual material where the underline characters may vary substantially in thickness and vertical registration, or skew.
Another object is to provide a system and method for detection of horizontal line segments having a minimum predetermined length in textual material.