Many types of character recognition systems are known in the prior art. Typically in such systems, a scanning means such as a flying spot optical scanner scans a medium on which characters are stored and provides one type of electrical output signal in response to the scan of a character and another type of electrical output signal in response to a scan of the background. The character recognition logic receives the scanner output signals and makes a decision as to the character identity. As examples, optical scanners may distinguish black characters from a white background, or vice versa, and magnetic scanners may distinguish between characters written with magnetic ink and the nonmagnetic background. Since the scanner output signals only inform the logic whether the scanner is instantaneously viewing a spot on the character or a spot on the background, it is also necessary to provide position signals to the recognition logic. The position information supplied corresponds to the movement of the scanner.
The scanner usually performs a patterned scan which covers a certain area. In some systems, an entire line may be scanned, using a buffer for storing the information. In other systems, scanning occurs one vertical line of a character at a time. In these latter systems, the patterned scan of one character in a line is followed by the patterned scan of the next character in the line, and so on. When the scan reaches the end of a line it moves to the next line and begins again.
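By way of illustration only, the column-at-a-time scan described above can be sketched as follows. The sketch assumes the scanned medium is already available as a small binary array (1 for a spot on the character, 0 for background); the function name `scan_columns` and the sample data are hypothetical and do not come from any particular prior art system. Each sample pairs the scanner output signal with the position information the recognition logic requires.

```python
def scan_columns(image):
    """Yield (column, row, signal) samples for a column-by-column scan.

    image: list of rows, each a list of 0 (background) or 1 (character).
    The column and row values stand in for the position signals; the
    third value stands in for the scanner output signal.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    for col in range(width):          # one vertical line at a time
        for row in range(height):     # top to bottom within that line
            yield col, row, image[row][col]


# Hypothetical 3x3 glyph used only to exercise the sketch.
glyph = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
samples = list(scan_columns(glyph))
```

Without the column and row values, the recognition logic would receive only a stream of character and background signals with no indication of where on the medium each one was sensed.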
In an office environment, most documents are machine printed in a single font style with a fixed pitch, usually on a typewriter. The documents have relatively good print quality, and the text contains an inherent systematic grid pattern according to the fixed pitch and line spacing defined by the printing mechanism. A common baseline can easily be visualized for each line of printed characters. The baseline is defined as the horizontal line along the bottom of an uppercase X. The fixed pitch property has been widely used for segmentation, but baseline information has never been used or mentioned in the prior art. Segmentation refers to the separation of one character or mark from another, either vertically or horizontally, and registration refers to the positioning of the scanning device over the character or mark to be sensed. Possible reasons for this failure to use baseline information in OCR applications are:
1. Some OCR systems are designed for recognizing a line of characters with a special mark at the beginning. The mark is used to help the OCR machine locate the characters and keep the line skew under control.
2. For some OCR applications, the spacing between lines of printed characters is large enough, and/or no underscores, subscripts or superscripts are allowed in the text. Therefore, character images will never touch vertically with any other characters above or below.
Due to the intrinsic grid pattern existing on most office documents, segmentation may seem to be straightforward. However, it still faces the typical problems of line skew, which may originate in the printing process or be caused by misaligned scanning, and of vertically touching images due to the presence of underscores, subscripts or superscripts. As to registration, the regular simple boxing technique, in which the OCR scanner is physically mounted so as to be located over the center of a character, is not able to properly register character images having edge noise or missing strokes. In addition, boxing registration loses character vertical position information, which is quite important for character recognition with some font styles. Under these circumstances, baseline information is found to be very useful in handling these problems efficiently and effectively. Also, by using baseline information in accordance with this invention, the need for sophisticated line finding programs can be avoided when segmenting office-generated documents.
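By way of illustration only, and not as a description of the method of this invention, the two properties discussed above can be sketched on a small binary image of one printed line. The sketch makes several assumptions: the line is unskewed, it contains no underscore, the pitch in pixels is known, and the baseline can be taken as the row after which the amount of ink drops most sharply (descenders contribute little ink below it). The names `estimate_baseline` and `segment_fixed_pitch` and the sample data are hypothetical.

```python
def estimate_baseline(image):
    """Return the row index taken as the baseline of one text line.

    Heuristic (an assumption of this sketch): the baseline is the row
    followed by the largest drop in ink count when moving downward.
    """
    profile = [sum(row) for row in image] + [0]   # ink per row, padded
    drops = [profile[r] - profile[r + 1] for r in range(len(image))]
    return max(range(len(drops)), key=drops.__getitem__)


def segment_fixed_pitch(image, pitch, left=0):
    """Cut the line into character cells of equal width `pitch`."""
    width = len(image[0])
    return [
        [row[start:start + pitch] for row in image]
        for start in range(left, width, pitch)
    ]


# Hypothetical line of two characters, pitch 4; the second character
# has a descender extending two rows below the baseline.
line = [
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0],
]
baseline = estimate_baseline(line)
cells = segment_fixed_pitch(line, pitch=4)
```

In this sketch every cell retains its rows relative to the same baseline, so the vertical position of each character is preserved, whereas centering each character in its own box would discard that information.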