Optical character recognition (OCR) is the computer-based translation of an image of text into digital form as machine-editable text, generally in a standard encoding scheme. This process eliminates the need to manually type the document into the computer system. A number of problems can arise due to poor image quality, imperfections caused by the scanning process, and the like. For example, a conventional OCR engine may be coupled to a flatbed scanner that scans a page of text. Because the page is placed flush against a scanning face of the scanner, an image generated by the scanner typically exhibits even contrast and illumination, reduced skew and distortion, and high resolution. Thus, the OCR engine can easily translate the text in the image into machine-editable text. However, when the image is of lesser quality with regard to contrast, illumination, skew, etc., performance of the OCR engine may be degraded and processing time may increase because all pixels in the image must be processed. This may be the case, for instance, when the image is obtained from a book or generated by an imager-based scanner, because in these cases the text or picture is scanned from a distance, from varying orientations, and in varying illumination. Even if the scanning process performs well, the performance of the OCR engine may be degraded when a relatively low-quality page of text is being scanned.
One part of the OCR process identifies textual lines in a bitmap of a textual image. One component of the OCR engine segments each textual line with a series of chop lines located between adjacent characters or glyphs. Ideally, a single character or glyph is located between each pair of adjacent chop lines. In many cases, however, it is difficult to segment words into individual symbols due to poor image quality, font weight, italic text, character shape, etc.
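As a rough illustration of the chop-line idea, the following Python sketch places a chop line in the middle of every run of blank columns that separates two inked regions of a textual line. It uses a vertical projection profile, a common simple heuristic, and is not necessarily how any particular OCR engine segments text. The function names and the representation of the line as rows of 0/1 pixels (1 meaning ink) are assumptions made for illustration only.

```python
from typing import List


def column_ink_counts(line_bitmap: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each column of a binarized text line."""
    width = len(line_bitmap[0]) if line_bitmap else 0
    return [sum(row[x] for row in line_bitmap) for x in range(width)]


def find_chop_lines(line_bitmap: List[List[int]]) -> List[int]:
    """Return x-positions of candidate chop lines (hypothetical helper).

    A chop line is placed at the center of each run of blank columns
    that lies between two inked regions. Leading and trailing blank
    margins do not produce chop lines.
    """
    counts = column_ink_counts(line_bitmap)
    chops: List[int] = []
    seen_ink = False      # True once the first inked column has been passed
    gap_start = None      # first column of the current blank run, if any
    for x, count in enumerate(counts):
        if count == 0:
            if seen_ink and gap_start is None:
                gap_start = x
        else:
            if gap_start is not None:
                # The blank run spans gap_start .. x-1; chop at its center.
                chops.append((gap_start + x - 1) // 2)
                gap_start = None
            seen_ink = True
    return chops


# Two small glyphs separated by three blank columns.
separated = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 0, 0, 1],
]
# Glyphs that touch or overlap horizontally leave no blank column.
touching = [
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
chops_separated = find_chop_lines(separated)
chops_touching = find_chop_lines(touching)
```

In this sketch the touching case yields no chop lines at all, which mirrors the difficulty noted above: heavy fonts, italic text, and poor image quality tend to remove the blank columns that such a simple heuristic depends on.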