Optical character recognition, or OCR, is the process of transforming a graphical bit image of a page of textual information into a text file wherein the text information is stored in a common computer-processable format, such as ASCII. The text file can then be edited using standard word processing software.
When a document is being scanned, the brightness of each dot, or pixel, within the image of the document is stored as a code that represents the tonal range of the pixel. When using a monochromatic scanner, the tonal range varies from pure black to pure white along a gray scale. The code, or gray scale level, is typically four bits, giving a tonal range of 0 to 15, where 0 is typically pure black and 15 is typically pure white. Since each pixel is stored as a gray scale level, the scanner must determine the threshold gray scale level that separates textual information, which is typically toward the lower, black, end of the gray scale, from background information, which is typically toward the higher, white, end of the gray scale.
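The separation step described above can be sketched as follows. This is a minimal illustration, not the scanner's actual logic: the function name `binarize`, the sample `page`, and the threshold value of 8 are all assumptions chosen for the example, and the convention that levels below the threshold count as text follows the 0 = black, 15 = white scale described above.

```python
def binarize(pixels, threshold):
    """Map 4-bit gray levels to 1 (text) or 0 (background).

    Levels below the threshold are treated as text, because 0 is taken
    to be pure black and 15 pure white.
    """
    return [[1 if level < threshold else 0 for level in row] for row in pixels]


# Hypothetical 3x4 scan: a dark blob on a near-white background.
page = [
    [15, 15, 14, 15],
    [15,  1,  2, 15],
    [14,  0,  1, 15],
]

# The threshold of 8 is an arbitrary midpoint, assumed for illustration.
mask = binarize(page, threshold=8)
for row in mask:
    print(row)
```

The difficulty addressed in the remainder of this section is how to choose the threshold, which is simply supplied by hand here.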
Prior art devices have created a histogram of the gray scale levels of the pixels within a document, and used peaks found in the histogram to perform the separation. This method works well with simple documents, which typically have black text printed on a white background. Scanning this type of document results in a histogram having two peaks, one representing the text at the lower end of the gray scale, and one representing the background at the higher end of the gray scale. The gray level threshold that separates text from background would be set at the valley between the two peaks.
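A rough sketch of this histogram approach is given below. It is not taken from any particular prior art device: the helper names, the naive peak test (no smoothing), and the choice of the middle of a flat valley as the threshold are all assumptions made for illustration.

```python
def gray_histogram(pixels, levels=16):
    """Count the pixels at each gray level (0 = black, levels - 1 = white)."""
    counts = [0] * levels
    for row in pixels:
        for level in row:
            counts[level] += 1
    return counts


def find_peaks(hist):
    """Return indices of naive local maxima (no smoothing; an assumption)."""
    peaks = []
    for i, count in enumerate(hist):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i + 1 < len(hist) else 0
        if count > left and count >= right:
            peaks.append(i)
    return peaks


def valley_threshold(hist):
    """Return a gray level in the valley between exactly two peaks."""
    peaks = find_peaks(hist)
    if len(peaks) != 2:
        raise ValueError(f"expected 2 peaks, found {len(peaks)}")
    low, high = peaks
    between = range(low + 1, high)
    lowest = min(hist[i] for i in between)
    candidates = [i for i in between if hist[i] == lowest]
    return candidates[len(candidates) // 2]  # middle of a flat valley


# Hypothetical scan of a simple document: black text on a white background.
page = [
    [15, 15, 14, 15, 15],
    [15,  1,  0,  1, 15],
    [14,  1,  2,  1, 15],
    [15, 15, 15, 14, 15],
]

hist = gray_histogram(page)
threshold = valley_threshold(hist)
print(find_peaks(hist), threshold)
```

In this sketch, `valley_threshold` raises an error whenever the histogram does not have exactly two peaks, so it only handles the simple two-peak case.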
When the document is more complex, however, this method breaks down. For example, if part of the document has dark text on a light background, and part of the document has dark text on a colored background, the histogram may have three peaks, with the middle peak representing the colored background. Alternatively, if the document has dark text on a light background, and also has lighter text on a light background, the histogram will also have three peaks, with the middle peak representing the lighter text, not background. In both cases the histogram alone does not indicate whether the middle peak belongs with the text or with the background.
A third situation occurs when the text is formed of characters with very thin lines. In this instance, only one peak may occur in the histogram, and this peak will represent the background color only.
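The situations described above can be illustrated by counting peaks in a few hypothetical histograms. The counts below are invented purely for illustration and are not measurements from real documents; the peak test is the same naive, unsmoothed local-maximum check one might assume for a simple histogram method.

```python
def count_peaks(hist):
    """Count naive local maxima in a gray level histogram (no smoothing)."""
    padded = [0] + list(hist) + [0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] >= padded[i + 1]
    )


# Hypothetical 16-level histograms (index 0 = black, index 15 = white).
histograms = {
    # Black text on white: one text peak, one background peak.
    "simple": [0, 40, 90, 30, 0, 0, 0, 0, 0, 0, 0, 0, 0, 60, 400, 900],
    # Middle peak comes from a colored background region.
    "colored_background": [0, 40, 90, 30, 0, 0, 0, 50, 300, 60, 0, 0, 0, 60, 400, 900],
    # Middle peak comes from lighter text, not background.
    "lighter_text": [0, 40, 90, 30, 0, 0, 20, 70, 25, 0, 0, 0, 0, 60, 400, 900],
    # Thin strokes: too few dark pixels to form a separate text peak.
    "thin_lines": [0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 6, 9, 15, 60, 400, 900],
}

peak_counts = {name: count_peaks(hist) for name, hist in histograms.items()}
for name, count in peak_counts.items():
    print(f"{name}: {count} peak(s)")
```

The two three-peak histograms have the same overall shape even though the middle peak is background in one and text in the other, and the thin-line histogram offers no valley at all.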
There is a need in the art, then, for a scanning system that can separate text from background in complex documents. There is a further need for such a system that can separate text printed on two or more background colors. A still further need is for such a system that will separate text formed from thin lines. The present invention meets these needs.