Optical Character Recognition (OCR) is a useful technique for processing business forms. Machine reading systems can replace several data-entry operators and reduce the expense of data capture.
In general, the first step of the OCR process is to scan the document electronically and convert all of the information to a digital bitmap. Once the image is captured in electronic form, the information to be read is separated from the background information: boxes and guide text must be ignored, and the filled-out text must be read. Once this separation is accomplished, the electronic image of the text is processed by the OCR algorithm, which converts the characters of interest to ASCII data.
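The three stages described above (bitmap conversion, background separation, character recognition) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the 3x3 glyph size, the `TEMPLATES` table, the threshold of 128, the explicit `background_mask`, and the use of nearest-template matching, which merely stands in for whatever OCR algorithm a real system would use.

```python
# Hypothetical 3x3 glyph templates (1 = black pixel); not from any real OCR system.
TEMPLATES = {
    "I": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "T": ((1, 1, 1), (0, 1, 0), (0, 1, 0)),
}

def to_bitmap(gray, threshold=128):
    """Step 1: convert scanned gray levels (0-255) to a 1-bit bitmap."""
    return tuple(tuple(1 if v < threshold else 0 for v in row) for row in gray)

def remove_background(bitmap, background_mask):
    """Step 2: clear every pixel that belongs to the pre-printed form."""
    return tuple(
        tuple(0 if m else b for b, m in zip(brow, mrow))
        for brow, mrow in zip(bitmap, background_mask)
    )

def recognize(glyph):
    """Step 3: return the template character with the fewest mismatched pixels."""
    def mismatches(template):
        return sum(a != b for trow, grow in zip(template, glyph)
                   for a, b in zip(trow, grow))
    return min(TEMPLATES, key=lambda ch: mismatches(TEMPLATES[ch]))

# A scanned cell holding a typed "L" plus one dark pixel from a form box (top right).
gray = [[10, 250, 30],
        [10, 250, 250],
        [10, 10, 10]]
background_mask = ((0, 0, 1), (0, 0, 0), (0, 0, 0))

text_only = remove_background(to_bitmap(gray), background_mask)
char = recognize(text_only)
ascii_code = ord(char)  # "L" -> 76
```

In this sketch the background mask is simply given; how that mask is obtained in practice is the separation problem that the remainder of this section addresses.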
Almost all OCR systems that process business forms employ a "drop-out color" technique. By printing documents in a predetermined color (usually a pastel color) and employing an optical filter of the same color in the electronic scanner, the filled-out text on the document can be separated from the printed form. The color filter causes the scanner to ignore information printed in that color (to the electronic scanner, the form color appears equivalent to the white background of the paper). However, since the filled-out text is typically typed or printed in black (or another dark color), this information is captured by the scanner as black. Hence, the pre-printed form is converted to a white background, and the filled-out text can be processed readily by an OCR algorithm.
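The drop-out effect can be sketched numerically. The model below is a simplification with assumed values: the optical filter is treated as passing exactly one RGB channel to a monochrome sensor, the reflectance triples are invented, and the black/white threshold of 200 is arbitrary.

```python
# Assumed RGB reflectances (0-255); illustrative values only.
WHITE_PAPER = (255, 255, 255)
FORM_INK = (255, 170, 170)    # pastel red pre-printed boxes and guide text
BLACK_TEXT = (20, 20, 20)     # typed or printed entries

RED, GREEN, BLUE = 0, 1, 2

def scan_through_filter(pixel, filter_channel):
    """Idealized sensor response behind a filter that passes one channel."""
    return pixel[filter_channel]

def binarize(row, filter_channel, threshold=200):
    """Return 1 (black) where the filtered response falls below the threshold."""
    return [1 if scan_through_filter(p, filter_channel) < threshold else 0
            for p in row]

row = [WHITE_PAPER, FORM_INK, BLACK_TEXT, FORM_INK, WHITE_PAPER]

matched = binarize(row, RED)       # [0, 0, 1, 0, 0]: the form drops out
mismatched = binarize(row, GREEN)  # [0, 1, 1, 1, 0]: the form is read as black
```

With the matching (red) filter, the pastel red ink reflects as much light as the paper and only the black entry survives. With a non-matching filter, the form ink falls below the threshold and is captured together with the text.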
The optical filter works well in this application, but it limits the customer to a very specific form color (one that precisely matches the characteristics of the optical filter installed in the scanner). Additional drop-out colors can be supported by equipping the scanner with additional optical filters. Processing a particular form would then require selecting the proper optical filter and mechanically inserting it before the form is processed.
However, slight variations in the printing process can produce variability in the actual color of the printed form, thereby reducing the "drop-out" effect. Such changes can add noise (the scanner sees the pre-printed form information as black instead of white), which may cause the OCR algorithm to produce erroneous results. Changing optical filters to accommodate these slight printing variations is not practical, since it would require a large inventory of filters, each with slightly different characteristics. Therefore, at present, the only practical way to control this problem is to control the printing process tightly to ensure a uniform drop-out color. As a result, OCR form-reading systems presently in use are generally "closed loop", which means that the forms-processing firm (such as an insurance carrier) must maintain control over the printing of its forms, because forms created by outside establishments may not read properly due to color variations.
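The sensitivity to printing variation can be sketched with the same kind of idealized model. The values are assumptions chosen for illustration: a fixed red filter modeled as passing only the red channel, a threshold of 200, and two invented ink batches, one nominal and one printed slightly darker.

```python
# Assumed ink colors (RGB, 0-255) for two print runs of the same form.
NOMINAL_INK = (255, 170, 170)   # matches the installed red filter
DRIFTED_INK = (185, 150, 150)   # hypothetical darker batch from another printer

def reads_as_black(pixel, filter_channel=0, threshold=200):
    """True if a pixel behind the fixed filter falls below the threshold."""
    return pixel[filter_channel] < threshold

def noise_pixels(form_ink, form_pixel_count=100):
    """Count pre-printed form pixels that the scanner captures as black."""
    return sum(reads_as_black(form_ink) for _ in range(form_pixel_count))

nominal_noise = noise_pixels(NOMINAL_INK)  # 0: the form drops out cleanly
drifted_noise = noise_pixels(DRIFTED_INK)  # 100: every form pixel becomes noise
```

Because the filter is fixed, nothing in the scanner can compensate for the drifted batch. The only remedies in this model are a different physical filter or tighter control of the ink, which is the closed-loop constraint described above.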
The present invention discloses a method and apparatus for detecting a drop-out color and selecting color filter coefficients automatically, in real time, by sampling the color from each form as it is being processed. Changing the form color no longer requires a separate calibration step, and forms of any color can be processed in an intermixed fashion. However, each form must have a reserved area or section containing a sample of the form color.
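One way to picture per-form coefficient selection is the sketch below. It is not the method claimed here; it is a minimal illustration under stated assumptions: the coefficients are taken to be per-channel gains that map the sampled form color to full white, the filtered gray level is the darkest gained channel, and the names `sample_form_color`, `filter_coefficients`, `apply_filter`, and `read_form`, as well as the threshold of 200, are hypothetical.

```python
def sample_form_color(patch):
    """Average RGB over the pixels of the reserved sample area."""
    n = len(patch)
    return tuple(sum(p[i] for p in patch) / n for i in range(3))

def filter_coefficients(form_color):
    """Assumed model: per-channel gains that scale the form color to 255."""
    return tuple(255.0 / max(c, 1.0) for c in form_color)

def apply_filter(pixel, coeffs):
    """Gray level after filtering: darkest gained channel, clipped to 255."""
    return min(255.0, min(p * k for p, k in zip(pixel, coeffs)))

def read_form(patch, pixels, threshold=200):
    """Sample the form color, derive coefficients, and binarize the pixels."""
    coeffs = filter_coefficients(sample_form_color(patch))
    return [1 if apply_filter(p, coeffs) < threshold else 0 for p in pixels]

# Two intermixed forms, each carrying its own color sample.
pink_patch = [(250, 170, 170), (254, 174, 166)]
pink_pixels = [(255, 255, 255), (252, 172, 168), (20, 20, 20)]

blue_patch = [(170, 170, 250)]
blue_pixels = [(255, 255, 255), (170, 170, 250), (20, 20, 20), (160, 165, 240)]

pink_result = read_form(pink_patch, pink_pixels)  # [0, 0, 1]
blue_result = read_form(blue_patch, blue_pixels)  # [0, 0, 1, 0]
```

Since the coefficients are recomputed from each form's own sample, the pink and blue forms are handled back to back without recalibration, and the slightly off-color last pixel of the blue form still drops out. A real implementation would need to choose the gain model and threshold with care; the ones used here are placeholders.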