Automatic computer-implemented reading of data from printed forms is typically done in a sequence of three steps. First, a form is optically scanned to create an electronic image, which is written to digital storage as a rectangular array of 0's and 1's representing white and black subareas, or pixels. Next, the image is processed to extract the regions, or fields, containing the data to be read. Finally, the black-and-white subimage in each extracted region is interpreted and expressed in an alphanumeric code, such as ASCII or EBCDIC.
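The first two steps can be illustrated with a minimal sketch. It assumes, purely for illustration, that the scanner delivers grayscale samples from 0 (black) to 255 (white) and that a fixed threshold of 128 separates black from white; the function names `binarize` and `extract_field` are hypothetical. The third step, interpreting the subimage as characters, is omitted.

```python
from typing import List

# A binary image: rows of pixels, 0 = white, 1 = black.
Image = List[List[int]]


def binarize(gray: List[List[int]], threshold: int = 128) -> Image:
    """Step 1: map grayscale samples (assumed 0 = black .. 255 = white) to 0/1 pixels."""
    return [[1 if sample < threshold else 0 for sample in row] for row in gray]


def extract_field(image: Image, top: int, left: int, height: int, width: int) -> Image:
    """Step 2: cut out the rectangular region expected to contain a data value."""
    return [row[left:left + width] for row in image[top:top + height]]
```

For example, `extract_field(binarize(gray), top=0, left=1, height=2, width=2)` returns the 2-by-2 subimage that a recognition step would then interpret.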
The data present in printed forms may be defined as having two aspects: a value and a significance. For example, the word "Yes" is a value that becomes data only when its significance, i.e., the question it answers, is made clear. Printed forms provide a conventional means for recording data in which significance is predefined as a background of text and graphics, such as boxed areas. Since forms are printed mechanically, the background is identical over different instances of the same form. Thus the position of a data value on the form corresponds to its significance. Optical character recognition (OCR) devices take advantage of this fact to read data from credit card receipts, billing statements, etc. Such "OCR forms" are designed with data values entered in spaces well separated from background printing to ensure that the latter is not erroneously interpreted as data values. Data significance does not appear explicitly, but is stored in the computer and associated with the data values on the basis of position in the image. In some cases, the form background is printed in a color invisible to the scanner to avoid any possibility of confusion. Data values are carefully positioned during printing, and the form is precisely registered during scanning. All these steps serve to guarantee that the data values lie exactly where the reading or scanning equipment performs its extraction process.
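The position-based association of value and significance described above can be sketched as a stored template that maps each field's significance to a fixed rectangle in the image. The field names, coordinates, and the helpers `Region`, `crop`, and `extract_by_position` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, NamedTuple

# A binary image: rows of pixels, 0 = white, 1 = black.
Image = List[List[int]]


class Region(NamedTuple):
    top: int
    left: int
    height: int
    width: int


# Hypothetical template: the significance of each field is stored in the
# computer, keyed to the fixed position where its value is expected.
TEMPLATE: Dict[str, Region] = {
    "account_number": Region(top=0, left=0, height=2, width=3),
    "amount": Region(top=2, left=1, height=2, width=3),
}


def crop(image: Image, region: Region) -> Image:
    """Return the subimage covered by the region."""
    rows = image[region.top:region.top + region.height]
    return [row[region.left:region.left + region.width] for row in rows]


def extract_by_position(image: Image, template: Dict[str, Region]) -> Dict[str, Image]:
    """Pair each field's significance with the subimage found at its fixed position."""
    return {name: crop(image, region) for name, region in template.items()}
```

This approach works only if every instance of the form places each value inside its template rectangle, which is what the printing and registration controls are meant to guarantee.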
In recent years, demand has grown for a capability to capture data from printed forms that do not meet OCR constraints. Forms routinely used in government and commercial operations, such as birth and marriage certificates, are designed to be intelligible to the human eye and brain. While people are sophisticated processors of visual images, they also require that both attributes of a data element, the significance and the value, be present on the document. Thus background printing is provided to supply the meaning of each data field, and lines and boxes are added to make clear the association of data value and data significance. The crowded appearance of these "people forms", compared with OCR forms, is a necessary consequence of the need to pack a great deal of information into a limited space.
It is also difficult to enforce positioning and registration controls in the preparation of people forms. A birth or marriage certificate filled out with a typewriter is registered by eye, often with errors in translation and skew compared to the ideal orientation. Data values may be superimposed on the form background as a result. The printing process itself is subject to mechanical slippage that may have the same effect. Finally, mechanical slippage and electronic noise occurring during the optical scanning process present a further source of registration error. This is particularly true if economical general-purpose scanners are used. The net result of all these factors is that a given data value printed on a people form may be skewed, may overlap the boundary lines separating data regions, and, even when ideally positioned, does not consistently appear in a fixed, predictable region in scanned images of different instances of the form. These difficulties pose severe problems for automatic computer-implemented data extraction, rendering inapplicable the sort of processing used for OCR forms.
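The effect of a registration error on fixed-region extraction can be shown with a small sketch. It models translation only (skew is not modeled), and the glyph, page size, offsets, and helper names `stamp` and `black_pixels_in` are assumptions made for illustration.

```python
from typing import List

# A binary image: rows of pixels, 0 = white, 1 = black.
Image = List[List[int]]


def blank(height: int, width: int) -> Image:
    """Return an all-white page."""
    return [[0] * width for _ in range(height)]


def stamp(page: Image, glyph: Image, top: int, left: int) -> Image:
    """Return a copy of the page with the glyph's black pixels placed at (top, left).

    Pixels that would fall outside the page are dropped.
    """
    out = [row[:] for row in page]
    for r, glyph_row in enumerate(glyph):
        for c, pixel in enumerate(glyph_row):
            y, x = top + r, left + c
            if pixel and 0 <= y < len(out) and 0 <= x < len(out[0]):
                out[y][x] = 1
    return out


def black_pixels_in(image: Image, top: int, left: int, height: int, width: int) -> int:
    """Count the black pixels that a fixed extraction region captures."""
    return sum(sum(row[left:left + width]) for row in image[top:top + height])


# Hypothetical 3x3 data value with 8 black pixels.
GLYPH: Image = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]

# The fixed region where the value is expected: rows 2-4, columns 2-4.
registered = stamp(blank(8, 8), GLYPH, top=2, left=2)  # ideal placement
shifted = stamp(blank(8, 8), GLYPH, top=3, left=4)     # off by 1 row, 2 columns

captured_registered = black_pixels_in(registered, 2, 2, 3, 3)
captured_shifted = black_pixels_in(shifted, 2, 2, 3, 3)
```

With these assumed values, the fixed region captures all 8 black pixels of the registered value but only 2 of the shifted one, so a recognizer fed the fixed region would see a fragment rather than the full character.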