In many document imaging systems, large numbers of forms are scanned into a computer, which then processes the resulting document images to extract pertinent information. Typically, the forms comprise preprinted templates containing predefined fields that have been filled in by hand or with machine-printed characters. Before extracting the information that has been filled into any given form, the computer must first know which field is which. Only then can the computer process the information that the form contains. The same problem arises with form documents and tables that are entered into the computer electronically, when the format or semantics differ between forms or tables.
In some applications, such as population censuses and tax processing systems, a variety of different forms are used. Usually, a human operator is employed to identify the locations and contents of the fields on the forms and thus to label the fields for the computer. In some cases, when a large variety of form types is provided without prior sorting by type, the operator must preprocess nearly every document before it can be input to the computer. The involvement of the operator substantially increases the cost of document processing.