This invention relates to techniques for generating and processing self-describing forms. Form processing refers to the process of extracting data from a form, such as the extraction of handwritten or machine-printed data from a paper-based form or the extraction of audio data from an audio-based form. For example, sales orders, credit card applications, enrollment questionnaires and surveys can all require a user to insert data onto a printed form, either by handwriting or using a machine, such as a typewriter. Historically, extracting user data from a form required a human operator to read the form and manually key the data into a storage system such as a database, a labor-intensive and therefore expensive and time-consuming task.
With the advent of automated form processing technology, including the use of optical character recognition (OCR) and intelligent character recognition (ICR), the task has become more efficient, reducing the need for human operators. A paper-based form includes form data, that is, the information printed onto the form itself (e.g., the word “Address”), and user data, that is, the information added by a user to complete the form (e.g., the user's address). The completed form can be used to create an image file; for example, the paper-based form can be scanned to create a PDF or TIFF file. A program receives the image file as input, locates the user data, translates the images forming the user data into character codes, for example, ASCII, and may output a text file. The program can be an OCR program, which is typically used to recognize machine-printed characters, an ICR program, which is typically used to recognize handwritten characters, or a program that can perform both OCR and ICR. Hereinafter, the term “OCR/ICR program” shall be used to refer to a program that can perform either OCR, ICR or both. The OCR and ICR processes typically involve complex image processing algorithms and may require manual proofreading to correct inaccuracies.
In order to distinguish between form data and user data, information can be provided to the OCR/ICR program that identifies locations on the form where user data is expected to be found, typically referred to as zoning information. Additional information can be provided that identifies certain aspects of the user data expected to be found at a particular location. For example, with respect to a form field requesting the user's social security number, information can be provided to the OCR/ICR program specifying that a numerical value is expected. When performing character recognition, the OCR/ICR program will therefore not mistake, for example, the number “1” for the letter “l”.
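By way of illustration only, the use of zoning information together with an expected data type can be sketched as follows. The sketch is in Python and is not taken from any particular OCR/ICR program; the `Zone` record, the `DIGIT_LOOKALIKES` table and the `apply_zone_constraint` function are hypothetical names introduced for this example, and the coordinates are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Hypothetical zoning record: where user data is expected and of what kind."""
    name: str
    x: int          # left edge of the zone, in pixels (illustrative units)
    y: int          # top edge of the zone
    width: int
    height: int
    expected: str   # "numeric" or "text" (assumed vocabulary)


# Assumed table of letters that are easily confused with digits.
DIGIT_LOOKALIKES = {"l": "1", "I": "1", "O": "0", "o": "0", "S": "5", "B": "8"}


def apply_zone_constraint(raw: str, zone: Zone) -> str:
    """Resolve ambiguous characters using the data type expected in the zone."""
    if zone.expected != "numeric":
        # No constraint is known for this zone, so the raw result is kept.
        return raw
    # In a numeric zone, look-alike letters are read as the matching digits.
    return "".join(DIGIT_LOOKALIKES.get(ch, ch) for ch in raw)


ssn_zone = Zone("ssn", x=120, y=340, width=200, height=30, expected="numeric")
name_zone = Zone("name", x=120, y=280, width=400, height=30, expected="text")

corrected_ssn = apply_zone_constraint("l23-45-678O", ssn_zone)
unchanged_name = apply_zone_constraint("Ollie", name_zone)
```

In this sketch the raw recognition result for the social security number field has its look-alike letters replaced by digits, whereas the same characters in the name field are left untouched because no numeric value is expected there.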
One conventional method of making zoning and other such information accessible to an OCR/ICR program is to maintain a catalog of information related to a set of forms, which is accessible by the OCR/ICR program, for example, via a networked database. In order to use the catalog, the OCR/ICR program first identifies the form, so that the corresponding zoning information can be retrieved. A form identifier can be encoded onto the form, for example, using a two-dimensional (2D) graphical symbol, such as a 2D barcode. The OCR/ICR program reads the barcode, learns the identity of the form, and looks up the corresponding zoning information in the catalog. Alternatively, a barcode can encode a URL, which the OCR/ICR program can use to retrieve the corresponding zoning information from the remote location specified by the URL, for example, over an Internet connection. The zoning information can then be used to facilitate the processing of the form, as described above.
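The two conventional retrieval paths described above can be sketched, purely as an illustration, in Python. The example assumes that the barcode has already been decoded into a text payload; the `CATALOG` contents, the form identifier `FORM-001`, the example URL and the `resolve_zoning` function are hypothetical, and the remote retrieval is passed in as a caller-supplied `fetch` function rather than tied to any particular network library.

```python
from typing import Callable, Dict, List, Optional

ZoningInfo = List[dict]

# Hypothetical catalog keyed by form identifier; in practice this might be
# a networked database rather than an in-memory dictionary.
CATALOG: Dict[str, ZoningInfo] = {
    "FORM-001": [
        {"field": "ssn", "box": (120, 340, 200, 30), "expected": "numeric"},
        {"field": "name", "box": (120, 280, 400, 30), "expected": "text"},
    ],
}


def resolve_zoning(
    payload: str,
    catalog: Dict[str, ZoningInfo],
    fetch: Optional[Callable[[str], ZoningInfo]] = None,
) -> ZoningInfo:
    """Return zoning information for a decoded barcode payload."""
    if payload.startswith(("http://", "https://")):
        # The barcode encodes a URL: retrieve the zoning information remotely.
        if fetch is None:
            raise ValueError("a fetch function is required for URL payloads")
        return fetch(payload)
    # Otherwise the payload is treated as a form identifier for the catalog.
    if payload not in catalog:
        raise LookupError(f"form {payload!r} is not in the catalog")
    return catalog[payload]


def fake_fetch(url: str) -> ZoningInfo:
    """Stand-in for a network request, used only to keep the sketch offline."""
    return [{"field": "address", "box": (120, 400, 400, 60), "expected": "text"}]


local_zones = resolve_zoning("FORM-001", CATALOG)
remote_zones = resolve_zoning("https://example.com/forms/42", CATALOG, fake_fetch)
```

In either path the form carries only a pointer to the zoning information, so the sketch fails with a `LookupError` when the identifier is absent from the catalog and cannot proceed at all for a URL payload unless a means of retrieval is available.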