1. Technical Field
The invention disclosed broadly relates to data processing systems and methods and more particularly relates to techniques for the repair of character recognition information derived from scanned document images.
2. Related Patent Applications
This patent application is related to the co-pending U.S. patent application Ser. No. 07/870,129, filed Apr. 15, 1992, entitled "Data Processing System and Method for Sequentially Repairing Character Recognition Errors for Scanned Images of Document Forms," by T. S. Betts, V. M. Carras, L. B. Knecht, T. L. Paulson, and G. R. Anderson, the application being assigned to the IBM Corporation and incorporated herein by reference.
This patent application is also related to the co-pending U.S. patent application Ser. No. 07/305,828, filed Feb. 2, 1989, entitled "A Computer Implemented Method for Automatic Extraction of Data From Printed Forms," by R. G. Casey and D. R. Ferguson, the application being assigned to the IBM Corporation and incorporated herein by reference.
3. Background Art
Document forms can be filled out in a variety of ways. Examples of writing methods include hand printing of block letters, cursive handwriting of characters, impact typing, and printing with a dot matrix printer. There can be a variety of character styles and alphabets used in filling out document forms. Latin alphabets are typically used for document forms filled out in western countries. Kanji and Mandarin alphabets are typically used in some East Asian countries. Hebrew or Arabic alphabets are used to fill out forms in some Middle Eastern countries. Greek or Cyrillic alphabets are used in some Eastern European countries.
Each of the writing methods and alphabets requires a different, customized character recognition process to convert an image of a field in the document form into an alphanumeric string of coded data.
Errors which occur in the coded data output by a character recognition process can be repaired if the original meaning of the writer can be inferred from the context of the erroneous data. Since the fields in document forms are categorized by subject matter, such as "Name," "Address," "City," "State," "Zip Code," "Country," etc., the context is already provided for making many error correction inferences. The original meaning of erroneous coded data in the "State" field, for example, can be inferred from correct coded data in the "Zip Code" field of the same document form.
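By way of illustration only, the inference of a "State" value from a correctly recognized "Zip Code" value might be sketched as follows. The function name, the field arguments, and the three-entry prefix table are hypothetical and are not part of the disclosure; a practical table would cover every postal prefix.

```python
# Illustrative assumption: a tiny table mapping 3-digit ZIP prefixes to states.
# A real system would load a complete postal reference table.
ZIP_PREFIX_TO_STATE = {
    "100": "NY",
    "606": "IL",
    "941": "CA",
}


def repair_state(state_field: str, zip_field: str) -> str:
    """Return the state implied by the ZIP code, if one can be inferred.

    When the ZIP prefix is unknown, the recognized state value is returned
    unchanged, since no inference can be made from context.
    """
    inferred = ZIP_PREFIX_TO_STATE.get(zip_field.strip()[:3])
    if inferred is None:
        return state_field
    return inferred


# A misrecognized "State" field is repaired from the "Zip Code" field.
print(repair_state("N7", "10001"))
```

The sketch trusts the "Zip Code" field unconditionally; a more cautious design might replace the "State" value only when the recognition process reported low confidence for it.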
Artificial intelligence (AI) knowledge base techniques can be applied to automatically make error correction inferences for narrow subject matter categories of most fields in a document form. For example, reference lists of common given names can be used for "First Name" fields. Reference lists of city, state or country names can be used for "City," "State" and "Country" fields, respectively. The shorter the reference list, the greater the certainty will be in resolving ambiguous coded data strings for a given field.
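As a minimal sketch of the reference list idea, an ambiguous coded data string can be compared against a short list for its field and replaced by the closest entry. The example below uses approximate string matching from the Python standard library (`difflib.get_close_matches`) as a stand-in for the knowledge base technique, which the text does not specify; the list contents, the function name, and the 0.6 similarity cutoff are assumptions chosen for illustration.

```python
import difflib
from typing import Optional, Sequence

# Hypothetical reference list for a "City" field.
CITY_REFERENCE_LIST = ["Chicago", "Boston", "Denver"]


def repair_from_reference_list(
    coded: str, reference_list: Sequence[str], cutoff: float = 0.6
) -> Optional[str]:
    """Return the closest reference entry, or None if nothing is close enough."""
    matches = difflib.get_close_matches(coded, reference_list, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# "q" was recognized in place of "g"; the reference list resolves the error.
print(repair_from_reference_list("Chicaqo", CITY_REFERENCE_LIST))
```

A shorter list leaves fewer near neighbors for any given string, which is one way to see why certainty improves as the reference list shrinks.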
The number of different AI reference lists for a given document form can be at least as large as the number of fields on the form. And for each field, if the selection of the appropriate AI reference list is governed by the writing method or by the country where the document form was filled out, then the number of possible AI reference lists for each respective field increases by the number of such variations.
In many applications using document forms for the receipt of information from the public, the forms will be received in batches which are characterized by certain uniformities. For example, an international importer of general cargo will receive bills of lading in a batch with each arriving ship or airplane. If the cargo arriving on a first day was shipped from East Asia, for example, then it is reasonable to expect that some of the bill of lading document forms will have their fields filled out with Mandarin or Kanji characters, as well as with Latin alphabetic characters. On a second day, if the cargo was shipped from Eastern Europe, for example, then it is reasonable to expect that Greek and Cyrillic, as well as Latin alphabetic characters, will have been used to fill out the same fields on the bill of lading document forms.
The character recognition processes and the AI error correction processes needed to automatically read the bill of lading document forms for the shipment received on the first day will be different from those which are needed to read the document forms for the shipment received on the second day.
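The dependence of the processes on the batch can be pictured as a lookup from an anticipated batch characteristic to a set of processes. The following is a sketch under stated assumptions: the region keys, the process names, and the fallback to Latin-only recognition are all hypothetical labels, not processes named in the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ProcessSet:
    """Names of the processes to run for one batch (hypothetical labels)."""
    recognizers: Tuple[str, ...]
    reference_lists: Tuple[str, ...]


# Assumed profiles for the two shipments in the bill of lading example.
BATCH_PROFILES = {
    "east_asia": ProcessSet(
        recognizers=("kanji_ocr", "mandarin_ocr", "latin_ocr"),
        reference_lists=("east_asia_cities",),
    ),
    "eastern_europe": ProcessSet(
        recognizers=("greek_ocr", "cyrillic_ocr", "latin_ocr"),
        reference_lists=("eastern_europe_cities",),
    ),
}

# Assumed fallback when nothing is known about the batch.
DEFAULT_PROCESSES = ProcessSet(recognizers=("latin_ocr",), reference_lists=())


def select_processes(batch_origin: str) -> ProcessSet:
    """Pick the process set anticipated for a batch from its origin."""
    return BATCH_PROFILES.get(batch_origin, DEFAULT_PROCESSES)


first_day = select_processes("east_asia")
second_day = select_processes("eastern_europe")
print(first_day.recognizers)
print(second_day.recognizers)
```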
There are still other secondary and tertiary coded data repair processes which can be applied to some of the subject matter fields on a document form. An example of this is a comparison of the corrected coded data derived from the "Name" field, with a data base of customer names for a particular application. If the selection of the appropriate data base is governed by the country where the document form was filled out, then the number of possible data base error correction processes for such a field increases by the number of such variations.
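One possible form of such a data base comparison is sketched below: the customer data base is chosen by country, and the corrected "Name" value is confirmed against it after normalizing case, periods, and spacing. The customer names, country codes, and function names are invented for illustration, and an in-memory dictionary stands in for a real data base.

```python
from typing import Optional

# Hypothetical customer data bases, one per country of origin.
CUSTOMERS_BY_COUNTRY = {
    "US": ["Acme Shipping Co.", "Harbor Lines Inc."],
    "GR": ["Aegean Cargo S.A."],
}


def _normalize(name: str) -> str:
    """Lowercase, drop periods, and collapse runs of whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def confirm_customer(name: str, country: str) -> Optional[str]:
    """Return the canonical customer name from the country's data base.

    Returns None when the country has no data base or the name is not found.
    """
    candidates = CUSTOMERS_BY_COUNTRY.get(country, [])
    index = {_normalize(candidate): candidate for candidate in candidates}
    return index.get(_normalize(name))


# The same corrected name is confirmed only against the matching country.
print(confirm_customer("ACME  shipping co", "US"))
print(confirm_customer("ACME  shipping co", "GR"))
```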
What is needed is a means to select customized character recognition processes and customized coded data error correction processes which are reasonably likely to be needed to automatically process a batch of document forms whose fields have certain, anticipated, uniform characteristics.