This invention generally relates to the automatic detection and selection of color-classified typed or handwritten text on printed pages for entry into data base fields in machine-readable form through the use of an electronic scanner with an optical character recognition (OCR) routine.
Optical character recognition computer routines are well known and have for some time been used for extracting typed or handwritten text from pre-printed forms as well as from free form text documents. Of these two types of documents, only the first has been used for automatically extracting text for inclusion into data base files where the information is classified by content into data base fields.
References on optical character recognition abound, and include Vincent et al. U.S. Pat. No. 5,010,580, issued on Apr. 23, 1991; Peter Rudak U.S. Pat. No. 5,014,328, issued on May 7, 1991; Harold E. Clark U.S. Pat. No. 3,938,088, issued on Feb. 10, 1976; Masami Kurata U.S. Pat. No. 4,479,242, issued on Oct. 23, 1984; Lesnick et al. U.S. Pat. No. 4,760,606, issued on Jul. 26, 1988; and Maring et al. U.S. Pat. No. 4,812,904, issued on Mar. 14, 1989.
There are presently two basic technologies for extracting data into data base files using optical character recognition routines. Both technologies include the use of an electronic scanner, a host computer system, optical character recognition software and some form of human intervention for the purpose of classifying the data for the appropriate data base field. Of the two methods, the one most commonly used for high volume processing relies on pre-printed forms.
Referring to FIGS. 1a and 1b, a pre-printed form 21 is shown, having pre-printed headings 23 existing within well defined regions 25, 27, 29, 31, 33, 35, 37, 39, and 41 corresponding to the data entry fields name, address, city, telephone No., occupation, salary, education, zip code and state, respectively. These fields are filled in with sample information in FIG. 1a to illustrate the blocked field regions as they would actually be scanned. FIG. 1b is an illustration of a template 43 and is devoid of information to more clearly identify the regions 25-41. In FIG. 1a, one or more areas of the form 21 may exist for identifying the form type, or other designations.
The regions 25-41 are typically boxed to show the person filling out the form where the data belonging within the regions 25-41 is to be written. In setting up the process initially, a forms extraction portion of the optical character recognition software is configured to scan and convert only those characters which lie within the regions 25-41. The forms extraction software allows the user to build a template 43 which has a screen appearance similar to that shown in FIG. 1b.
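The zone-restricted conversion described above can be illustrated with a minimal sketch. It assumes the recognition routine returns words together with page coordinates. The names Region, Word and extract_fields, the pixel coordinates and the sample words are all hypothetical and are not taken from any particular forms extraction product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Rectangular zone on the page (hypothetical units: pixels)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


@dataclass(frozen=True)
class Word:
    """One recognized word and the position of its center (assumed OCR output)."""
    text: str
    x: int
    y: int


def extract_fields(template, words):
    """Assign each word to the template region containing it; drop all others."""
    fields = {name: [] for name in template}
    for word in words:
        for name, region in template.items():
            if region.contains(word.x, word.y):
                fields[name].append(word.text)
                break
    return {name: " ".join(parts) for name, parts in fields.items()}


# Hypothetical template holding two of the fields named for FIG. 1b.
TEMPLATE = {
    "name": Region(0, 0, 400, 50),
    "city": Region(0, 100, 200, 150),
}

# Hypothetical recognizer output for one scanned form.
WORDS = [
    Word("John", 40, 25),
    Word("Smith", 120, 25),
    Word("Remarks", 500, 25),  # outside every region, so it is ignored
    Word("Dallas", 60, 125),
]

result = extract_fields(TEMPLATE, WORDS)
print(result)
```

In this sketch, text lying outside every region is simply discarded, which is what makes the approach fast but also dependent on the data being written in the expected place.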
The advantages of this system are its simplicity, speed and high degree of accuracy. This system is somewhat desirable when large numbers of data forms are to be converted. Typical users might include the Internal Revenue Service for use on tax forms, medical offices and hospitals with patient records, or insurance companies with claim forms. The disadvantage of this system is that the data must be located in the regions 25-41, which relate to a classification mask or template 43, represented in FIG. 1b, that is programmed to correspond to a data base file.
An alternate system, currently available from several companies, uses the same basic concept of creating what is essentially a template to classify the target data and is illustrated in FIG. 2. A typical computer screen 51 with a mouse cursor pointer 53 enables an operator to select a region or target zone 55 containing text 57 to be converted by optical character recognition into ASCII text for inclusion into a data base.
Typically, the operator selects the region by dragging a box 59 around the desired text image. Once all of the regions on the form are defined, the combined group of regions can be saved as a template for use at a later time. A good example of this method is a pre-printed mailing list or a telephone book where the label format or page layout is constant.
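As a minimal sketch of saving a group of operator-defined regions as a reusable template, the following assumes each dragged box is reported as two corner points in screen pixels. The function names, field names, coordinates and the choice of JSON as the storage format are all hypothetical.

```python
import json


def box_from_drag(start, end):
    """Normalize two drag corners into (left, top, right, bottom)."""
    (x0, y0), (x1, y1) = start, end
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def template_to_json(regions):
    """Serialize the named regions so they can be stored for later use."""
    return json.dumps(regions, sort_keys=True)


def template_from_json(text):
    """Restore a stored template; JSON lists are converted back to tuples."""
    return {name: tuple(box) for name, box in json.loads(text).items()}


regions = {
    "name": box_from_drag((300, 20), (10, 60)),    # dragged right to left
    "phone": box_from_drag((10, 80), (300, 120)),  # dragged left to right
}

saved = template_to_json(regions)
restored = template_from_json(saved)
print(restored)
```

Normalizing the corners means the stored box is the same regardless of the direction in which the operator dragged the pointer.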
The most significant difference from the system described above is that the data does not come from pre-printed forms. In this system, as long as the data is located in the same region from page to page, a template can be created which works similarly to the above described system.
The problems with both of the above described systems include the requirement that the text to be optically character recognized reside in a defined area of the page to be scanned. If there is a variance in a batch of forms received from a printer, or if the forms are copied in a way that causes spatial distortion, whole batches of data will be unreadable. In the alternative, a new template must be formed for each set of forms which are at variance with the originally formed template.
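The sensitivity to spatial variance can be seen in a minimal sketch. It assumes a single fixed zone and a single recognized word. The zone coordinates, the sample value and the 30 pixel vertical offset standing in for a copying distortion are all hypothetical.

```python
def in_box(box, x, y):
    """Return True if the point (x, y) lies inside the rectangular zone."""
    left, top, right, bottom = box
    return left <= x < right and top <= y < bottom


def read_zone(box, words, dx=0, dy=0):
    """Collect the words whose position, after an optional page offset,
    falls inside the fixed zone."""
    return " ".join(
        text for text, x, y in words if in_box(box, x + dx, y + dy)
    )


SALARY_BOX = (100, 200, 300, 240)   # hypothetical zone for the salary field
WORDS = [("45,000", 150, 220)]      # hypothetical recognized word and position

aligned = read_zone(SALARY_BOX, WORDS)          # form matches the template
shifted = read_zone(SALARY_BOX, WORDS, dy=30)   # copy shifted 30 pixels down

print(repr(aligned), repr(shifted))
```

With the offset applied, the word falls below the zone and the field comes back empty, even though the text itself is perfectly legible.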
Neither of the above described systems can handle free format information. If forms are not available, and there is not sufficient time to make new forms, the collector of the data must either wait until the next available opportunity to gather the data, or collect the data in an unformatted configuration for subsequent copying into a formatted configuration by a clerk, a task which makes poor use of labor hours.
Further, in cases where data is received from several sources in an unformatted form, the above two systems are useless for scanning in the information unless it is first recopied onto a form. A scanning system which eliminates the human step of transcribing data letter by letter from an unformatted data source is therefore highly desirable.
In addition to the above described systems which extract data from forms, artificial intelligence systems are being used for extracting information from free form text documents. One of these systems is advertised and described by Resumix Corporation and is a part of their product line. Their literature describes the use of a knowledge base to select key words in a scanned text; the key words are then used to assign a particular portion of the data to a data field of a data base.
There are several disadvantages to this type of system, such as the slow speed and consequently longer time it takes to process a document, the cost of the system, and the setup time required to adequately create and fully characterize a custom knowledge base to recognize pertinent key words for a particular application. In most applications utilizing this system, significant human intervention is required in reading the document to decide what action needs to be taken. This task is usually performed on a computer screen.
A major disadvantage of this system is the potential for mischaracterization of the extent of the data to be included within the field. For example, when a key word is recognized, the device must make some arbitrary decision regarding whether or not all of the data surrounding the key word is included in the field. It may include too much information or too little information. Such mischaracterization requires additional significant human intervention in checking the character recognized data.
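The difficulty of deciding how much text belongs to a key word can be illustrated with a deliberately simple sketch. It is not the method of any commercial product: it merely assumes that a fixed number of words following the key word is taken as the field. The function name, the sample text and the window sizes are hypothetical.

```python
def field_after_keyword(text, keyword, window):
    """Return the `window` words that follow the first occurrence of
    `keyword`, or an empty string if the key word is absent."""
    words = text.split()
    # Compare case-insensitively, ignoring punctuation at the word edges.
    lowered = [w.lower().strip(":,.") for w in words]
    if keyword not in lowered:
        return ""
    start = lowered.index(keyword) + 1
    return " ".join(words[start:start + window])


# Hypothetical free form text containing two key words.
TEXT = "Education: B.S. Electrical Engineering, State University. Salary: 45,000"

too_little = field_after_keyword(TEXT, "education", 2)
too_much = field_after_keyword(TEXT, "education", 6)

print(too_little)
print(too_much)
```

A window of two words cuts the education entry short, while a window of six words runs on into the next key word, so either choice leaves a result that a person must check and correct.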