1. Field of the Invention
This invention relates generally to the automated identification of specific forms and documents (hereinafter target forms). In particular, the invention provides for an expedited data capture process using optical imaging technology. By allowing target forms to be automatically identified during the data capture process, an assurance is attained that proper data is captured and the necessity of preprocess sorting of forms is eliminated.
2. Description of the Prior Art
Data capture, a process whereby form data is copied in some manner for input to a database, is a chore many companies undertake for a variety of reasons. For instance, medical offices need to track their patients and put together certain statistical data. The information needed is gleaned from standard forms filled out during each office visit, put into a back office database, and retrieved in some manner for its intended purpose.
The manual processing of forms is slow and inefficient. This process requires the operator to manually read data off the form and type it directly into the database. The full potential of computers and other digital technologies is unrealized.
In recent years, with the advent of optical imaging capabilities and optical character recognition (OCR) software, data placed on a form can be digitized by such instruments as a scanner or fax machine and the digitized data can be interpreted as text by the OCR software. This OCR software has been embedded into certain data capture software applications (application software) to achieve an automated process that cuts down on the operator's time and improves efficiency. Now the operator need only feed a form through a scanning device. The application software converts the digitized images to text and enters the text into the database as the software directs. Recognition of the digitized images is extremely accurate. Some application software identifies misrecognized text and allows the operator to correct it.
The efficiency of the data capture process has improved dramatically over the years, but there are still problems. The application software used today takes data from specified fields of the target form for input into specified fields of the database. Therefore, the application software has to be developed or set up to accommodate a particular form or other similar document type. If what is scanned is not the form intended, the database will receive erroneous data. This occurs frequently when other forms or attachments are mixed in with the stack of forms to be processed. These other forms or attachments may be complementary (complementary documents) to the form subject to data capture (target form), but are nonetheless extraneous and create inefficiencies in this process. To overcome the disadvantage of mixed-in complementary documents, a method to identify the target form prior to the data capture process should be implemented.
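The field-to-field capture described above, and the way a mixed-in complementary document corrupts the database, may be illustrated by the following minimal sketch. The zone layout, the column names, and the function `capture` are hypothetical and are not taken from any particular application software; a page is modeled simply as a list of already-recognized text rows.

```python
# Hypothetical field zones for one target form:
# database column -> (row index, first character, one past last character)
FIELD_ZONES = {
    "patient_name": (0, 0, 12),
    "visit_date": (1, 0, 10),
}


def capture(page_rows, zones=FIELD_ZONES):
    """Copy the text found in each fixed zone into the matching database column.

    No check is made that the page is actually the target form, so whatever
    text happens to sit in the zone is captured.
    """
    record = {}
    for column, (row, start, end) in zones.items():
        text = page_rows[row][start:end] if row < len(page_rows) else ""
        record[column] = text.strip()
    return record


target_form = ["Jane Doe    ", "2001-04-17"]
attachment = ["Dear Dr. Smith, please", "find enclosed the lab"]

good_record = capture(target_form)
bad_record = capture(attachment)  # letter fragments land in the database fields
```

In this sketch the attachment is processed without complaint, which is the failure the paragraph above describes: nothing in zone-based capture itself distinguishes a target form from an extraneous page.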
One such attempt to identify target forms for the purpose of proper data capture is taught in U.S. Pat. No. 5,293,429, by Pizano, et al., entitled, “System and Method For Automatically Classifying Heterogeneous Business Forms,” issued Mar. 8, 1994 (429 patent). In this patent, form identification is performed through a pattern recognition system that matches the form to one of a predefined set of templates. These templates are exemplars of the forms to be processed. They are scanned, analyzed and stored in a data dictionary for reference. Each of the templates has a unique pattern described by the horizontal and vertical lines that define the form. A recognition phase consists of scanning the data-filled form and matching extracted features of the digitized image, consisting of a set of predefined vertical and horizontal lines, against the set of templates stored in the data dictionary. This is commonly referred to as line template matching. When a match is made against one of the templates, the form is identified and the data capture process begins.
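Line template matching of the general kind described above may be sketched as follows. This is an illustrative simplification only and is not the method claimed in the 429 patent: the `Line` representation, the pixel tolerance, the scoring rule, and the acceptance threshold are all assumptions made for the example, and line extraction from the digitized image is assumed to have been done already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Line:
    orientation: str  # "h" for horizontal, "v" for vertical
    position: int     # y coordinate for "h", x coordinate for "v" (pixels)
    start: int        # where the line begins along its own axis
    end: int          # where the line ends along its own axis


def lines_match(a, b, tol=3):
    """Two lines match if they agree in orientation and lie within tol pixels."""
    return (a.orientation == b.orientation
            and abs(a.position - b.position) <= tol
            and abs(a.start - b.start) <= tol
            and abs(a.end - b.end) <= tol)


def match_score(template_lines, scanned_lines, tol=3):
    """Fraction of template lines that have a counterpart in the scanned page."""
    if not template_lines:
        return 0.0
    found = sum(any(lines_match(t, s, tol) for s in scanned_lines)
                for t in template_lines)
    return found / len(template_lines)


def identify_form(scanned_lines, data_dictionary, threshold=0.9):
    """Return the name of the best-matching template, or None if none is close."""
    best_name, best_score = None, 0.0
    for name, template_lines in data_dictionary.items():
        score = match_score(template_lines, scanned_lines)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Hypothetical data dictionary built from two exemplar forms.
data_dictionary = {
    "claim": [Line("h", 100, 50, 500), Line("v", 50, 100, 400)],
    "invoice": [Line("h", 200, 50, 500), Line("h", 300, 50, 500)],
}
scanned = [Line("h", 101, 51, 499), Line("v", 49, 100, 401)]
result = identify_form(scanned, data_dictionary)
```

Note that a page from which no lines can be extracted yields `None` under this sketch, since there is nothing to compare against the stored templates.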
The disadvantage of this type of system is that it is limited to forms that use scannable form features. Many forms today are scanned using dropout scanning. Under this process, form lines, preprinted text and other markings (form features) are drawn in a color similar to the light source used in the scanning device. The scanning device is unable to optically detect images that are in a color similar to its own light source. The purpose of this type of scanning is to prevent misrecognition of data entry characters due to typing or writing on or near the form features. The OCR interpreter's ability to recognize characters decreases substantially when the characters are interfered with, i.e., when the lines, markings or preprinted text from the form overlap or approach the entered data. Dropout scanning prevents this from occurring since it only “sees” the data entry characters and not the form features. However, it also prevents the type of business form identification process described in the 429 patent.
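The effect of dropout scanning may be illustrated with a simplified model. The sketch below is an assumption-laden approximation, not a description of any actual scanning device: pixels are RGB triples, the function `dropout_scan` and its `tolerance` parameter are hypothetical, and a pixel is treated as undetected when its color is close either to white paper or to the color of the light source.

```python
WHITE = (255, 255, 255)


def dropout_scan(pixels, light_rgb, tolerance=60):
    """Return a binary image: 1 where a mark is detected, 0 where it is not.

    Simplified model: a mark whose color is within `tolerance` of the light
    source color on every channel looks like blank paper to the scanner.
    """
    def close(c1, c2):
        return all(abs(a - b) <= tolerance for a, b in zip(c1, c2))

    return [[0 if close(px, WHITE) or close(px, light_rgb) else 1
             for px in row]
            for row in pixels]


RED_LIGHT = (255, 0, 0)
FORM_RED = (230, 30, 30)   # form feature printed in a dropout color
BLACK_INK = (0, 0, 0)      # data entry character

page = [
    [WHITE, FORM_RED, BLACK_INK],
    [FORM_RED, FORM_RED, WHITE],
]
scanned = dropout_scan(page, RED_LIGHT)
```

Under this model only the black data entry pixel survives, so the second row, which consists solely of a form line, appears blank. That is why a method relying on extracted form lines has nothing to match against when dropout scanning is used.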
U.S. Pat. No. 5,937,084, by Crabtree, et al., entitled, “Knowledge-based Document Analysis System”, issued Aug. 10, 1999 (084 patent), describes another method of identifying forms. The 084 patent describes a system and process whereby extracted features from a subject document are statistically compared with those of sample documents. Under this patent, the compared features are not limited to horizontal and vertical lines. The features include machine print and hand print. The disadvantages of the 084 patent arise with forms that have variable data fields and use dropout scanning. Although the 084 patent may focus on the print of the form for identification, it can only be print that is invariable. Thus, the print must be part of the form itself or data that can only be entered in a singular manner. In the former case, use of dropout scanning would prevent form identification if the print were in color since the scanning device would not “see” the print. In the latter case, only forms having data fields that do not require variable data input could be identified. Furthermore, if dropout scanning were not used, misrecognitions would be more frequent due to interference with the form features.