1. Field of the Invention
The present invention relates to a system and method to identify and extract data from data forms by identifying data containment locations on the form, classifying the data containment locations to identify data containment locations of interest, and performing a match between recognition results and a predefined set of known labels or data formats to classify and return a data containment label, coordinates and the recognition value of interest.
2. Description of the Related Art
The automation of data processing has experienced significant growth as technology has developed faster and more accurate hardware and software for inputting and outputting data, processing data, and other data handling. However, a significant difficulty encountered with even the most advanced data processing systems is the reliance upon data forms. Such data forms have remained substantially unchanged since their conception, and have become familiar and widely used. For example, data forms associated with tax payments and tax processing, such as W-2 forms and 1099 forms, are widely used and processed in large volumes, and typically remain unchanged from one year to the next.
In contrast, data processing systems and technologies advance significantly from one year to the next. Unfortunately, the most sophisticated data processing systems must still be compatible to some degree with these data forms. One such aspect of compatibility is the use and implementation of data processing systems to input data to such forms, capture data from such forms, and otherwise process such forms as desired. For example, a person may complete or submit a completed tax form, such as a W-2 form or a 1099 form, which includes a large amount of data of various levels of importance to a specific data processing operation. The number of completed forms requiring rapid and accurate processing can be significant, and conventional data processing may require the manual capture and transfer of data between forms and data processing systems. Therefore, a need exists for a system and method to facilitate such processing with greater speed and accuracy and with less labor. One option is the creation and provision of improved automated data capturing from such forms to increase throughput.
Data extraction improvements have been described in a number of documents, including, for example, U.S. Patent Publication No. 2012/0027246 issued to Tifford et al. and U.S. Patent Publication No. 2008/0267505 issued to Dabet et al., which describe a widely used technique for basic data extraction, namely the use of optical character recognition (OCR). OCR is a technology well known to those skilled in the art for use in text identification and extraction from various documents.
Many other attempts have been made to improve upon the automated data capture of Tifford and Dabet. For example, U.S. Patent Publication No. 2010/0161460 issued to Vroom et al. describes a software module which receives source documents, recognizes or extracts information from the documents and, even further, associates the extracted data with particular types or fields of information, for example, fields or locations of a form, such as various tax related forms, e.g., W-2, 1098, 1099. That is, in many such applications using optical character recognition (OCR) as in Tifford and Dabet, data locations must be known or identified in some manner to facilitate the OCR operations. Therefore, in these applications, the use of forms which can be readily identified and which have known data locations is key for automated data capturing. Accordingly, many further systems are directed to identifying the forms being processed.
For example, U.S. Patent Publication No. 2007/0033118 issued to Hopkinson describes a system and method wherein processed forms are first recognized, and then printed materials of specific regions are extracted and used. Identification of the form is performed in some manner in these systems to facilitate the optical character recognition (OCR) operations, such as in U.S. Patent Publication No. 2009/0097700 issued to Fishback et al. The Fishback reference describes a system and method wherein the identity of the form is found using a comparison with a library of forms, and then printed materials of specific regions are extracted and used. U.S. Pat. No. 7,930,226 issued to Quinn et al. and U.S. Patent Publication No. 2008/0062472 issued to Garg et al. simply allow a user to identify the form being input and from which data is extracted and used.
In a similar manner, U.S. Pat. No. 7,769,646 issued to Wyle describes a system and method wherein the identity of the form is found using identification codes, and U.S. Pat. No. 5,937,084 issued to Crabtree et al. describes a system and method wherein the identity of the form is found using a comparison with models representative of specific forms. U.S. Patent Publication No. 2011/0258195 issued to Welling et al. and related U.S. Patent Publication No. 2009/0116755 issued to Neogi et al. describe a system and method wherein the identity of the form is found using a comparison of the processed form with expected layouts, and in each, the data of specific regions is extracted and then used. The Welling reference further describes using line intersections to aid in the comparison of the processed form with expected layouts.
Still further, comparisons between the region and text can be used to identify and extract data as described in the system and method of U.S. Pat. No. 7,840,891 issued to Yu et al. The Yu reference describes an element extractor and content engine to access extracted elements, identify a contextual relationship between extracted elements, and relate the extracted elements to create a representation of the form. The entire patent disclosures identified above are hereby incorporated herein by reference.
However, in each of the systems described above, improvements are still needed in the ability to identify data containment locations and narrow data extraction from identified data containment locations to only data of interest. For example, a system and method is needed to scan and identify regions of a form quickly and accurately, identify and classify data of the region, perhaps using values used in the identification of the region of the form.
Moreover, one of many problems associated with such forms, and not fully addressed by the above patent disclosures, is that the data is semi-constrained data, i.e., the set of data to capture is known, but the location of that data might vary between forms. For example, due to the number of variations in W-2 forms or 1099 forms, it can be impractical to create, for every variation, the individual templates, databases or comparison tools used in many of the patent disclosures identified above.
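By way of illustration only, the following minimal Python sketch shows one way semi-constrained data might be handled: recognized text is matched against a predefined set of known labels regardless of where on the form it was found. The label set `KNOWN_LABELS`, the function `classify_label` and the 0.8 similarity threshold are hypothetical assumptions made for this example and are not taken from any of the disclosures discussed above.

```python
from difflib import SequenceMatcher

# Hypothetical label set for illustration; a real system would hold many more entries.
KNOWN_LABELS = {
    "wages": "Wages, tips, other compensation",
    "fed_tax": "Federal income tax withheld",
    "ss_wages": "Social security wages",
}


def classify_label(recognized_text, threshold=0.8):
    """Return (key, score) for the best-matching known label.

    The key is None when no known label reaches the similarity threshold.
    The position of the text on the form plays no part in the match.
    """
    # Normalize case and whitespace so minor layout differences do not matter.
    cleaned = " ".join(recognized_text.lower().split())
    best_key, best_score = None, 0.0
    for key, label in KNOWN_LABELS.items():
        score = SequenceMatcher(None, cleaned, label.lower()).ratio()
        if score > best_score:
            best_key, best_score = key, score
    if best_score < threshold:
        return None, best_score
    return best_key, best_score
```

In this sketch, approximate matching tolerates small recognition errors, for example a lowercase "l" misread as an uppercase "I", while text that resembles no known label is rejected.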
Nonetheless, there are characteristics of W-2 forms and 1099 forms that are sufficiently common to allow their use to address the problems of automated data extraction. First, the forms are typically black and white, and thus, are not affected by any color filtering during operations such as scanning. Second, in most cases, the data on W-2 forms and 1099 forms are machine-printed. Third, the majority of W-2 forms and 1099 forms organize data in boxes created by the intersection of vertical and horizontal lines, and each box typically contains a label identifying the box as well as actual data associated with the label in some manner.
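The third characteristic can be illustrated, by way of example only, with a minimal Python sketch that derives boxes from the positions of vertical and horizontal lines and finds the box containing a given point, such as the position of a recognized word. The sketch assumes that the lines have already been detected and that they form a complete rectangular grid, which real forms often do not; the names `Box`, `boxes_from_grid` and `find_box` are hypothetical and introduced only for this example.

```python
from typing import List, NamedTuple, Optional


class Box(NamedTuple):
    """A rectangular region bounded by two vertical and two horizontal lines."""
    left: int
    top: int
    right: int
    bottom: int


def boxes_from_grid(xs: List[int], ys: List[int]) -> List[Box]:
    """Build boxes from vertical line x-positions and horizontal line y-positions.

    Assumes every vertical line crosses every horizontal line (a full grid).
    Boxes are returned row by row, left to right.
    """
    xs = sorted(set(xs))
    ys = sorted(set(ys))
    return [
        Box(left, top, right, bottom)
        for top, bottom in zip(ys, ys[1:])
        for left, right in zip(xs, xs[1:])
    ]


def find_box(boxes: List[Box], x: int, y: int) -> Optional[Box]:
    """Return the first box containing point (x, y), or None if no box does."""
    for box in boxes:
        if box.left <= x < box.right and box.top <= y < box.bottom:
            return box
    return None
```

Under these assumptions, the words recognized inside a single box could be grouped together, after which the label and the data associated with that label could be separated.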
Accordingly, a need exists for a system and method for improving automated data identification and extraction from any number of known data forms by rapidly processing forms, identifying data containment locations on the form, classifying data containment locations on the form and recognizing data in the locations to identify locations and data of interest.