Millions of documents are produced every day that are reviewed, processed, stored, audited and transformed into computer-readable data. Examples include accounts payable, collections, educational forms, financial statements, government documents, human resource records, insurance claims, legal papers, medical records, mortgages, nonprofit reports, payroll records, shipping documents and tax forms.
These documents generally require data to be extracted in order to be processed. Data extraction can be primarily clerical in nature, such as in inputting information on customer survey forms. Data extraction can also be an essential portion of larger technical tasks, such as preparing income tax returns, processing healthcare records or handling insurance claims.
Various techniques, such as Electronic Data Interchange (EDI), attempt to eliminate human processing efforts by coding and transmitting the document information in strictly formatted messages. Electronic Data Interchange is known for custom computer systems, cumbersome software and bloated standards that defeated its rapid spread throughout the supply chain. Because EDI is perceived as too expensive, the vast majority of businesses have avoided implementing it. Similarly, applications of XML, XBRL and other computer-readable document files are quite limited compared to the use of documents in paper and digital image formats (such as PDF and TIFF).
Ideally, these documents would be capable of being both read by people and automatically processed by computers. Since paper and digital image files comprise an overwhelming percentage of all documents, it would be most practical to train computers to extract data from human-readable documents.
To date, there have been three general methods of performing data extraction on documents: conventional, outsourcing and automation.
Conventional data extraction, the first method, requires workers with specific education, domain expertise, particular training, software knowledge and/or cultural understanding. Data extraction workers must recognize documents, identify and extract relevant information on the documents and enter the data appropriately and accurately in particular software programs. Such manual data extraction is complex, time-consuming and error-prone. As a result, the cost of data extraction is often quite high; numerous studies estimate the cost of processing invoices in excess of ten dollars each. The cost is especially high when the data extraction is performed by accountants, lawyers, physicians and other highly paid professionals as part of their work. For example, professional tax preparers report spending hours on each client tax return transcribing salary, interest, dividend and capital gains data; they also admit to human data extraction/entry accuracies of less than 90%.
Conventional data extraction also exposes all documents in their entirety to data extraction workers. These documents may have sensitive information related to individuals' and organizations' education, employment, family, financial, health, insurance, legal, tax, and/or other matters.
Whereas conventional data extraction is entirely paper-based, outsourcing and automation begin by converting paper to digital image files. This step is straightforward, aided by high quality, fast, affordable scanners that are available from many vendors including Bell+Howell, Canon, Epson, Fujitsu, Kodak, Panasonic and Xerox.
Once paper documents are converted to digital image files, document processing can be made more productive through the use of workflow software that routes the documents to the lowest-cost labor available, either in-house or outsourced, on-shore or overseas. Primary processing can be done by junior personnel; exceptions can be handled by more highly trained people. Despite the potential productivity gains that workflow software enables in the form of improved labor utilization, manual document processing remains a fundamentally expensive process.
Outsourcing, the second method of data extraction, requires the same worker education, expertise, training, software knowledge and/or cultural understanding. As with conventional data extraction, outsourced data extraction workers must recognize documents, find relevant information on the documents, and extract and enter the data appropriately and accurately in particular software programs. Since outsourcing is manual, just as conventional data extraction is, it is also complex, time-consuming and error-prone. Outsourcing firms such as Accenture, Datamatics, Hewlett Packard, IBM, Infosys, Tata and Wipro often reduce costs by offshoring data extraction work to locations with low-wage data extraction workers. For example, extraction of data from US tax and financial documents is a function that has been implemented using thousands of well-educated, English-speaking workers in India and other low-wage countries.
The first step of outsourcing requires organizations to scan financial, health, tax and/or other documents and save the resulting image files. These image files can be accessed by data extraction workers via several methods. One method stores the image files on the source organizations' computer systems; the data extraction workers view the image files over networks (such as the Internet or private networks). Another method stores the image files on third-party computer systems; the data extraction workers view the image files over networks. An alternative method transmits the image files from source organizations over networks and stores the image files for viewing by the data extraction workers on the data extraction organizations' computer systems.
For example, an accountant may scan the various tax forms containing client financial data and transmit the scanned image files to an outsourcing firm. An employee of the outsourcing firm extracts the client financial data and enters it into an income tax software program. The resulting tax software data file is then transmitted back to the accountant.
Quality problems with offshore data extraction work have been reported by many customers. Outsourced service providers address these problems by hiring better educated and/or more experienced workers, providing them more extensive training, extracting and entering data two or more times and/or exhaustively checking their work for quality errors. These measures reduce the cost savings expected from offshore outsourcing.
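The practice of extracting and entering data two or more times can be illustrated with a minimal sketch. It assumes two workers key the same document independently and that a field is accepted only when both entries agree; the function name `reconcile` and the field names are hypothetical and do not come from any particular vendor's system.

```python
from typing import Dict, List, Tuple


def reconcile(first_pass: Dict[str, str],
              second_pass: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Compare two independent keyings of the same document.

    Returns the fields on which both passes agree, plus a sorted list of
    fields that disagree or are missing and so need a further review.
    """
    accepted: Dict[str, str] = {}
    disputed: List[str] = []
    for field in sorted(set(first_pass) | set(second_pass)):
        a = first_pass.get(field)
        b = second_pass.get(field)
        if a is not None and a == b:
            accepted[field] = a      # both workers entered the same value
        else:
            disputed.append(field)   # mismatch or omission: send to review
    return accepted, disputed


# Illustrative values only; these are not real records.
worker_a = {"payer": "Acme Corp", "interest": "1204.15"}
worker_b = {"payer": "Acme Corp", "interest": "1240.15"}
accepted, disputed = reconcile(worker_a, worker_b)
```

Every disputed field requires at least a third look, which is one reason such quality measures erode the expected cost savings.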
Outsourcing and offshoring are accompanied by concerns over security risks associated with fraud and identity theft. These security concerns apply to employees and temporary workers as well as outsourced workers and offshore workers who have access to documents with sensitive information.
Although the transmission of scanned image files to the data extraction organization may be secured by cryptographic techniques, the sensitive data and personal identifying information are in the clear, i.e., unencrypted, when read by data extraction workers prior to entry in the appropriate computer systems. Data extraction organizations publicly recognize the need for information security. Some data extraction organizations claim to investigate and perform background checks of employees. Many data extraction organizations claim to strictly limit physical access to the rooms in which the employees enter the data; further, such rooms may be isolated. Paper, writing materials, cameras or other recording technology may be forbidden in the rooms. Additionally, employees may be subject to inspection to ensure that nothing is copied or removed. Since such seemingly comprehensive security precautions are primarily physical in nature, they are imperfect.
Because of these imperfections, lapses in physical security have occurred. Social Security Numbers and bank routing numbers are only nine digits; bank account numbers are usually of similar length. Memorizing these important numbers would not be difficult and would allow a nefarious employee to have direct access to the money held in those accounts. For example, in 2004 employees of MphasiS in Pune, India allegedly stole $426,000 from Citibank customers. The owners, managers, staff, guards and contractors of data extraction organizations may misuse some or all of the unencrypted confidential information in their care. Further, breaches of physical and information system security by external parties can occur. Because data extraction organizations are increasingly located in foreign countries, there is often little or no recourse for American citizens victimized in this manner.
Information security has been identified for seven consecutive years as the most important technology initiative by the Top Technology Initiatives survey of the American Institute of Certified Public Accountants (AICPA). National and state laws have been enacted and new regulations have been implemented to address these security concerns, particularly those related to outsourced data extraction that is performed offshore.
The third general method of data extraction involves partial automation, often combining optical character recognition, human inspection and workflow management software.
Software tools that facilitate the automated extraction and transformation of document information are available from several vendors including ABBYY, AnyDoc Software, EMC Captiva, Kofax and Nuance. The relative operating cost savings facilitated by these tools is proportional to the amount of automation, which depends on the application, quality of software customization, variety and quality of documents and other factors.
Automation requires customizing and/or programming data extraction software tools to properly recognize and process a specific set of documents for a specific domain. Because such customization projects often cost upwards of hundreds of thousands of dollars, data extraction automation is usually limited to large organizations that can afford significant capital investments.
The first step of a partially automated data extraction operation is to scan financial, health, tax and/or other documents and save the resulting image files. The scanned images are compared to a database of known documents. Images that are not identified are routed to data extraction workers for conventional processing. Images that are identified have data extracted using templates, either location-based or label-based, along with optical character recognition (OCR) technology.
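The flow just described (identify the document, route unidentified images to workers, extract with a label-based template) can be sketched as follows. This is a simplified illustration, not any vendor's implementation: it assumes OCR has already produced plain text, and the `TEMPLATES` table, its signature phrases and its field patterns are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical database of known documents. Each entry holds phrases that
# identify the form and label-based patterns that locate its fields.
TEMPLATES = {
    "W-2": {
        "signature": ["wage and tax statement"],
        "fields": {
            "wages": r"wages, tips, other compensation\s+([\d,]+\.\d{2})",
            "federal_tax_withheld": r"federal income tax withheld\s+([\d,]+\.\d{2})",
        },
    },
}


def identify(ocr_text: str) -> Optional[str]:
    """Return the first template whose signature phrases all appear."""
    lowered = ocr_text.lower()
    for name, template in TEMPLATES.items():
        if all(phrase in lowered for phrase in template["signature"]):
            return name
    return None


def process(ocr_text: str) -> dict:
    """Extract labeled fields, or route the document to a human worker."""
    name = identify(ocr_text)
    if name is None:
        # Unidentified image: fall back to conventional processing.
        return {"route": "manual", "form": None, "data": {}}
    data = {}
    for field, pattern in TEMPLATES[name]["fields"].items():
        match = re.search(pattern, ocr_text, flags=re.IGNORECASE)
        if match:  # a label garbled by OCR simply yields no value
            data[field] = float(match.group(1).replace(",", ""))
    return {"route": "automatic", "form": name, "data": data}


# Illustrative OCR output; the amounts are made up.
sample = (
    "Form W-2 Wage and Tax Statement\n"
    "Wages, tips, other compensation 52,300.00\n"
    "Federal income tax withheld 6,120.50\n"
)
result = process(sample)
```

A location-based template would instead read each field from fixed page coordinates, which avoids dependence on readable labels but breaks when a form's layout shifts.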
Optical character recognition is imperfect, often mistaking more than one percent of the characters on clean, high quality documents. Many documents are neither clean nor high quality, suffering from being folded or marred before scanning, distorted during scanning and degraded during post-scanning binarization. As a result, some of the labels needed to identify data are often not recognizable; therefore, some of the data cannot be automatically extracted.
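A rough calculation shows why a character error rate near one percent matters at the field level. The sketch below assumes, purely for illustration, that character errors are independent and equally likely at every position; real OCR errors tend to cluster, so the numbers are indicative only. The function name `field_accuracy` is hypothetical.

```python
def field_accuracy(char_accuracy: float, length: int) -> float:
    """Probability that every character in a field is read correctly,
    assuming independent per-character errors (a simplifying assumption)."""
    if not 0.0 <= char_accuracy <= 1.0:
        raise ValueError("char_accuracy must be between 0 and 1")
    if length < 0:
        raise ValueError("length must be non-negative")
    return char_accuracy ** length


# A nine-digit identifier read at 99% per-character accuracy.
nine_digit = field_accuracy(0.99, 9)

# A label or data field of 30 characters at the same accuracy.
thirty_char = field_accuracy(0.99, 30)
```

Under these assumptions a nine-digit number is read without error roughly 91% of the time, and a 30-character field roughly 74% of the time, so even high character accuracy leaves many fields needing inspection.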
Using conventional software tools, vendors report being able to extract up to 80-90% of the data on a limited number of typical forms. When a wide range of forms exists, such as the 10,000 plus variations of W-2, 1099, K-1 and other personal income tax forms, automated data extraction is quite limited. Despite years of efforts, several tax document automation vendors claim 50% or less data extraction and admit to numerous errors with conventional data extraction methods.
Correcting errors entails human inspection. Inspection requires workers with the same capabilities as data extraction workers, namely specific education, domain expertise, particular training, software knowledge and/or cultural understanding. Inspection workers must recognize documents, find relevant information on the documents and ensure that the data has been accurately extracted and appropriately entered in particular software programs. Typically, any changes made by inspection workers must be reviewed and approved by other, more senior, inspection workers before replacing the data extracted by optical character recognition. Because automation requires human inspection, source documents with sensitive information are exposed in their entirety to data extraction workers.