Some digital documents are natively electronic, while others originate as paper documents that are later digitized, for example by scanning. Unlike a native document, a scanned document is typically an image file, which is essentially a photograph of the paper document.
Optical character recognition (“OCR”) can be used to convert the image file of the document into machine-encoded text. The resulting file is merely a transcription of the characters in the image file; it does not capture what those characters mean. For example, the OCR file may contain the characters “assets” and “$1,000,” but no relationship is created between them in which the variable “assets” is assigned the value “1,000.”
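The gap between recognized characters and meaningful data can be illustrated with a minimal Python sketch. The names `ocr_text`, `parse_amount`, and `structured` are hypothetical and chosen only for illustration; the one-line input is an assumed simplification of real OCR output.

```python
from decimal import Decimal

# Assumed OCR output: the characters are recognized, but nothing links
# the word "assets" to the amount that follows it.
ocr_text = "assets $1,000"


def parse_amount(text: str) -> Decimal:
    """Convert a currency string such as "$1,000" into a Decimal."""
    return Decimal(text.replace("$", "").replace(",", ""))


# What an import into another system needs: a variable bound to a value.
label, raw_value = ocr_text.split()
structured = {label: parse_amount(raw_value)}
```

Here the relationship is trivial to build only because the sketch assumes the label and the value are the sole two tokens. On a full page of OCR text, deciding which number belongs to which label is the hard part.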
Determining the meaning of data in an OCR file can be important. For example, the meaning may be needed to import the data into a third-party software system. For such an import, it can be difficult to determine where the value for each variable is positioned. Typically, the document must be mapped to establish the location of the value for each variable needed in the import. This type of mapping can be time-consuming and costly. Where documents share a standardized format, such as a standard form, the form can be mapped once and the mapping reused to extract data from every document in that format. With non-standardized documents, however, mapping each document is generally not feasible because the locations of variables and values are unpredictable.
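A location-based mapping of the kind described above might be sketched as follows. This is an illustration under assumptions, not any particular product's format: `Word`, `Region`, `TEMPLATE`, and `extract_by_template` are hypothetical names, and the coordinates are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Region:
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, word: Word) -> bool:
        return self.x0 <= word.x <= self.x1 and self.y0 <= word.y <= self.y1


# Hypothetical template: on this one standard form, the value for
# "assets" always appears inside a fixed rectangle on the page.
TEMPLATE = {"assets": Region(400, 100, 500, 120)}


def extract_by_template(words, template):
    """Return the first word found in each mapped region, or None."""
    result = {}
    for variable, region in template.items():
        hits = [w.text for w in words if region.contains(w)]
        result[variable] = hits[0] if hits else None
    return result


# A document in the standard format, and one with a different layout.
standard_form = [Word("Assets", 50, 110), Word("$1,000", 450, 110)]
other_layout = [Word("Total assets", 50, 310), Word("$1,000", 300, 310)]
```

Applied to `standard_form`, the template finds the value. Applied to `other_layout`, it finds nothing, because the amount sits outside the mapped rectangle. This is why a fixed mapping does not carry over to non-standardized documents.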
This disclosure relates to an automated solution for extracting this data from non-standardized financial documents (e.g., balance sheets and income statements) for import into financial software systems. The system includes a portal that allows a financial institution's customers to submit financial documents. The system allows a software subscriber to set up recurring mapping rules based upon the contents of the submitted statements, apply these mapping rules to statements each time they are submitted, and produce an extract of data formatted for integration into the financial institution's internal systems. Because this process is automated, the system creates an audit trail at each step, providing compliance-oriented monitoring of the data flow from the end customer into the financial institution.
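The combination of recurring mapping rules, a formatted extract, and an audit trail could look roughly like the sketch below. The rule format, the target field names, and the shape of the audit entries are all assumptions made for illustration; the disclosure does not specify them.

```python
from datetime import datetime, timezone

# Hypothetical recurring rules, set up once by the subscriber:
# statement label -> field name expected by the institution's system.
MAPPING_RULES = {
    "total assets": "TOTAL_ASSETS",
    "net income": "NET_INCOME",
}


def apply_rules(statement, rules, audit_log):
    """Map statement labels to target fields, logging every decision."""
    extract = {}
    for label, value in statement.items():
        target = rules.get(label)
        if target is not None:
            extract[target] = value
        # Assumed audit entry: one record per label, mapped or not.
        audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "event": "mapped" if target is not None else "unmapped",
            "label": label,
            "target": target,
        })
    return extract


audit_log = []
submitted = {"total assets": 1000, "cash": 50}
extract = apply_rules(submitted, MAPPING_RULES, audit_log)
```

The same `MAPPING_RULES` can be reapplied to each later submission, and `audit_log` keeps a record of which labels were mapped and which were left out of the extract.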
The system uses existing OCR technologies to map the contents of the pages of financial documents. However, the system consumes this map with a custom interpreter that makes assumptions based on the fact that the scanned documents are financial statements, which typically share some common formatting features. These assumptions allow the system to make sense of financial statements and extract the data properly without the extensive mapping process that is common in OCR scenarios. In other words, the system can interpret the OCR'ed data to assign certain numbers to variables for export. Data can therefore be extracted from a statement the first time it is submitted, without the expensive one-time effort of setting up a statement-specific template of the kind that conventional OCR interpreter technologies require.
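One such formatting assumption can be sketched in Python. The sketch assumes, purely for illustration, that each line item of a financial statement has its label on the left and its amount as the rightmost entry of the same row, and that parentheses denote a negative amount. The function names, the `(text, x, y)` input shape, and the row tolerance are hypothetical and are not taken from the disclosed interpreter.

```python
import re
from decimal import Decimal

# Amounts such as "$1,000", "1,000.50", or "(250)".
AMOUNT = re.compile(r"^\(?\$?\d[\d,]*(\.\d+)?\)?$")


def to_decimal(token):
    """Parse an amount; parentheses are treated as a negative value."""
    negative = token.startswith("(") and token.endswith(")")
    value = Decimal(token.strip("()").replace("$", "").replace(",", ""))
    return -value if negative else value


def interpret(words, row_tolerance=5.0):
    """Group (text, x, y) words into rows and pair labels with amounts."""
    rows = []
    for text, x, y in sorted(words, key=lambda w: w[2]):
        if rows and abs(rows[-1]["y"] - y) <= row_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})

    result = {}
    for row in rows:
        cells = [text for _, text in sorted(row["cells"])]
        # Keep only rows whose rightmost cell looks like an amount.
        if len(cells) >= 2 and AMOUNT.match(cells[-1]):
            result[" ".join(cells[:-1]).lower()] = to_decimal(cells[-1])
    return result


page = [
    ("Balance", 50, 20), ("Sheet", 110, 20),
    ("Total", 50, 100), ("assets", 90, 101), ("$1,000", 400, 99),
    ("Net", 50, 130), ("loss", 80, 130), ("(250)", 400, 131),
]
values = interpret(page)
```

No per-document template is involved: the title row is skipped because it does not end in an amount, and the two line items are recovered from their row position alone.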