Financial statements, such as balance sheets, income statements, cash flow statements, and the like, are commonly generated for businesses. Such statements may be formatted as tables of information, for example, in ASCII text, EBCDIC text, Microsoft Excel spreadsheets, PDF files, PostScript files, HTML documents, or the like. When reviewing such information, humans use inherent layout features, such as alignment and positioning, as clues for interpreting the logical meaning of the information contained therein. While such information is capable of being read and understood by a person, it may not be so easily read and understood by a computer. Therefore, and since human intervention is subject to error, it would be desirable to have a way to identify, extract, and break down the information contained in documents, such as financial statements, so that computers could be used to “understand” such documents. Such documents could then be reconstructed, if desired, into intermediate structured representations of the information contained therein, such as, for example, XML-formatted documents. Thereafter, the intermediate structured representations of the documents could be converted into various formats capable of being integrated with other systems, such as data warehouses and underwriting and origination systems. Having an intermediate structured format would significantly ease integration efforts by providing a single format from which all other formats could be derived. This would make exchanging information between parties and/or businesses much easier than is currently possible.
While there are currently systems and methods that allow some such documents to be understood, these systems and methods all impose certain constraints on the documents that are being submitted. For example, they may require that the documents be presented in a standardized format, or they may require that the system have pre-defined information about the format that is expected in the submitted document. For example, commonly-owned U.S. patent application Ser. No. 09/391,573, entitled “Methods and Apparatus for Print Scraping” describes systems and methods for automatically understanding and extracting information from such documents, but these systems and methods require each document to be pre-classified by document type, and they rely on the use of pre-created scripts that operate on a per-customer and/or per-document type basis to map the information contained therein. Additionally, commonly-owned U.S. patent application Ser. No. 09/391,773, entitled “Method and Apparatus for Network-Enabled Virtual Printing” describes systems and methods for capturing information from a document, compiling the captured information into a temporary file, and then communicating the captured information in the temporary file to a remote system where the information can be processed. However, this invention also relies on the use of pre-created scripts that operate on a per-customer and/or per-document type basis to map the information contained therein. It would be desirable to have systems and methods that did not impose such constraints on documents. For example, it would be desirable to have systems and methods that would allow documents to be submitted in any format (i.e., that would allow formats typically generated by commercially-available tools, as well as formats indicative of the financial industry, to be submitted).
It would be further desirable to have systems and methods that did not require the use of pre-created scripts to map the information contained therein, instead allowing the information to be automatically and dynamically understood by the system.
Additionally, systems and methods for mathematically decomposing table-structured financial documents exist, but they generally comprise identifying totals and subtotals in the documents by successively trying to add up sets of numbers therein. These existing systems and methods are inefficient in design, utilizing brute-force techniques to understand table structure instead of making use of textual information within the individual line items to improve efficiency and validate the results. Moreover, the existing systems and methods do not allow for identification and validation of a solution space, but instead allow identification of only the first obvious solution. This is often inadequate, since more complex mathematical structures can be solved by several variations of the mathematical manipulation of line items. It would therefore be desirable to have systems and methods that allow for the automatic mathematical decomposition of financial tables so that totals, subtotals and individual line items therein can be more effectively and more efficiently identified and validated than is currently possible. It would also be desirable to allow such information to thereafter be exported to tools that utilize such information to measure predetermined characteristics of the organization submitting the financial statement.
There are presently no suitable systems and methods available for allowing computers to automatically mathematically decompose table-structured financial documents. Thus, there is a need for such systems and methods. There is also a need for such systems and methods to automatically identify totals, subtotals and individual line items by finding matching values in the document. There is yet a further need for such systems and methods to be capable of partitioning the data values into predetermined sets (i.e., assets, liabilities, and shareholders' equity for balance sheets; operating expenses, other expenses, and other income for income statements; and net cash from operating activities, net cash from investing activities, and net cash from financing activities for cash flow statements). There is still a further need for such systems and methods to be capable of identifying subtotals via several alternative mathematical algorithms, such as by (1) summing all values and then subtracting the sum of successively larger sets of numbers until the result is equal to the corresponding total value, or by (2) summing successively larger sets of sequential line items from the financial statement, including all possible permutations of positive and negative values, to identify a set whose sum is equal to a following line item, and to do so in an efficient manner to allow optimal throughput. There is particularly a need for such systems and methods to be capable of automatically mathematically decomposing financial documents into totals, subtotals and individual line items that can then be exported to tools that utilize such information to measure predetermined characteristics of the organization submitting the financial statement.
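By way of illustration only, the second alternative noted above (summing successively larger sets of sequential line items under every combination of signs) might be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the function name `find_subtotal_runs`, the use of integer amounts, and the sample figures are all hypothetical.

```python
from itertools import product


def find_subtotal_runs(values):
    """Return (start, target, signs) for every run of sequential line items
    values[start:target] whose signed sum equals values[target].

    Hypothetical helper: amounts are assumed to be integers (e.g. cents)
    so that sums can be compared exactly.
    """
    matches = []
    for target in range(2, len(values)):
        # Try successively larger runs that end immediately before the target.
        for size in range(2, target + 1):
            start = target - size
            run = values[start:target]
            # Try every assignment of + / - signs to the items in the run.
            for signs in product((1, -1), repeat=size):
                if sum(s * v for s, v in zip(signs, run)) == values[target]:
                    matches.append((start, target, signs))
    return matches


# Hypothetical line items: cash, receivables, total current assets,
# equipment, accumulated depreciation, net fixed assets.
values = [100, 250, 350, 80, 20, 60]
for start, target, signs in find_subtotal_runs(values):
    print(start, target, signs)
```

On the sample figures, the search reports 100 + 250 = 350 and 80 - 20 = 60, and it can also report purely coincidental matches such as -250 + 350 - 80 = 20. The number of sign combinations doubles with each item added to a run, which is one reason efficiency and validation of candidate solutions matter.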
Additionally, there is a need for mathematical decomposition techniques that can extract and test multiple mathematical decompositions of a single financial statement in order to choose the one solution that correctly represents the mathematical construct intended by the authors of the financial statement. There is yet a further need for systems that can manage multiple solutions for a given financial statement, progressively extracting and validating the most obvious solution first by utilizing other non-numeric information contained within the document, to provide the maximum likelihood of success and optimum processing performance. Many other needs will also be met by this invention, as will become more apparent throughout the remainder of the disclosure that follows.
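As a purely illustrative sketch of how multiple candidate decompositions might be ordered so that the most obvious one is validated first, the example below ranks candidates using the text of the line-item labels. Everything here is an assumption made for illustration: the function name `rank_candidates`, the keyword list, the tie-breaking rules, and the hard-coded sample candidates are hypothetical and are not taken from this disclosure.

```python
def rank_candidates(candidates, labels, keywords=("total", "net")):
    """Order candidate decompositions, most plausible first.

    Each candidate is (start, target, signs): the signed sum of line items
    start..target-1 equals line item `target`. The ranking rules are assumed:
    prefer targets whose label contains a keyword, then fewer negated items,
    then shorter runs. Keyword matching is a crude substring test.
    """
    def sort_key(candidate):
        start, target, signs = candidate
        label = labels[target].lower()
        has_keyword = any(k in label for k in keywords)
        negated = sum(1 for s in signs if s < 0)
        # False sorts before True, so keyword-bearing targets come first.
        return (not has_keyword, negated, len(signs))

    return sorted(candidates, key=sort_key)


labels = [
    "Cash",
    "Accounts receivable",
    "Total current assets",
    "Equipment",
    "Accumulated depreciation",
    "Net fixed assets",
]
# Hypothetical candidates for the amounts 100, 250, 350, 80, 20, 60.
# The first is an arithmetic coincidence: -250 + 350 - 80 = 20.
candidates = [
    (1, 4, (-1, 1, -1)),
    (0, 2, (1, 1)),
    (3, 5, (1, -1)),
]
ranked = rank_candidates(candidates, labels)
print(ranked)
```

With these assumed rules, the candidates that end on "Total current assets" and "Net fixed assets" are ranked ahead of the coincidental match that ends on "Accumulated depreciation".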