Appendixes A, B, C, and D, which are part of the present disclosure, consist of three sheets attached hereto and are listings of the software aspects of the preferred embodiment of the present invention.
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
1. Field of the Invention
The present invention generally relates to methods for recognizing and parsing information in a data file and, in particular, to a method for identifying information such as financial tables in a financial statement contained in an uncoded text file, and for parsing and decomposing the information into its constituent parts.
2. Description of the Prior Art
Financial statements of a number of U.S. public corporations are now available electronically from a number of sources and can be obtained via the internet. In the future, all corporations will be required under the law to file their financial statements electronically. A financial statement is required to contain certain tables of information such as balance sheets, income statements, and cash flow statements, and there may be information explaining the tables and other pertinent information regarding the company.
In the electronic format, a file containing the financial statement is typically uncoded, meaning that there are no codes in the file specifically indicating the type of information represented by each line or column of text. Although the file is typically in plain ASCII text, and ASCII text is conducive to reading by a person, it is not conducive to processing by a computer. In order to have the computer extract the desired information from the file, the content of the file must be identified, meaning that the various tables in the file must be recognized and the content within each table must be parsed and broken down into its constituent parts. Once the data is recognized and broken down, it can be normalized and manipulated. For example, the normalized data can be placed in a spreadsheet program or a database program, and the performance of the company can be illustrated and analyzed by various mathematical, statistical, or financial models. The relationships between various financial statement entries can be compared, and hypothetical situations can be generated and tested. Furthermore, industry analysis can be performed by gathering and collating data from the financial statements of several companies. Thus, there is great incentive for identifying and parsing the content of a file containing a financial statement.
There are two important considerations in the process of identifying and parsing a file containing a financial statement. The first consideration is speed; the second consideration is accuracy.
Once the financial statement of a company is released, it has an immediate impact upon the valuation of the stock of the company. It may also, when combined with information relating to other companies, impact the valuation of the industry. Thus, it is time-critical to have the financial statement available in a form that can be manipulated for analysis. Furthermore, if a large number of financial statements must be processed, the method for processing the statements must have reasonable computational speed. The financial statement must also be accurately recognized and processed. Inaccurate financial information can have a disastrous impact on the decision-making process. It is therefore important that means be available for facilitating timely and accurate analysis of the statements.
A method currently employed by a database company for processing financial statements requires that the information be categorized and manually entered. This is a labor-intensive process that is slow and prone to human error. Hence, there is a need for a fast and accurate method for recognizing and parsing files containing financial statements.
There are several problems associated with the processing of a file containing a financial statement. First of all, a file containing a financial statement would include tables such as balance sheets, income statements, and cash flow statements. These tables and their locations must be identified and the line items that compose these tables must be identified as well. Referring to FIG. 1a, a portion of an ASCII file containing a balance sheet is illustrated. Within each table, there may be several years of information set out in column form with column headers. The column headers and boundaries for each column need to be identified in order to identify the content of each column for each line item. Note that although the ASCII files may contain some codes indicated in angle brackets, these codes are not always present and are not sufficient as indicators for a program to properly parse the information in the files.
Another problem in the processing of the file is that each entry or line item in the table needs to be identified and recognized. Because the label of a line item in the table may be longer than one line of text, running over to two or more lines of text, the several lines of text need to be properly amalgamated to form the label.
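The amalgamation problem can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the procedure of the preferred embodiment. It assumes that a text-only line continues the label of the next line that carries numbers, and that a text-only line ending in a colon is a standalone heading. The names `amalgamate`, `to_number`, and `NUMBER` are hypothetical.

```python
import re

# A trailing token counts as a value if it looks like 1,200 or $1,200.50 or (300).
NUMBER = re.compile(r"\(?\$?\d[\d,]*(?:\.\d+)?\)?")


def to_number(token):
    """Convert a financial token to a number; parentheses mean negative."""
    negative = token.startswith("(")
    digits = token.strip("()").replace("$", "").replace(",", "")
    value = float(digits) if "." in digits else int(digits)
    return -value if negative else value


def amalgamate(lines):
    """Group raw text lines into (label, values) line items."""
    items, pending = [], []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        # Peel numeric tokens off the right-hand end of the line.
        values = []
        while tokens and NUMBER.fullmatch(tokens[-1]):
            values.insert(0, to_number(tokens.pop()))
        label = " ".join(tokens)
        if values:
            items.append((" ".join(pending + [label]).strip(), values))
            pending = []
        elif label.endswith(":"):
            # Assumption: a colon-terminated text line is a heading, not a continuation.
            items.append((label, []))
        else:
            pending.append(label)
    if pending:  # trailing text with no numbers becomes a label-only item
        items.append((" ".join(pending), []))
    return items
```

For example, the two lines "Property, plant and equipment, net of" and "accumulated depreciation 5,400 5,100" would be combined into a single line item with two values. A label that itself ends in a number would be split incorrectly by this sketch, which is one reason a practical method needs column information as well.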
After the entries for a table have been identified, the components of the table and the relationships among the components need to be identified. One approach to this problem is to parse the mathematical structure of the table. In the prior art, parsing typically starts from the top of the table and proceeds to the bottom of the table. This approach proves to be time-consuming, and the results produced are unsatisfactory. If a mistaken assumption is made at the beginning of the parsing process, it may not be discovered until further down the table, wasting previous efforts. In addition, the number of permutations of parsing path possibilities for this approach is quite large.
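To make concrete what is meant by the mathematical structure of a table, the following minimal Python sketch marks each line item that equals the sum of a run of immediately preceding items. It illustrates the kind of relationship to be recovered and is neither the prior-art procedure nor the parsing method of the invention. The function name `find_totals` and the sample figures are hypothetical.

```python
def find_totals(values):
    """Map each total's index to the (start, end) slice of items that sum to it.

    Only contiguous runs of at least two immediately preceding items are
    considered, and the shortest matching run is kept.
    """
    totals = {}
    for i in range(2, len(values)):
        running = 0
        for j in range(i - 1, -1, -1):
            running += values[j]
            if i - j >= 2 and running == values[i]:
                totals[i] = (j, i)
                break
    return totals


# Hypothetical single column of a balance sheet:
# cash, receivables, total current assets, fixed assets, total assets
sample_values = [1200, 800, 2000, 5400, 7400]
structure = find_totals(sample_values)
```

Here item 2 is found to be the sum of items 0 and 1, and item 4 the sum of items 2 and 3, which reflects a subtotal nested inside a total. The sketch ignores subtraction, non-contiguous components, and rounding differences, all of which occur in real statements.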
After the components making up the table are verified by the parsing process, the components composing the table must be identified and categorized so that the computer can properly process the data.
It is therefore an objective of the present invention to provide an automated method for identifying financial statements stored in uncoded electronic format such as an ASCII file.
It is another objective of the present invention to provide an automated method for identifying financial tables such as balance sheets, income statements, and cash flow statements of a financial statement stored in uncoded format.
It is yet another objective of the present invention to provide an automated method for identifying the line items that compose a financial table.
It is still another objective of the present invention to provide an automated method for amalgamating several lines of text to form the label of a line item.
It is still another objective of the present invention to provide an automated method for parsing the mathematical structure of a financial table.
It is still another objective of the present invention to provide an automated method for recognizing the components of the tables.
Briefly, a preferred embodiment of the present invention provides a process for processing a file containing a financial statement in uncoded format such as a financial statement stored in an ASCII file. Referring to FIG. 2, the starting locations of the tables in the financial statement as indicated by their table titles are identified (block 10). When all the table titles are identified, a table title is then selected for processing (block 12). Typically after the table title, there are the associated column headers for the table, and they are analyzed and determined (block 14). After the column headers, there are lines of text that need to be differentiated into line items, where each line item is composed of a label and/or one or more numbers corresponding to the label (block 16). With these line items, the next task is to parse these line items to verify that these line items make up the table and to identify the components of the table (block 18). If the lines are successfully parsed, the components of the table can be identified and categorized (block 20).
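The first step of FIG. 2, identifying table titles (block 10), can be sketched as a keyword scan. This is a minimal Python illustration and not the implementation listed in the appendixes. The phrase lists, the `max_len` cutoff, and the names `TITLE_KEYWORDS` and `find_table_titles` are assumptions chosen for the example.

```python
# Hypothetical title phrases; a real system would need a broader vocabulary.
TITLE_KEYWORDS = {
    "balance_sheet": ("balance sheet",),
    "income_statement": (
        "income statement",
        "statement of income",
        "statements of income",
    ),
    "cash_flow": ("cash flow",),
}


def find_table_titles(lines, max_len=80):
    """Return (line_index, table_kind) for lines that look like table titles."""
    hits = []
    for index, line in enumerate(lines):
        text = " ".join(line.split()).lower()  # normalize whitespace and case
        if not text or len(text) > max_len:
            continue  # skip blanks and long prose lines
        for kind, phrases in TITLE_KEYWORDS.items():
            if any(phrase in text for phrase in phrases):
                hits.append((index, kind))
                break
    return hits
```

Each hit gives a candidate starting location from which the column headers and line items of that table would then be read. A short prose sentence that mentions a balance sheet would also match, so later steps would be expected to reject candidates that are not followed by a parsable table.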
The present invention is implemented using the programming language PROLOG. However, it is to be understood that the present invention is not limited to the programming language utilized.
An advantage of the present invention is that it provides a method for identifying the constituent parts of financial statements presented in uncoded format such as an ASCII file.
Another advantage of the present invention is that it provides a method for identifying financial tables such as balance sheets, income statements, and cash flow statements of a financial statement stored in uncoded format.
Yet another advantage of the present invention is that it provides a method for identifying the line items that compose a financial table.
Still another advantage of the present invention is that it provides a method for amalgamating several lines of text to form the label of a line item.
Still another advantage of the present invention is that it provides a method for deriving the mathematical structure of a table.
Still another advantage of the present invention is that it provides a method for recognizing the components of the tables.