1. Technical Field
The invention relates to a method and apparatus for recognizing and parsing information in a data file. More particularly, the invention relates to an easily edited method and apparatus for parsing dissimilar data to provide a consistent format output.
2. Description of the Prior Art
Computers are increasingly being used to store, manipulate and transfer data. It is therefore critically important to be able to provide this data in a format that can be readily accessed by computer hardware and software systems. Unfortunately, while most commonly used forms of record data, such as financial statements, have their own internal structures, there is no universal standardized format.
In the past, data from such dissimilar, non-standardized tables has been manually transferred to consistent and compatible formats. However, it has been difficult to efficiently automate the process of providing a consistent format computer output from different record data forms, such as tabular data.
A typical electronic file containing, for example, a financial statement, is uncoded. Thus, there are no codes specifically indicating the type of information represented by each line or column of text. To have a computer extract information from the file, the content of the file must be identified. The various tables in the file must be recognized, and the content of each table parsed and broken down into constituent parts. Once the data has been recognized and broken down, it can be normalized and manipulated.
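As an illustration of breaking an uncoded line into its constituent parts, the following is a minimal sketch only. It assumes that columns are separated by runs of two or more spaces and that negative amounts are written in parentheses. The function names to_number and parse_line are hypothetical and do not come from any referenced method.

```python
import re


def to_number(token):
    """Convert a token such as '1,234', '(56)' or '$7.5' to a float.

    Returns None when the token is not numeric. Treating parentheses as a
    negative sign is an assumption about the input format.
    """
    text = token.strip()
    negative = text.startswith("(") and text.endswith(")")
    text = text.strip("()").replace("$", "").replace(",", "")
    try:
        value = float(text)
    except ValueError:
        return None
    return -value if negative else value


def parse_line(line):
    """Split one uncoded text line into a label and its trailing numeric columns."""
    # Assumption: columns are separated by two or more whitespace characters.
    cells = [c for c in re.split(r"\s{2,}", line.strip()) if c]
    values = []
    # Peel numeric cells off the right-hand side; what remains is the label.
    while cells:
        number = to_number(cells[-1])
        if number is None:
            break
        values.insert(0, number)
        cells.pop()
    return " ".join(cells), values
```

Under these assumptions, parse_line("Total revenue      1,234    (56)") yields the label "Total revenue" with the values 1234.0 and -56.0, while a line of plain text yields an empty list of values.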
Such normalized data is readily accessible by spreadsheet or database programs, or can be illustrated and analyzed by mathematical, statistical, or financial models. Financial statement entries can also be compared and analyzed across specific divisions or companies, or throughout an entire industry.
Time and accuracy are important considerations in the preparation of financial statements. Computers can process financial data much faster than it can be processed by hand. However, inaccurate information can have a disastrous impact on a company's financial condition. The computerized method must therefore provide either accurate data, or a method for quickly locating and correcting incorrect data.
Ferguson and Kornfeld, A Method For Electronically Recognizing and Parsing Information Contained in a Financial Statement, U.S. patent application Ser. No. 08/497,355, filed Jun. 30, 1995 and incorporated as a part hereof, describes an algorithm for computerized parsing of financial data. The Ferguson and Kornfeld method uses what its authors call a "bottom-up" parser algorithm to recognize data lines from a financial statement. The data lines are then reorganized into a consistent electronic format.
The Ferguson and Kornfeld method is specifically adapted for parsing financial statements such as income statements, balance sheets and cash flow statements. Table titles, columns, and line items are identified, and the end of the table is located. Their bottom-up parser processes the line items from the bottom of the table to the top of the table. This bottom-up algorithm uses at least two tests to determine whether constituent line items are to be marked as a block containing the value of the subtotal. If one or more subtotals are located, it is necessary to make another pass through the data to find higher order subtotals.
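The general idea of a bottom-up subtotal search can be illustrated with the sketch below. It is not the Ferguson and Kornfeld algorithm, whose tests are not detailed here. It assumes a single test only, namely that a subtotal equals the sum of the two or more line items immediately above it, and it makes a single pass. The name find_subtotal_blocks and the tolerance parameter are hypothetical.

```python
def find_subtotal_blocks(values, tolerance=0.005):
    """Scan line-item values from the bottom of a table to the top.

    Returns (start, end) index pairs such that values[end] equals the sum of
    values[start:end] within the tolerance. Only the sum test is applied;
    this is an illustrative assumption, not a complete set of tests.
    """
    blocks = []
    i = len(values) - 1
    while i > 0:
        running = 0.0
        found = None
        # Accumulate the lines directly above the candidate subtotal.
        for start in range(i - 1, -1, -1):
            running += values[start]
            # Require at least two constituent lines before accepting a match.
            if start < i - 1 and abs(running - values[i]) <= tolerance:
                found = start
                break
        if found is not None:
            blocks.append((found, i))
            i = found - 1  # continue above the block just marked
        else:
            i -= 1
    return blocks
```

For the values 100, 200, 300, 50, 25, 75, this sketch returns the blocks (3, 5) and (0, 2), in that order, because the search starts at the bottom. Finding a higher order subtotal, such as a grand total of 375 following these lines, would require a further pass over the table with each marked block collapsed to its subtotal, which the sketch does not attempt.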
However, various problems such as incorrect numerical values, sloppy formatting, and inaccurate title formatting may prevent the parsing algorithm from correctly processing the record data. These deficiencies in the input data will occasionally cause the parser to fail. A minor edit to the source document can often fix the document so that it can be parsed correctly. However, Ferguson and Kornfeld's parsing algorithm does not provide any feedback on why or at what point in the source document the parser failed. Thus, the problems must be manually located.
It would therefore be an advantage to provide a method for parsing data and thereby rendering a consistent format output. It would be a further advantage if such method were adapted for use with an editor interface. It would be yet another advantage if such method provided information to assist the user in detecting problems that cause parsing failure, and activated the editor feature to permit the user to locate and correct such problems.