The present invention is directed to a method for extracting raw data from documents including textual information and data information.
It is no secret that we are truly living in the information age. Information is being generated at ever higher rates every day. For many years, this information was recorded only in paper form. However, with the advent of computers, in many instances this information never even finds its way onto the printed page. Rather, this information is electronically generated and stored in the memory of a vast array of computers.
Although this information can take many forms and can be used for many purposes, due to the regulatory nature of our society, governmental requirements necessitate the compilation and publication of documents relating to, among other things, the business community. For example, the Securities and Exchange Commission (SEC) requires the compilation and publication of various statistics relating to a company's status. This documentation is generally promulgated on a periodic basis and would include files such as 10-Q or 10-K financial documents which are made available to the public.
As can be appreciated, due to the periodic nature of these publications, the entries included in these and other financial or other types of documents are fairly standard. For example, most of these documents would include one or more lines of textual material and one or more columns of data associated with each of the lines of textual material. Therefore, to properly use the information included in these financial documents, the information contained therein should be scanned into the memory of a computer. In such a scanning process, the data associated with the textual strings (usually in tabular form) must be extracted from the financial document in a manner which is effective and accurate.
A number of prior art U.S. patents are directed to various systems and methods of extracting tables from printed documents. One such patent is U.S. Pat. No. 5,956,422, issued to Alam. This patent describes a processor utilizing a method for recognizing, capturing and storing tabular data as a pixel-format document image or as formatted text. The pixel-format document image may then either be directly processed to locate tabular data or may be processed by an optical character recognition (OCR) system to obtain the formatted text. After locating the tabular data either in a received pixel-format document image or in the formatted text, the tabular data is extracted directly from cells present in either form of digital computer data, or the tabular data located in the pixel-format document image may first be processed by the OCR system to obtain formatted text before extracting the tabular data. As illustrated with respect to FIGS. 2a, 2b, 2c and 2d, the purpose of this patent is merely to locate the area of a document in which the tabular data is present and then extract the data from that document. Although the document does contain textual material, the actual textual material is irrelevant to the extraction process.
U.S. Pat. No. 5,953,730, issued to Schawer, shows a system for manipulating spreadsheet program data which appears in tabular format.
U.S. Pat. No. 5,033,009, issued to Dubnoff, describes a method for automating the production of worksheet files used by an electronic spreadsheet program. As shown in FIG. 1, a worksheet file generator 30 operates in response to pattern data 32, variable data 34 and command data 36. However, it would appear that this patent is directed to a method of formulating the electronic spreadsheet and not extracting data from that spreadsheet.
The deficiencies of the prior art are addressed by the present invention which is directed to a method and system for extracting identified data from text blocks, usually included in columns of numbers associated with particular character string definitions. A number of iterative passes are made of a particular document to accurately extract the data schedule as well as the particular data associated with the character strings of a data schedule.
Although the present invention is directed to extracting data from raw SEC documents such as 10-Q or 10-K financial documents which have been, for example, downloaded from a particular website, the present invention is not to be construed as being so limited and would have applicability to any type of document in which one or more columns of numerical data is associated with textual character strings provided in a separate column.
As can be appreciated, many financial documents are published on a periodic basis. Each new edition of such a document for a particular company would be very similar to previous editions. Therefore, the present invention would utilize a system in which information previously extracted in prior data reporting periods would be used to search the newly downloaded document for corresponding, or very similar, textual character strings. This similarity extends to specific data schedules as well as to similar textual strings produced in each of the data schedules. Once the newly downloaded document has been properly searched, using previously parsed data schedules, the specific financial data schedules, such as the balance sheet, income statement and cash flow statement, located within a large aggregate data file would be extracted and stored in data schedule text files as well as in tabular files including numerical information. This process is an iterative one, and an operator will be used to physically review portions of the documents in which no character string match has occurred. Once the data schedule is broken into its descriptive text section and its tabular numerical data section, this material can be extracted from the raw document, verified, and updated if necessary.
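The search for corresponding or very similar character strings described above can be illustrated with a minimal sketch. It is not the method of the invention itself: the similarity measure (the ratio from Python's standard difflib module), the 0.85 threshold, and the names normalize and match_labels are all assumptions chosen for illustration. Strings from the new document that match no previously extracted string are set aside for operator review.

```python
from difflib import SequenceMatcher


def normalize(label):
    # Lowercase and collapse whitespace so cosmetic differences do not block a match.
    return " ".join(label.lower().split())


def match_labels(prior_labels, new_labels, threshold=0.85):
    """Pair each new label with its most similar prior label.

    Returns (matches, needs_review). The threshold is an assumed value,
    not one taken from the specification.
    """
    matches = {}
    needs_review = []
    for new in new_labels:
        best_label, best_score = None, 0.0
        for prior in prior_labels:
            score = SequenceMatcher(None, normalize(new), normalize(prior)).ratio()
            if score > best_score:
                best_label, best_score = prior, score
        if best_score >= threshold:
            matches[new] = best_label
        else:
            # No sufficiently similar prior string: leave it for the operator.
            needs_review.append(new)
    return matches, needs_review


# Hypothetical labels from a prior reporting period and from a new document.
prior = ["Total current assets", "Accounts receivable, net", "Net income"]
new = ["Total Current Assets", "Goodwill"]
matches, needs_review = match_labels(prior, new)
```

A lower threshold would send fewer strings to the operator at the cost of more false matches, so the value would need to be tuned against real filings.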
An initial text matrix is created containing a row for each row of the data schedule containing a data item. Three columns are associated with each of the rows: one column containing the data, a second column including a database reference number, and a third column containing a unit value indicating the sign of the data. The rows of the text matrix are provided on a first plane, and the three columns which produce the data matrix are also provided on a first plane. The text strings of succeeding documents are searched by comparing them to the text strings of the text matrix of the initial document. If a match is found, corresponding information is provided in a second data matrix included in a second plane having the same row and column numbering as the first document. No corresponding character string would be included in the appropriate row in a second plane text matrix. Variations of the text string included in the first text matrix plane would be provided in the appropriate location in a second or subsequent text matrix plane. Completely new text strings would also be provided in a new row in the first text matrix plane. Subsequent screening of additional documents would result in the creation of additional text matrix planes and data matrix planes.
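One possible in-memory layout for the text matrix planes and data matrix planes is sketched below. It is only an interpretation of the description above under stated assumptions: the class name SchedulePlanes, the use of None for empty cells, the case-insensitive comparison used to recognize a variation, and the storage of the data as a magnitude plus a sign of +1 or -1 are all hypothetical choices, not details taken from the specification.

```python
def normalize(label):
    # Assumed matching rule: labels are equal up to case and spacing.
    return " ".join(label.lower().split())


class SchedulePlanes:
    def __init__(self):
        # text_planes[p][r]: label text stored for row r in plane p, or None.
        self.text_planes = []
        # data_planes[p][r]: (magnitude, db_ref, sign) for row r in plane p, or None.
        self.data_planes = []

    def _find_row(self, label):
        # Search the first text plane for a row whose label matches.
        key = normalize(label)
        for r, known in enumerate(self.text_planes[0]):
            if known is not None and normalize(known) == key:
                return r
        return None

    def _add_row(self):
        # A new row is added to every plane so all planes keep the same shape.
        for plane in self.text_planes + self.data_planes:
            plane.append(None)
        return len(self.text_planes[0]) - 1

    def add_document(self, rows):
        """rows: iterable of (label, value, db_ref) for one document."""
        n = len(self.text_planes[0]) if self.text_planes else 0
        self.text_planes.append([None] * n)
        self.data_planes.append([None] * n)
        p = len(self.text_planes) - 1
        for label, value, db_ref in rows:
            r = self._find_row(label)
            if r is None:
                # Completely new string: new row in the first text plane.
                r = self._add_row()
                self.text_planes[0][r] = label
            elif label != self.text_planes[0][r]:
                # Variation of a known string: record it in this plane.
                self.text_planes[p][r] = label
            sign = -1 if value < 0 else 1
            self.data_planes[p][r] = (abs(value), db_ref, sign)


# Hypothetical rows from two reporting periods.
store = SchedulePlanes()
store.add_document([("Total current assets", 1200.0, 101), ("Net loss", -50.0, 102)])
store.add_document([("Total Current Assets", 1350.0, 101), ("Goodwill", 75.0, 103)])
```

After the second call, the first text plane holds three rows, the second text plane holds only the variant spelling in row 0, and each data plane holds one triple per row for which that document supplied a value.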
It is to be understood that both the foregoing general description and the following detailed description of the present invention are exemplary and explanatory only and are not restrictive of the invention as claimed.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate the present invention and, together with the description, serve to explain the principles of the invention.