The present invention relates generally to management of tabular data, and more particularly, to identification, extraction, interpretation and standardization of tabular data from unstructured documents.
Businesses generate a vast amount of information for internal and external consumption, and much of this information is typically included in unstructured documents. A large number of such unstructured documents contain critical data in the form of tables, such as financial statements. Often, businesses are required by law to furnish these documents for public consumption. The data in these documents needs to be extracted and structured in a database for research and analytical purposes. For example, all public companies in the US are required to file a variety of reports with the Securities and Exchange Commission [SEC]. These filings contain data that is crucial for the investment community and required for research, analysis and compliance purposes. Investment research firms and investors need to structure the data in these filings before it can be used.
By their very nature, unstructured documents make the process of identification, extraction and normalization of such tabular data extremely difficult. In most domains, these documents do not have universally accepted codes or structures that would facilitate the process of structuring the data in them. While there are many ways in which these documents can be made readable, e.g., by formatting them in the Portable Document Format [PDF], and accessible, e.g., via the World Wide Web, they are usually created using proprietary formatting and content representation preferences. Each company creates and formats the content of these documents as it sees fit. As a result, there is no way of electronically identifying the type of information contained in the documents.
For a computerized program to extract the desired information from the document, the table must be identified and the content within the table parsed and broken down into its constituent parts. Once the content in the table is recognized and broken down, it needs to be interpreted and standardized, as appropriate. Once the data is extracted, it will, in many cases, need to be normalized into a common format. There may be many such normalization formats, and new formats may evolve in specific fields. Current solutions for normalization typically hard-code the normalization logic in a programming language, making it difficult and expensive to change that logic over time.
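The maintenance problem described above can be illustrated with a minimal sketch. It is not drawn from the document and does not represent any particular product: the row labels, the standard field names, and the helper names `NORMALIZATION_RULES`, `parse_amount` and `normalize_rows` are all hypothetical. The sketch assumes a table has already been identified and parsed into (label, amount) pairs, and shows the normalization step with the mapping held as data, so that a new label variant is a one-line change to the rule table rather than a change to program logic.

```python
# Hypothetical illustration: normalization rules held as data, not code.
# The labels and standard field names below are invented for this sketch.
NORMALIZATION_RULES = {
    "total revenues": "revenue",
    "net sales": "revenue",
    "net income (loss)": "net_income",
    "net earnings": "net_income",
}


def parse_amount(text):
    """Convert a cell such as "$1,234" or "(56)" to a number.

    Parentheses are treated as a negative sign, a common accounting
    convention (an assumption of this sketch).
    """
    cleaned = text.strip().replace(",", "").replace("$", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = float(cleaned)
    return -value if negative else value


def normalize_rows(rows, rules):
    """Map (label, amount) pairs extracted from a table to standard fields.

    Rows whose label has no rule are returned separately for review
    instead of being silently dropped.
    """
    normalized, unmatched = {}, []
    for label, amount in rows:
        key = rules.get(label.strip().lower())
        if key is None:
            unmatched.append(label)
        else:
            normalized[key] = parse_amount(amount)
    return normalized, unmatched
```

If the same mapping were instead written as a chain of conditionals inside the program, each new label variant introduced by a document creator would require a code change, which is the cost the paragraph above attributes to current solutions.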
Current solutions for structuring tabular data in unstructured documents are largely manual or, at best, semi-automated. In the case of manual solutions, the data is re-entered into an RDBMS [Relational Database Management System]. For example, corporate fundamental information from public filings with the SEC is manually re-entered into an RDBMS and made available for the purpose of investment research. In a few cases, semi-automated solutions automate some portions of the process, typically by programming a pre-defined set of logic.
The current process of manual re-entry has two major problems. First, manual re-entry and validation is time-consuming and delays the availability of data. It is also expensive. Depending on the scope of the structuring exercise, a large number of people may need to be deployed to manually re-enter the information contained in these documents, which is then validated and made available for research and analysis purposes. Second, manual data entry is prone to errors and, despite significant efforts to ensure the quality of the structured data, results in poor data quality.
Semi-automated solutions with programmed pre-defined logic suffer from inflexibility, and are therefore unable to reflect rapid changes in business needs and the environment over time. It is expensive and time-consuming to reflect new logic in such solutions. For example, the document creator may change the formatting and/or the logical organization of the content from one period to the next. Also, since the SEC revises filing requirements routinely, each such revision may require changes to the processing logic.
The above-mentioned challenges are significant and suggest a critical need for a fast, flexible and accurate method for identifying, extracting, interpreting and standardizing tabular data in unstructured documents, which also has the capability to self-learn changes introduced by the creator of the document.