1. Field of the Invention
The present invention generally relates to software systems for gathering, storing, and retrieving data. The invention more specifically relates to such software systems wherein the data is text based. Yet more specifically, the invention relates to systems for gathering, storing and retrieving corporate disclosure information.
2. Related Art
The current practice for the gathering, storage, analysis and retrieval of data concerning corporate disclosures revolves around the Electronic Data Gathering, Analysis and Retrieval (EDGAR®) system of the U.S. Securities and Exchange Commission (SEC). EDGAR performs automated collection, validation, indexing, acceptance and forwarding of submissions made by companies and others who are required by law to file various documents with the SEC. The primary purpose of EDGAR is to increase the efficiency and fairness of the securities market for the benefit of investors, corporations and the economy as a whole by accelerating the receipt, acceptance, dissemination and analysis of time sensitive corporate information filed with the SEC.
The success of a system such as EDGAR in accomplishing the SEC's goals depends to some extent on whether all required submissions are, or even can be, filed using the system. The success also depends on the ease with which output reports may be generated and the variety and nature of the inquiries which may be made to generate reports using the system. Finally, success of the system will depend on the ease with which accurate submissions, referred to as disclosure documents, may be drafted by those companies and persons required to do so.
In order to ensure that all the data required by the SEC is collected, and further to ensure that this data is searchable, once entered into the EDGAR system, the SEC requires that EDGAR filings be done using standard forms. A list of the standard forms presently accepted for EDGAR filing is included in Appendix A. The dramatic proliferation of form types illustrated by Appendix A is a result of requiring filings to be rigidly standardized. A standard form or variant must be created for every conceivable situation.
Presently, an individual preparing a disclosure document for filing uses a general purpose word processor to create one of the form types accepted for electronic filing. The completed form is filed using an electronic mail system (E-MAIL). However, the EDGAR system provides no methods or processes for companies to uniformly perform quality control, tracking, analysis or process support while preparing a document for filing. The failure to control this part of the filing process can have serious consequences because the EDGAR system permits only the submission of disclosure documents, not the amendment or alteration of disclosure documents.
Since statute and regulation require a large number of filings from a large number of entities, the EDGAR database has grown to enormous proportions. As a result of the size of the EDGAR database, and as a consequence of inconsistencies with respect to how different entities report similar matters, it is inherently intractable to analyze the EDGAR data in a meaningful way. Basic text searches can be performed, but meaningful data reduction is substantially hampered by inconsistencies and by the variety of reporting forms used to report similar information.
The various problems of the conventional corporate disclosure and repository system discussed above and such other problems as will now be evident to those skilled in this art are overcome by the present invention in which there is provided a software system for collecting corporate disclosure information in a free text form, generating a database of disclosure information and analyzing the database of disclosure information.
A software system according to some aspects of the present invention may be fixed in a machine-readable medium. The software system may include synthesis software tools which receive documents including freely formatted text documents and which produce a formatted database of information from the freely formatted text documents; and analysis software tools which receive the formatted database and which produce an analysis output. Variations on this system are possible.
For example, the synthesis tools may further include a concept dictionary relating a concept word root, a context word root and an instance word root; a parser which receives the documents and which produces a plurality of words contained in the documents; a rooter which receives the words parsed by the parser and which produces corresponding word roots for the words received; and a contexter which receives the concept dictionary and the word roots and which identifies a concept corresponding to each word root on the basis of the word roots received.
Also in accordance with this variation, the word roots may include a target word root and proximate word roots identified by the parser to have originated at a location in the document proximate the target word root. In that case, the contexter may further include a context recognizer which identifies in the concept dictionary all concepts having context word roots matching proximate word roots; and an instance recognizer which identifies in the concept dictionary all concepts previously identified by the context recognizer which also include an instance word root matching the target word root.
The analysis tools may further include a diagonalization tool which receives the formatted database and rearranges the formatted database to cluster similar topics together forming a grouped database in which each cluster of similar topics is a group; or a catalog defining each database entry as one of either required or optional.
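By way of illustration only, the synthesis tools described above (parser, rooter and contexter operating against a concept dictionary) might be sketched as follows. This is a minimal sketch under stated assumptions and not the disclosed implementation: the names `Concept`, `make_concept`, `parse`, `root`, `identify_concepts` and `DICTIONARY`, the suffix-stripping rule, the window size of three words, the sample dictionary entries, and the reading of "matching" as "at least one context word root in common" are all hypothetical choices made for the example.

```python
import re
from dataclasses import dataclass


def parse(text):
    """Parser: split a freely formatted document into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def root(word):
    """Rooter: naive suffix stripping (an assumed stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


@dataclass(frozen=True)
class Concept:
    name: str                 # concept word root
    context_roots: frozenset  # context word roots
    instance_roots: frozenset # instance word roots


def make_concept(name, context_words, instance_words):
    """Build a dictionary entry, rooting its words with the same rooter."""
    return Concept(
        name,
        frozenset(root(w) for w in context_words),
        frozenset(root(w) for w in instance_words),
    )


# Hypothetical concept dictionary with two entries that share the instance "suit".
DICTIONARY = [
    make_concept("litigation", ["court", "filed"], ["lawsuit", "suit"]),
    make_concept("apparel", ["wear", "tailor"], ["suit"]),
]


def identify_concepts(roots, index, dictionary, window=3):
    """Contexter: context recognizer followed by instance recognizer."""
    target = roots[index]
    # Proximate word roots: up to `window` roots on either side of the target.
    proximate = set(roots[max(0, index - window): index]
                    + roots[index + 1: index + 1 + window])
    # Context recognizer: concepts sharing at least one context root (assumption).
    by_context = [c for c in dictionary if c.context_roots & proximate]
    # Instance recognizer: of those, concepts whose instance roots include the target.
    return [c.name for c in by_context if target in c.instance_roots]


roots = [root(w) for w in parse("The company filed lawsuits in federal courts.")]
print(identify_concepts(roots, 3, DICTIONARY))
```

In this sketch the context recognizer narrows the candidate concepts before the instance recognizer is applied, mirroring the order given above, so an ambiguous instance word root is resolved by the word roots surrounding it.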
When a grouped database is produced, the analysis tools may further include an inferencer which receives the groups from the grouped database and which produces an inferenced database in which inferences are drawn on the basis of information present in and absent from the groups.
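By way of illustration only, one simple kind of inference from information present in and absent from the groups might be sketched as follows. The source does not specify how the diagonalization tool or the inferencer operate, so everything here is an assumption: grouping entries by a topic key stands in for diagonalization, the inference is limited to flagging entries that the catalog marks as required but that are absent from a group, and the names `group_by_topic` and `infer`, the `(topic, field, value)` entry layout and the sample data are hypothetical.

```python
def group_by_topic(entries):
    """Cluster (topic, field, value) entries by topic.

    A simple stand-in for the diagonalization tool; the actual
    clustering method is not described in the source.
    """
    groups = {}
    for topic, field, value in entries:
        groups.setdefault(topic, {})[field] = value
    return groups


def infer(groups, catalog):
    """For each group, list the fields present and the required fields absent."""
    findings = {}
    for topic, fields in groups.items():
        required = {
            name
            for name, status in catalog.get(topic, {}).items()
            if status == "required"
        }
        findings[topic] = {
            "present": sorted(fields),
            "missing_required": sorted(required - set(fields)),
        }
    return findings


# Hypothetical formatted database and catalog.
entries = [
    ("litigation", "court", "federal"),
    ("litigation", "status", "pending"),
    ("compensation", "officer", "CEO"),
]
catalog = {
    "litigation": {"court": "required", "status": "optional"},
    "compensation": {"officer": "required", "salary": "required", "bonus": "optional"},
}

for topic, finding in infer(group_by_topic(entries), catalog).items():
    print(topic, finding)
```

In this sketch an absent optional entry is ignored, while an absent required entry is reported against the group in which it was expected.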