1. Field of the Invention
The present invention generally provides click-thru capability in electronic media, including, without limitation, unstructured Hyper-Text Markup Language (HTML) files, Portable Document Format (PDF) files, and unstructured text files.
2. Description of the Related Art
All references cited in this specification, and their references, are incorporated by reference herein where appropriate for teachings of additional or alternative details, features, and/or technical background.
Many important decisions are made on the basis of information gleaned from various sources. For example, financial information is often extracted from a number of sources. Investors, auditors, analysts and creditors often depend on such financial information for making investment, credit, advice and resource decisions. Optimally, any financial report should be verifiable, understandable and material. As would be understood, information that is misrepresented in, or absent from, a financial report could have far-reaching implications for people depending on the information. Incomplete or erroneous data could result in significant financial loss. The efficient collection and auditing of data regarding a company is therefore of paramount interest to investors and creditors.
As in any information gathering, some sources of financial information are considered more reliable than other sources. For example, in light of Sarbanes-Oxley and other financial reporting legislation, financial reports of public companies to national regulatory agencies are considered by many to be generally trustworthy. The difficulty with such reports is that they are often complex, preventing the reviewer from quickly gleaning the data needed to reach a fully informed decision. Further, as such reports are mandated only at set points in time, information gleaned from a federally mandated securities filing may be inadequate at a point in time remote from the filing date of the report.
In order to provide persons with readily digestible, pertinent and timely information, a number of organizations are involved in digesting information from multiple sources of data and displaying such information in a user-friendly manner. Such synopsized information may be garnered from disparate sources, or may be calculated from information garnered from disparate sources or from the same source, in a manner that may not be wholly evident from the presentation made to the viewer. While the source of the information may be denoted in footnotes, etc. to the synopsis, because of the time involved in retrieving and reviewing such original sources, most reviewers rely almost wholly upon the information that is portrayed to them.
For example, data for any particular financial summary may be gleaned from hundreds of pages of financial performance data that are compiled and published multiple times per period. The conventional practice of transferring and collecting data from electronic documents typically requires manually typing data into a new document or, if the source document permits it, performing a traditional cut-and-paste operation. Both of these methods are error-prone (with respect to cut-and-paste operations, for example, a failure to cut a single digit of a number may have an order-of-magnitude effect on an overall financial view). Time spent performing these intensely manual processes would be better spent auditing the data rather than performing costly administrative tasks in support of such operations. As would be understood, with so much data to manually collect and audit, errors in transcribing and copying data can hardly be eliminated in any financial compilation. Further, a failure to fully understand the source of information, or the manner in which it was generated, may have serious unintended consequences in decisional matters.
Public companies worldwide are often required by their national laws to produce and publish financial statements so individuals and institutions can make reasonable decisions regarding their relationships with public companies. The majority of this reporting is accomplished by submitting electronic documents to the appropriate government regulatory authorities, such as the U.S. Securities and Exchange Commission. The electronic document format acceptable to different regulatory authorities differs between countries. Presently, such documents may take the form of Portable Document Format (PDF) native files, Portable Document Format (PDF) image files, structured Hyper-Text Markup Language (HTML) documents, unstructured text files and the like. The documents may additionally be heavily formatted for presentation purposes.
Hyper-Text Markup Language (HTML) is a language for the presentation of electronic documents. It is a markup language defining the structure and layout of a page, such as a web page used on the World Wide Web. By use of tags and attributes, a page is assembled to convey a document in a specific format designated by the author. HTML documents were originally intended to facilitate textual presentation using a cross-platform protocol when browsing the Internet.
Portable Document Format (PDF), the de facto standard for file exchange, is a self-contained, cross-platform document format similar to HTML. PDF documents differ in that they are intended to appear the same whether on paper or on screen, regardless of the computer or printer involved. PDF and HTML documents may both contain images. Unlike HTML documents, however, PDF documents may be highly compressed. Image files, such as those provided for by Portable Document Format (PDF) image files, do not presently provide “cut-and-paste” functionality for the overlying data. PDF documents may be either a “native PDF” file or a scanned image PDF file. Native PDF files are scannable and capable of being printed without the need for PostScript conversion. Native PDF files are also searchable and are of significantly smaller file size than scanned image PDF files (which must be printed through a PostScript conversion). Some agencies, such as the Municipal Securities Rulemaking Board (MSRB), allow either native PDF or image PDF filings.
Other than by footnoting or keying in the source of the information, current electronic document data extraction methods do not provide means for collecting and managing the location from which the data was originally sourced by an analyst. Data from an electronic source document presented as an image file, native PDF, etc. must be manually transferred to the new document and manually referenced for auditing purposes. The ability to present an audit function or “click-thru” capability is unknown, particularly with respect to image files, non-structured text and HTML, and PDF documents.
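By way of illustration only, the notion of retaining the location from which a data value was sourced may be sketched as follows. This is a minimal sketch under stated assumptions and is not a description of any disclosed embodiment: the names SourceLocation, ExtractedValue, extract_value and click_thru, the use of character offsets as the location, and the sample text are all hypothetical and do not appear in the original specification.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceLocation:
    """Hypothetical record of where a value was found in a source document."""
    document: str  # assumed document identifier, e.g. a file name
    start: int     # character offset where the value begins
    end: int       # character offset just past the value


@dataclass(frozen=True)
class ExtractedValue:
    """A numeric value kept together with its source location."""
    label: str
    value: float
    location: SourceLocation


def extract_value(document: str, text: str, label: str) -> Optional[ExtractedValue]:
    """Find the first number following `label` and record where it came from."""
    pattern = re.escape(label) + r"\D*?(\d[\d,]*(?:\.\d+)?)"
    match = re.search(pattern, text)
    if match is None:
        return None
    raw = match.group(1)
    location = SourceLocation(document, match.start(1), match.end(1))
    return ExtractedValue(label, float(raw.replace(",", "")), location)


def click_thru(item: ExtractedValue, text: str, context: int = 20) -> str:
    """Return the source passage surrounding the recorded location."""
    loc = item.location
    begin = max(0, loc.start - context)
    return text[begin:loc.end + context]


if __name__ == "__main__":
    sample = "Total assets: 9,870,000\nNet revenue: 1,234,500 (in USD)\n"
    item = extract_value("filing.txt", sample, "Net revenue")
    if item is not None:
        print(item.value, item.location)
        print(click_thru(item, sample))
```

In this sketch the extracted value carries its source location with it, so a reviewer may return from the summarized figure to the passage in the source text without any manual footnoting. Character offsets are only meaningful for text-based sources; an image-based source would require a different form of location, such as a page number and region.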
Through the embodiments described herein, there is disclosed a method and system to capture click-thru data from electronic media, such as documents used for the collection, analysis and auditing of financial information. The methods and systems described are not presently available.