The present invention relates to information retrieval systems and, in particular, to "report mining" systems for processing report-based data so that the data is susceptible of electronic access and interrogation.
Much business information is contained in reports of many different types. For example, such reports may include external reports for use in communicating with the outside world, such as invoices, statements, purchase orders, financial reports and the like, and internal reports for use in management of the business. While such reports may be presented in printed media, they are typically generated with the aid of computers and are essentially page-based documents designed to present function-related information in a format easily understandable by the end user for satisfying the user's requirements. Thus, space-saving techniques are commonly used in the design of reports to fit them on a printed page. For example, headers are printed only at the beginning of a section or the top of the page; transactions of a particular type may be grouped together and labeled only once, etc. To make sense of the report, the end user mentally links these various pieces of information together as he or she reads.
Computer storage of such reports can be effected through a technology known as Computer Output to Laser Disk ("COLD") storage, but this technique treats computer reports under the same paradigm as any scanned document, i.e., the page paradigm. Formerly, report pages were often converted to a picture format (such as TIFF), which takes up a great deal of storage space. Today, most COLD systems continue under the page paradigm even though the format restriction has been lifted, since storing a page as a binary spool file requires much less space than storing it in a picture format. When page-based COLD storage systems are asked to find a transaction that meets certain criteria, the computer retrieves either the line that relates to the header or the line that relates to the transaction. It is unable to link the two to put the information into its full context.
Furthermore, much information which is buried in reports is simply unavailable to computer access because, unlike a relational database, the report-based data is not organized in an easily searchable manner. It is possible to reorganize report information by rekeying it into other database-type systems for analysis, but this is an expensive and time-consuming process. Furthermore, the resulting database, while having many promising attributes for information retrieval, is designed to optimize the performance of on-line transaction processing systems, and not to support an end user's ad hoc problem-solving tasks. Also, relational databases typically lack the query tools necessary to empower end users, since they require a comprehension of the technical data schema of the database, which usually requires the services of a database expert.
Accordingly, report-mining systems have been provided which essentially process report-based data into a virtual database, which permits the data to be accessible for query by ordinary end users, as if the data were in a database, while retaining the inherent logic of the report design and the look and feel of the picture or image format of the report. One such report mining system is provided by Microbank Software, Inc. under the trade designation "STORQM 2.X". This system is based on the premise that there exists an organizational hierarchy in a report, i.e., that all of the data in a report appear in a structured fashion and are related by being in the same report. Thus, related data fields typically appear together on a report in a pattern of fields. The system defines a pattern as being a set of contiguous data fields, i.e., a block of data that can be defined over many contiguous lines, a single line or a portion of a line. Patterns can also be defined at certain pre-specified and fixed locations on a report page. The system operates to identify and define these patterns and the organizational hierarchy, or "view", that exists among them in a report, and then utilizes the pattern definitions and the hierarchy to create virtual "records" that can be derived in response to queries.
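The general idea of patterns, a hierarchy among them, and virtual records can be illustrated with a minimal Python sketch. It is not the implementation of the system named above. The sample report, the two regular expressions and the function name virtual_records are all hypothetical, and the hierarchy is assumed to be the simplest possible one, in which each transaction line belongs to the most recent header line.

```python
import re

# Hypothetical sample report: the account header is printed once,
# and the transaction lines beneath it do not repeat it.
REPORT = """\
ACCOUNT 1001  SMITH J
  01/05  DEPOSIT      250.00
  01/09  WITHDRAWAL    40.00
ACCOUNT 1002  JONES A
  01/07  DEPOSIT      100.00
"""

# Two illustrative single-line patterns (assumed formats, not from the source).
HEADER = re.compile(r"^ACCOUNT (?P<account>\d+)\s+(?P<name>.+?)\s*$")
TXN = re.compile(
    r"^\s+(?P<date>\d\d/\d\d)\s+(?P<kind>\w+)\s+(?P<amount>[\d.]+)\s*$"
)


def virtual_records(text):
    """Yield one record per transaction, joined with its enclosing header."""
    current_header = None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            # Remember the header so later transaction lines inherit it.
            current_header = m.groupdict()
            continue
        m = TXN.match(line)
        if m and current_header is not None:
            yield {**current_header, **m.groupdict()}


records = list(virtual_records(REPORT))

# A query over the virtual records, as if they were rows in a database.
deposits = [r for r in records if r["kind"] == "DEPOSIT"]
```

Each record carries the header fields alongside the transaction fields, so a query on a transaction returns it in context, which is the link a purely page-based search does not make.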
While that system is effective, it requires considerable user activity and entails certain inflexibilities and ambiguities. Thus, for example, the user must select from the report a collection of data blocks which the user believes should comprise a pattern, so that the user in effect determines the patterns manually before the system abstracts them from the sample data blocks selected by the user.
Also, because of the way patterns are defined, the system can allow a region of text in a report to be matched by more than one pattern, which implies that the same region of text can have different semantic meanings. Furthermore, there is little indication in the system when a region that is meant to match a pattern does not, which impairs confidence in the reliability of the extracted data. Also, there is no cross validation among the defined patterns, i.e., many patterns can overlap in definition and/or be exactly identical. Also, the hierarchies or "views" of the overall report abstracted by the system can overlap or even be exactly identical, which can impair the querying function.
During the data extraction process using the prior system, the construction of virtual records can be upset by interrupting patterns, such as headers which are repeated for readability rather than because they carry useful information. Also, the system does not allow interruptions between lines in a multi-line pattern. Such interruptions often occur in the form of page breaks and insignificant headers, thereby artificially forcing definition of separate patterns before and after the page break or header.
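The contiguity limitation can be made concrete with a small Python sketch. It is only an illustration under assumed data. The two-line pattern, the sample lines and the function name match_contiguous are hypothetical, and the matcher is a simplified stand-in for strictly contiguous matching rather than the prior system's actual logic.

```python
import re

# Hypothetical two-line pattern: a name line followed directly by an address line.
LINE1 = re.compile(r"^NAME:\s+(?P<name>.+)$")
LINE2 = re.compile(r"^ADDR:\s+(?P<addr>.+)$")

# Assumed sample: the second record is split by a page break (form feed)
# and a page header that is repeated only for readability.
LINES = [
    "CUSTOMER LIST            PAGE 1",
    "NAME: SMITH J",
    "ADDR: 12 ELM ST",
    "NAME: JONES A",
    "\f",
    "CUSTOMER LIST            PAGE 2",
    "ADDR: 9 OAK AVE",
]


def match_contiguous(lines):
    """Match the two-line pattern only where both lines are adjacent."""
    found = []
    for first, second in zip(lines, lines[1:]):
        m1, m2 = LINE1.match(first), LINE2.match(second)
        if m1 and m2:
            found.append({**m1.groupdict(), **m2.groupdict()})
    return found


matches = match_contiguous(LINES)
```

Only the first record is found. The second is missed because the page break and the repeated header fall between its two lines, which is the situation that forces separate patterns to be defined before and after the interruption.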