Often, key corporate performance information contained in electronic reports is simply not available in a lower-level system in a structured format. The primary reason is that substantial business logic often resides in report authoring tools. Such business logic could instead be located within ETL (Extract, Transform, and Load) processes so that the outputs would be available in a structured manner (typically in a relational database management system) to subscribing systems for presentation. There are many reasons why this is often not the case in the real world. With the high incidence of merger and acquisition activity, a corporation may at a given point in time have multiple ERP (Enterprise Resource Planning) systems, multiple data warehouses with disparate ETL tools and processes, and key business information residing in Excel® spreadsheets, Access® databases, and other similar data formats. Given this reality, it is not surprising that much of the integration required to present key performance information is ultimately accomplished within the reporting environment, where information can be integrated and cleansed much more rapidly than changes can be made to the data warehouses or source systems.
For the above reasons, reports are sometimes the best source from which a system can retrieve certain types of information. However, a problem is quickly encountered: most reporting tools do not provide a means to access the information as it appears in the report. For example, in the BUSINESS OBJECTS® report tool it is possible to obtain the data in the data provider, but that data is in its state prior to any calculation or formatting. It is likely that report tool vendors do not provide this capability because the report is considered to be the final output of the system, not a data source for higher-level presentation. Some companies have attempted to solve this problem of obtaining information from reporting systems by “scraping” a document that is intended primarily for viewing. Screen scraping has numerous limitations and does not allow the underlying data to be easily presented in different ways.
Many reporting systems are able to produce reports in HTML or other similar formats. Several systems have been developed for the purpose of converting HTML pages or other such documents to structured formats, such as XML. The immediate problem these systems encounter is that HTML is not a structured data source. Each of these systems suffers significant limitations when applied to documents with complex layouts and multi-dimensional relationships, such as business reports. These systems extract information from fairly simple HTML documents that are published on the Web and contain content that is semi-structured in one or more basic tables. Most of these systems rely on the structure of the document as the basis for evaluating the relationships between data elements within the document. While this is useful for fairly simple documents, especially those manually coded for the Web, the reliance on internal document structure breaks down completely for documents that have complex layouts with multiple dimensions, cross-tabs, and multiple nested tables.
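The limitation described above can be illustrated with a minimal sketch. It is not drawn from any particular prior system; the class `TableRows`, the function `rows_to_records`, and the sample markup are all hypothetical names and data chosen for illustration. The sketch assumes a structure-based extractor that treats the first table row as the header and pairs every later row with it by position, which works for a basic table but misreads a cross-tab whose header spans two rows.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_to_records(html):
    """Assume row 0 is the header and pair each later row with it by position."""
    parser = TableRows()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# A basic table: document structure and data relationships coincide.
SIMPLE = """
<table>
<tr><th>Region</th><th>Sales</th></tr>
<tr><td>East</td><td>100</td></tr>
<tr><td>West</td><td>250</td></tr>
</table>
"""

# A cross-tab: years span two columns each, quarters sit on a second header row.
CROSSTAB = """
<table>
<tr><th></th><th colspan="2">2003</th><th colspan="2">2004</th></tr>
<tr><th>Region</th><th>Q1</th><th>Q2</th><th>Q1</th><th>Q2</th></tr>
<tr><td>East</td><td>10</td><td>20</td><td>30</td><td>40</td></tr>
</table>
"""

simple_records = rows_to_records(SIMPLE)
crosstab_records = rows_to_records(CROSSTAB)
```

For the basic table the records come out as expected. For the cross-tab, the second header row is treated as data, the value 20 (2003, Q2) is attributed to 2004, and the values 30 and 40 are dropped, because the positional pairing knows nothing about the year and quarter dimensions.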
Many of the current systems do not consider the hierarchical nature of information in reports. Other systems that do treat information hierarchically still fail to capture the multi-dimensional nature of the information and often rely heavily on the underlying document structure to define the relationships between data elements. As a result, while they are able to map several columns of an HTML table to a tree, they are not able to handle multi-dimensional cross-tab reports with multiple nested tables. Thus, further advancements are needed in these areas.
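The difference between a hierarchical mapping and a multi-dimensional one can be sketched as follows. This is an illustrative assumption about the two data shapes, not a description of any existing system; `rows_to_tree`, `crosstab_to_cells`, and the sample figures are hypothetical. The first function maps the columns of a flat table to a tree, as the hierarchical systems described above do. The second shows what a cross-tab additionally requires: each value is identified by a path along the row dimension and a separate path along the column dimension.

```python
def rows_to_tree(rows):
    """Nest each row's leading columns as tree levels; the last column is the leaf value."""
    tree = {}
    for *path, value in rows:
        node = tree
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return tree


def crosstab_to_cells(row_headers, col_headers, values):
    """Key each cell by the pair (row path, column path), one path per dimension."""
    cells = {}
    for r, row_key in enumerate(row_headers):
        for c, col_key in enumerate(col_headers):
            cells[(row_key, col_key)] = values[r][c]
    return cells


# Flat table: a single hierarchy (area > country) with one value per leaf.
flat_rows = [
    ("Americas", "US", 120),
    ("Americas", "Canada", 30),
    ("Europe", "France", 45),
]
tree = rows_to_tree(flat_rows)

# Cross-tab: the same row hierarchy crossed with a year > quarter column hierarchy.
cells = crosstab_to_cells(
    row_headers=[("Americas", "US"), ("Europe", "France")],
    col_headers=[("2003", "Q1"), ("2003", "Q2"), ("2004", "Q1")],
    values=[[10, 20, 30], [5, 6, 7]],
)
```

A single tree has no natural place for the second hierarchy: the value 7 belongs both under Europe > France and under 2004 > Q1, which is why the second structure keys each cell on two independent paths.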