The present invention relates to data processing, and in particular, to report generation in online analytical processing (OLAP) systems.
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Relational databases are often used to support transactional data processing applications. For example, an invoicing application may use a relational database to store the invoice data. The database may store the invoice data in a table containing rows and columns. The columns define and categorize the invoicing data; for example, the columns may include “invoice number”, “invoice date”, “customer” and “invoice amount”. The rows correspond to the invoice data itself; for example, one row is that invoice number “12345” was generated on invoice date “Jan. 2, 2011” for customer “USPTO” for the amount of “$100.00”.
The system that implements a transaction application and its associated relational database is referred to as an online transaction processing (OLTP) system. In an OLTP system, there is a need for fast reads and writes of the transactional data. Queries may be performed on an OLAP system; however when the transactional data set is large, performing a query is slow and negatively impacts the performance of the OLTP system.
In order to perform queries without negatively impacting the performance of the OLTP system, online analytical processing (OLAP) systems were developed. (OLAP systems may also be referred to as data warehouses (DW), business warehouses (BW) or data marts (DM).) Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad-hoc queries with a rapid execution time. They borrow aspects of navigational databases and hierarchical databases that are faster than relational databases. The output of an OLAP query is typically displayed in a matrix (or pivot) format. The dimensions form the rows and columns of the matrix; the measures form the values.
Thus, data processing may be performed using both OLTP and OLAP systems together, where the OLTP system handles rapid transaction processing and the OLAP system handles rapid querying and report generation. In order to get the transactional data from the OLTP system into the OLAP system for analysis, traditionally an extraction-transformation-loading (ETL) process is performed. During extraction, the transactional data is extracted from the source OLTP systems. Extraction can be complex when the OLTP systems differ in data organization or data format. During transformation, the transactional data is transformed from the source formats to the desired destination format. For example, amounts in Euros from one OLTP system and amounts in Pesos from another OLTP system may be transformed into amounts in Dollars in the OLAP system. During loading, the transformed data is stored in appropriate data structures in the OLAP system. Since the ETL process involves accessing the OLTP system (which may negatively impact performance), often ETL is performed on a nightly basis when there is otherwise a low load on the OLTP system.
As mentioned above, OLTP systems generally use a relational database that is row oriented. On the other hand, OLAP systems generally use a multidimensional data format. One example of a multidimensional data format is the star schema. (A star schema is a special case of the snowflake schema.) A star schema includes one or more fact tables; each fact table includes one or more dimension tables. The fact table holds the main data. It includes a large amount of aggregated data, such as price and units sold. Dimension tables, which are usually smaller than fact tables, include the attributes that describe the facts. Often each attribute is a separate table for each dimension.
In existent business data analysis tools, to model object/relationship and to design a report is a quite difficult and time-consuming procedure. Even a simple job such as adding one key figure requires deep technical knowledge and takes great effort. More specifically, once the star schema has been defined in the OLAP system and ETL has been performed, to add the key figure may often involve modifying the star schema on the OLAP system, modifying the ETL process performed, modifying the relational data structure in the OLTP system, and performing ETL again once all the modifications have been performed.
To fully understand the situation, let's start by checking an example based on an example data warehouse system to illustrate the limitations and pain points of existent solutions.
Assume the data warehouse customer is Company ABC. ABC already has an analysis report to display the total value of purchase orders placed on each vendor. An example of this report is shown in FIG. 1.
The table in FIG. 1 displays the total value of purchase orders for each vendor. Information is limited in the purchasing area. No information in other areas is involved. E.g., some orders have been paid by ABC and posted to ABC's general ledger for financial accounting, and some orders haven't. A reader cannot tell how much has been paid from this report.
One day, for some reason, the CEO of ABC becomes interested in the influence of purchasing activities on financial statements. She would like to see the total amount of money which has already been paid and posted to the general ledger for each vendor. From business perspective, the requirement is quite simple: add one key figure “Total Amount Posted to General Ledger” to the report of FIG. 1. The new report should be as the table in FIG. 2.
However, that turns out to be a quite challenging task to the BW expert in ABC, who is responsible for supplying the report to the CEO.
The original analysis report of FIG. 1 is based on a BW schema shown in FIG. 3.
In the center of the star schema of FIG. 3 is a fact table Fact_Purchase_Order, with field “Purchase Order ID” as the table key. This fact table contains the purchase order information: how much value is put on each combination of vendor, product and sales organization. Vendor, product and sales organization are dimensions because they represent the measure of sales in one direction. Accordingly, there are three dimension tables (Dim_Vendor, Dim_Product, Dim_Purchase Organization) around the fact table. These dimension tables contain the detailed information of each dimension, such as vendor's country and product's standard price. All these tables are filled with data extracted from ABC's OLTP system according to a nightly ETL process.
However, when it comes to the CEO's new requirement, the BW experts are disappointed to find out that the necessary information to build the new report doesn't exist in the star schema of FIG. 3. That information only exists in ABC's OLTP system. To be more precise, the amount posted to general ledger is contained in financial documents, which are remotely linked to purchase orders and are not part of the star schema of FIG. 3. The relevant database tables of the OLTP system and their relationships are shown in FIG. 4.
Only OLTP experts can easily understand FIG. 4. To put it simpler, the basic logic is: Vendor 400 is referred to by Sales Order 402; Sales Order 402 is linked to Material Document (document of goods receipt) 404; Material Document 404 is linked to Financial Document 406. Most of the link is technically implemented as a foreign key relationship between database tables. But there's one exception: The link 408 between Material Document Item 404b and Financial Document Item 406b is implemented using complex program logic.
Upon current BW infrastructure, the BW experts have to remodel the star schema of FIG. 3 completely. They have to define new dimensions and adjust the original dimensions. The fact table has to be changed to contain the key figures of material documents and financial documents. That is challenging, because we can see that the data structure of the star schema of FIG. 3 is quite different from that of the database tables of the OLTP system of FIG. 4. The OLTP system's data structure of FIG. 4 is organized as a network in which all database tables in various modules are interlinked via relationships. A huge amount of manual effort is necessary to close the gap between the OLTP structure and the star schema.
A more difficult job is to extract data from the OLTP system to fill into the BW system. BW experts need to cooperate with ERP experts to come up with a complex program to do the ETL job. That program contains SQL statements to read data from different database tables, logic to process the data, and logic to navigate from one database table to another. That requires quite a lot of deep knowledge of the OLTP system's internal program logic and database structure.
What's more, even after the remodeling is done and ETL program is written, BW experts have to reload data from ERP to BW. Old data is replaced with new data. That procedure is quite time-consuming, involving quite a few time-consuming activities (database access and network communication). Normally system administrators make it an asynchronous background job and schedule this job to run in the nighttime, in order to avoid a big impact on the daily operation of the enterprise. Even for incremental data transfer, which is done every day after initial load, the impact is still so big that the ETL job has to be run in the night. Therefore, the CEO's best expectation is to be able to view the data which is only valid one day before.
As discussed above, although the business requirement is apparently simple, difficulties for technical implementation are huge. This fact reflects the major pain points of traditional data analysis tools. Let's summarize them as below:
1) Modeling in the OLAP system requires a huge amount of effort to adapt to the data structure of the data source (e.g., the OLTP system). It's especially difficult if the objects are remotely linked via many intermediate objects (e.g., five objects stay between Vendor 400 and Financial Document 406b tables). The user has to write a complex program to read and process data from various database tables. No wonder a small change usually takes a few weeks or months to be done.
2) Deep technical knowledge is critical to the success of such a project. In addition to the knowledge of modeling, a user has to understand the foreign key mapping between database tables in the data source system. Sometimes the program logic (e.g. the logic on 408 to link Material Document Item 404 and Financial Document Item 406) must be understood, too. That logic is more complex than the foreign-key-based relationship. That obviously goes beyond a business user's capacity.
3) Data updating is not fast enough. The analysis tools must go through the time-consuming ETL procedure to load the data. For a user expecting to see the near-real-time analysis report, that's far from sufficient.