Enterprise data warehouses serve as a vital platform on which various applications of several companies are embedded. The applications can include, for example, Business Intelligence (BI), Customer Relationship Management (CRM), and Enterprise Resource Planning (ERP). Because they are embedded in an enterprise-wide (or even world-wide) system landscape, two constraints often conflict: high volumes of data have to be processed, and only a well-defined, narrow time slot is available for processing them.
Efficient data processing that makes full use of the available hardware is a key requirement for improving the performance of a data warehouse and for reducing the time needed to provide the data.
As shown in FIG. 1A, a data warehouse application platform 100 will often have a two-tier architecture: one or more application servers 102 of an application layer 101 each host one or more data warehouse applications. The application servers 102 are connected to a database management system (DBMS) 104 of a database layer 103, and different tasks 106 running in parallel on each application server 102 have to process data read from the tables 108 stored on the DBMS 104. The DBMS 104 can include one or more servers. Meanwhile, data targets, such as InfoCubes, DataStore objects, etc., are modeled by data warehouse users to support different applications and decision making. The data to be processed is most often structured, and the metadata is composed from data models, in what is known as a model driven architecture. Thus, the semantics of the data are unknown from the perspective of the generic data warehouse application.
The data to be processed is often time dependent. For instance, as illustrated in FIG. 1B, if the data warehouse application extracts billing items from a source system, the order of modifications in the source system for one specific item has to be adhered to in order to calculate the right delta values. To support this requirement, technical keys (e.g., REQUEST, RECORDNUMBER) are used in addition to the semantic keys (e.g., billing number).
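The role of the technical keys can be illustrated with a minimal sketch. It assumes, purely for illustration, that each record carries an `amount` field and that a delta is the difference between successive amounts for the same billing number; the field names and the function `compute_deltas` are hypothetical and not taken from any particular data warehouse product.

```python
def compute_deltas(records):
    """Return (billing_number, delta) pairs in source-system order.

    Assumption: a delta is the change in 'amount' relative to the previous
    record with the same billing number (0 if there is none).
    """
    # The technical keys (REQUEST, RECORDNUMBER) restore the order of
    # modifications, regardless of the order in which records arrive.
    ordered = sorted(records, key=lambda r: (r["request"], r["record_number"]))

    last_amount = {}  # most recent amount seen per semantic key
    deltas = []
    for rec in ordered:
        key = rec["billing_number"]
        previous = last_amount.get(key, 0)
        deltas.append((key, rec["amount"] - previous))
        last_amount[key] = rec["amount"]
    return deltas
```

If the records for one billing number were applied in a different order, the individual deltas would change, which is why the sort on the technical keys precedes the calculation.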
Data processing steps (e.g., data activation in a DataStore object, data loads from one data target into another one, etc.) are critical for performance. To be able to deal with mass data, the data processing steps have to be distributed over multiple application servers using different tasks 106, and the data has to be split accordingly. Typically, one task 106 processes only a subset of the data, known as a data package, as shown in FIG. 2. To control the server workload, the user has to maintain the number of records to be processed by one task (the package size). Furthermore, to avoid data loss due to concurrent tasks, records for one specific semantic key must be in the same package. If these records were processed by different tasks in parallel, the ordering requirement described above for time-dependent data could be violated. As an example, as shown in FIG. 1B, records with record numbers 1, 2, 3 and 5 have to be processed by one task, while record number 4 can be processed by a different task.
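The packaging constraint can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any specific system: the function `build_packages`, the field names, and the greedy fill strategy are all hypothetical. Records that share a semantic key are kept together, and a new package is started whenever adding the next group would exceed the package size.

```python
from collections import defaultdict


def build_packages(records, package_size):
    """Split records into packages without separating a semantic key."""
    # Group by semantic key (hypothetical field 'billing_number').
    groups = defaultdict(list)
    for rec in records:
        groups[rec["billing_number"]].append(rec)

    packages, current = [], []
    for group in groups.values():
        # Keep source-system order inside each group via the technical keys.
        group.sort(key=lambda r: (r["request"], r["record_number"]))
        # Start a new package if this group does not fit into the current one.
        if current and len(current) + len(group) > package_size:
            packages.append(current)
            current = []
        current.extend(group)
    if current:
        packages.append(current)
    return packages
```

With five records in which record numbers 1, 2, 3 and 5 share one billing number and record number 4 has another, a package size of 4 yields two packages: one holding records 1, 2, 3 and 5, and one holding record 4. In this sketch a group larger than the package size is not split, so such a package exceeds the configured size; that is one possible design choice among several.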
In current data warehouse environments, task handling according to the conditions described above is one of the limiting factors. A task itself cannot determine the data package to be processed since, due to the model driven architecture, there is no selection criterion which ensures that all records for one specific semantic key are read, that records of any package size are processed, and that each record is processed by exactly one process.
Because a task is not able to select its own data, there needs to be one main process which creates the data packages and passes each package to a task. All tasks therefore depend on the main process, which restricts the degree of parallelization.