The term data warehouse refers generally to a central data collection (usually a database) whose content is composed of data from a plurality of frequently differing data sources. The data are usually copied from the data sources into the data warehouse and stored there on a long-term basis, primarily for data analysis and for the purpose of ensuring a superordinated data view.
The creation of a data warehouse is based on two governing principles. Firstly, the data are integrated in the data warehouse from distributed and frequently non-uniformly structured data stocks, in order to make possible a global view of the data and a superordinated evaluation based on said global view. Secondly, the use of a data warehouse permits a separation of those data that are used for operative matters (for example, in the context of short-lived transactions) from those data that are used in the data warehouse for the purposes of reporting, superordinated data analysis, etc.
In the past, the supplying of a data warehouse was usually effected on a periodic basis, for example in a batch process at the end of the month. In recent years, there has increasingly been a departure from periodic supplying towards supplying the data warehouse more or less in real time. The background to this development is the requirement of many sectors for immediately available data collections, whilst preserving the separation of operative (data-generating) systems on the one hand and evaluating (data-collecting) systems on the other hand.
Modern operative systems are frequently designed as OLTP systems. The term OLTP (On-Line Transaction Processing) refers to an approach to transaction-based data processing.
In this connection, a transaction is understood to be a sequence of logically coherent individual actions that are combined to form an indivisible unit. It is characteristic of a transaction that the individual actions combined therein are either performed in their entirety or not performed at all. Furthermore, a plurality of transactions may be performed in parallel without giving rise to interactions between them. Each individual transaction is therefore effected “in isolation” from the other transactions.
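The all-or-nothing property of a transaction described above can be sketched as follows. This is a minimal illustration using an in-memory SQLite database; the "accounts" table and the debit/credit actions are invented for the example and are not taken from the text:

```python
import sqlite3

# Minimal sketch of the all-or-nothing property of a transaction,
# using an in-memory SQLite database and a hypothetical "accounts" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE id = 1")
        # Simulated failure mid-transaction: the debit above must not survive alone.
        raise RuntimeError("failure before the matching credit")
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 100, 2: 50}: both individual actions were undone
```

Because the simulated failure occurs inside the transaction, neither of the two logically coherent individual actions takes effect; the sequence is performed in its entirety or not at all.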
A number of common characteristics of OLTP systems follow from the transaction paradigm. One of these common characteristics is that OLTP systems have multi-user capability. In the context of multi-user operation, a multiplicity of parallel transactions can be generated by different users. OLTP systems are of such design that the transactions are effected in real time (at least in the perception of the users). In addition, the transactions are usually short-lived and standardized, i.e. each OLTP system provides a series of predefined transaction types for different applications.
The data elements belonging to a transaction constitute a logical unit, and can be handled in a single data record or in interlinked data records. Provided that all data elements of a particular transaction that are relevant to the data warehouse are delivered together into the data warehouse, the data warehouse provides a view of the data contained therein that is consistent in respect of individual transactions. Particularly where periodic supplying of the data warehouse is provided for, such a transaction-consistent view can be ensured without difficulty.
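Delivery of all data elements of a transaction as one logical unit can be sketched as follows. The record layout and the field names ("txn_id", "elements") are illustrative assumptions only:

```python
# Illustrative sketch: the data elements belonging to one transaction are
# handled in a single data record and delivered as one unit.
record = {
    "txn_id": "T1",
    "elements": [                       # logically coherent individual actions
        {"action": "debit",  "account": "acct1", "amount": 80},
        {"action": "credit", "account": "acct2", "amount": 80},
    ],
}

warehouse = []
warehouse.append(record)  # the record arrives whole, never element by element

# At every instant, each transaction in the warehouse is either absent or complete.
print(warehouse[0]["txn_id"], len(warehouse[0]["elements"]))  # T1 2
```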
More problematic, however, is the case in which the data warehouse is to be supplied with transaction data in (at least approximately) real time. In this case, at any rate, there is no longer transaction-related consistency if the data elements belonging to a particular transaction are supplied to parallel processing branches before being delivered into the data warehouse, since passage through some processing branches is frequently more rapid than through others. Accordingly, the data elements belonging to a particular transaction would arrive in the data warehouse at different instants, depending on the processing branch through which said data elements had respectively passed. Then, however, a transaction-consistent view of the data provided by the data warehouse would no longer be ensured at any instant.
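The loss of consistency described above can be illustrated with a small simulation. The two branch latencies are invented purely to make one branch slower than the other:

```python
import threading
import time

warehouse = []          # records the arrival order of data elements
lock = threading.Lock()

def branch(element: str, latency: float) -> None:
    time.sleep(latency)  # some branches are traversed more slowly than others
    with lock:
        warehouse.append(element)

# Two data elements of the SAME transaction pass through different branches.
fast = threading.Thread(target=branch, args=("T1/debit", 0.01))
slow = threading.Thread(target=branch, args=("T1/credit", 0.2))
slow.start()
fast.start()

fast.join()
# At this instant the warehouse typically holds only part of transaction T1,
# so no transaction-consistent view exists.
slow.join()
print(warehouse)  # ['T1/debit', 'T1/credit']: the elements arrived at different instants
```

Between the two arrivals, any query against the warehouse would observe a half-delivered transaction, which is precisely the inconsistency at issue.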
In order to solve this problem, consideration might be given, in principle, to synchronizing the processing operations in the individual processing branches with each other on a transaction basis. In practice, however, it has been found that the synchronization mechanisms required for this task can only be realized with a comparatively large resource input.
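One conceivable form of such transaction-based synchronization is a barrier at which all branches processing elements of one transaction must meet before any element is delivered. The following is only a sketch of the idea, not a recommended design, and it already hints at the coordination overhead mentioned above:

```python
import threading

results = []                    # delivered data elements
barrier = threading.Barrier(2)  # one party per processing branch

def branch(element: str) -> None:
    # ... branch-specific processing would occur here ...
    barrier.wait()              # block until ALL branches of the transaction arrive
    results.append(element)     # only now may the element be delivered

threads = [threading.Thread(target=branch, args=(e,))
           for e in ("T1/debit", "T1/credit")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))  # ['T1/credit', 'T1/debit']
```

Every transaction forces all of its branches to wait for the slowest one, which is why such mechanisms require a comparatively large resource input at high transaction rates.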
Consideration might also be given to collecting in a database, downstream of the parallel processing branches, the data elements belonging to a particular transaction, and then transferring said data elements collectively into the data warehouse. However, the JOIN operations necessary at the database level for the collective transfer into the data warehouse require a large amount of computing resource, particularly if it is considered that, in the case of large companies such as banks, frequently over a thousand transactions per second must also be copied into the data warehouse.
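The JOIN-based variant can be sketched as follows: each branch writes its partial result into a staging table, and a JOIN on a transaction identifier reassembles complete transactions for the collective transfer. The table and column names are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE branch_a (txn_id TEXT PRIMARY KEY, debit  TEXT);
    CREATE TABLE branch_b (txn_id TEXT PRIMARY KEY, credit TEXT);
""")
# Branch A has processed T1 and T2; branch B has so far processed only T1.
conn.executemany("INSERT INTO branch_a VALUES (?, ?)",
                 [("T1", "acct1:-80"), ("T2", "acct3:-10")])
conn.execute("INSERT INTO branch_b VALUES ('T1', 'acct2:+80')")

# Only transactions complete in BOTH staging tables are transferred; this JOIN
# must be evaluated continuously, which is costly at over a thousand
# transactions per second.
complete = conn.execute("""
    SELECT a.txn_id, a.debit, b.credit
    FROM branch_a AS a JOIN branch_b AS b USING (txn_id)
""").fetchall()
print(complete)  # [('T1', 'acct1:-80', 'acct2:+80')]
```

Transaction T2 is withheld until its second partial result arrives, so the transfer remains transaction-consistent, but only at the price of repeated JOIN evaluation.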
The invention is based on the object of providing an efficient technique for continuously supplying transaction data to a data warehouse, which technique, on the one hand, can be realized without a disproportionately high resource input and, on the other hand, is able to offer a consistent data view.