In the prior art, such database systems that are suitable for both online transaction processing and online analytical processing have been designed as what is referred to as in-memory database systems. These database systems use main memory instead of disks to store transaction records. In order to provide both online transaction processing, which is write intensive, and online analytical processing, which is read intensive, the transaction records are organized into two data structures: a main store, which typically is a read-optimized, column-oriented data structure that uses data compression; and a differential buffer, which typically is a write-optimized and column-oriented data structure. The records stored in the differential buffer in conjunction with the records stored in the main store represents the current state of the data. The differential buffer also facilitates a column-oriented, but less compressed data structure that can be updated efficiently. To derive a consistent (=valid, correct, up-to-date) view, read operations must access the main store as well as the differential buffer, while data modifying operations manipulate the differential buffer only. A database architecture that uses a main store and a differential buffer in the above described way is referred to as a “Main/Delta-Architecture”.
The differential buffer grows with every write operation. This decreases read performance because an increased part of the data has to be accessed via a non read-optimized data structure, the differential buffer. To compensate for this effect, the differential buffer and the main store are merged from time to time within the so-called merge process. Such an approach of using a differential buffer for write operations and merging the data of the differential buffer with the main store is described in the book “In Memory Data Management” by Hasso Plattner and Alexander Zeier.
Analyzing records frequently requires calculating aggregates of columns. While in conventional disk-based databases calculating such aggregates is time and resource consuming, modern in-memory database system allow for calculating these aggregates on-the-fly. First, access to main memory is faster than access to disk. Further, column-oriented data structure of modern in-memory database system allows for even faster access to the data that is stored consequentially in a column because CPU manufacturers implement pre-fetching algorithms that will load co-located data into extremely fast CPU cache as soon as the first load is finished. This reduces one of the problems associated with loading data from memory, namely that loading data incurs additional loading latencies, which is why an executing program may frequently stall. This is advantageous because the alternative to calculating aggregates on-the-fly, which is materializing aggregates, requires both memory and computational costs, in particular for maintaining materialized aggregates whenever the base data is changed. Further, the feasibility of calculating aggregates on-the-fly provides the user with the flexibility to construct any imaginable query straightaway.
With these columnar in-memory databases, OLTP and OLAP applications do not have to be separated into different systems anymore. Having the most recent transactional information at hand allows for users of Business Intelligence (BI) applications to run reports always on the most recent data. In traditional information systems, BI applications operate on pre-aggregated data sets that have been calculated in the past to quickly provide data for certain predefined reports. This traditional approach limits the amount of freedom for the user to change the granularity or degree of information. Given a report that shows the sales of a product per country, if the user would like to drill down into a country to investigate, for example, the sales per region, this is only possible if the information has been pre-aggregated. Therefore, he or she would have needed to know a long time ago that this is a report he or she may want to look at. With in-memory technology it is not necessary to store pre-aggregated results in the database anymore. Any query can be calculated on the fly. This will fundamentally change the way management reporting will be done and decisions will be made in the future. Even for enterprise level dataset sizes, in-memory column-oriented databases allow to execute typical analytical queries within under a second.
Despite these tremendous advantages over traditional disk-based database systems, technical problems, however, may still arise in columnar in-memory databases based on a main-delta architecture:
Calculating analytical queries on-the-fly in less than a second is achieved by massive parallelization of data processing. Each processor core works on a subset of the total amount of data tuples. Therefore, even for a single query execution, there is a high level of hardware utilization. This leads to slower query execution when many queries are run in parallel. Because queries can be calculated on-the-fly and thus can be used in an explorative manner, an even increased workload of OLAP queries containing aggregation operations is expected. Accordingly, when both large amounts of data and many parallel user accesses must be handled or when processing huge amounts of data with complex query and join criteria, aggregation calculation may still require several seconds or even minutes. This is longer than the typical human reaction time, which is in the order of several hundred milliseconds, so that the user will perceive the calculation of the query as waiting time. Therefore, the user's mind may even start wandering of to other topics, which is a process that cannot consciously be controlled.
Calculation of queries can be significantly accelerated by using materialized aggregates with pre-calculated data. However, this speed-up comes at the cost of aggregates maintenance which is necessary to guarantee consistency when the underlying data changes. The underlying data is also referred to as the “base table(s)”. While strategies for maintaining cached aggregates are well established in academia and industry for conventional disk-based database systems, none of these known strategies takes into account the particularities of columnar in-memory database systems having a main store and a differential buffer.
Strategies of how to perform view invalidation are focus of the work by K. S. Candan, D. Agrawal, C. Divyakant, O. Po, W. S. Li, L. Oliver, and W.-P. Hsiung, “View invalidation for dynamic content caching in multitiered architectures”, 2002. They do not provide a specific database architecture, which is why they their invalidation strategy is very complex and comes at high computational costs.
In “Maintenance of Data Cubes and Summary Tables in a Warehouse”, I. S. Mumick, D. Quass, and B. S. Mumick present a concept for efficiently updating a materialized view. This concept, however, relates to an OLAP-only system. Therein, the stored materialized views are updated at regular intervals, typically at night when maintenance downtime of the OLAP-only system is acceptable. Because the system is an OLAP-only system that updates its materialized aggregate only at regular time intervals, this system cannot guarantee to provide up-to-date results. When a user submits a request relating to the materialized view to the system, in most cases, the system will return a result that is already outdated because the data forming the basis of the materialized aggregate, which is typically stored in a separate OLTP system, has already changed. Further, this system requires storing those records that have been inserted or modified since the last update of the materialized aggregate in duplicate: once in the underlying OLTP system for the purpose of archiving them and once in a separate store for the purpose of updating the materialized aggregate.