At present, data processing can be roughly divided into two categories: on-line transaction processing (OLTP) and on-line analytical processing (OLAP). OLTP mainly handles daily transaction processing, for example, bank transactions. OLAP is designed to satisfy specific query and reporting requirements in decision-support or multidimensional environments. Numerous applications, including OLAP, drive the emergence and development of data warehouse technology, and data warehouse technology, in turn, promotes the development of OLAP technology.
Input/output (I/O) is the biggest performance bottleneck in OLAP. When concurrent queries each access a fact table on a disk separately, the large number of random accesses produces a huge disk seek latency, greatly reducing the effective throughput of the disk. Currently, the mainstream technology for concurrent query processing is to share the I/O access to a fact table on a slow disk, thereby eliminating the contention among different query processing tasks for disk access. The technical key in this process is to build a concurrent query processing cost model on the shared I/O and to obtain the optimum load matching between the I/O latency and the latency of concurrent query processing on cached data. However, OLAP involves complex star-join operations, so the overall execution time of concurrent query processing varies from query to query and is hard to predict, and a unified concurrent query processing cost model cannot be obtained. In addition, in a conventional disk database, the dimension tables and temporary data structures such as HASH tables involved in query processing also require disk access, which further degrades the disk I/O performance.
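The idea of sharing the fact table I/O among concurrent queries can be illustrated with a minimal in-memory sketch. It is an assumption-laden illustration rather than any particular system's implementation: the row layout, the `shared_scan` function, and the query names are all hypothetical, and a Python list stands in for the fact table on disk.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical fact row layout: (date_key, product_key, revenue)
FactRow = Tuple[int, int, float]
# A query is a (predicate, measure) pair; both roles are illustrative.
Query = Tuple[Callable[[FactRow], bool], Callable[[FactRow], float]]


def shared_scan(fact_rows: Iterable[FactRow],
                queries: Dict[str, Query]) -> Tuple[Dict[str, float], int]:
    """Read the fact table once and feed every row to all registered queries."""
    totals = {name: 0.0 for name in queries}
    rows_read = 0
    for row in fact_rows:  # the single shared pass over the fact table
        rows_read += 1
        for name, (predicate, measure) in queries.items():
            if predicate(row):
                totals[name] += measure(row)
    return totals, rows_read


# Stand-in for a fact table stored on disk.
fact_table = [(1, 10, 5.0), (1, 11, 7.0), (2, 10, 3.0)]
queries = {
    "revenue_on_date_1": (lambda r: r[0] == 1, lambda r: r[2]),
    "revenue_of_product_10": (lambda r: r[1] == 10, lambda r: r[2]),
}
totals, rows_read = shared_scan(fact_table, queries)
```

In this sketch the number of rows read equals the size of the fact table regardless of how many queries are registered, whereas separate scans would read the table once per query.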
In the case of shared I/O, concurrent query processing faces three key technical challenges. The first challenge is to migrate the dimension table data required during query processing into memory, so as to eliminate or reduce contention with the fact table scan for disk I/O. The second challenge is to optimize the design of OLAP query processing algorithms, to develop a predictable query processing technology whose execution time is constant across diversified queries with different selectivities, different numbers of dimension table joins, and different query parameters, and thereby to eliminate the performance differences between queries. The third challenge is to build a reliable concurrent query processing cost model for shared I/O, to set a reasonable concurrent query load according to the database storage model (row store or column store) and the disk I/O performance (disk, SSD, or RAID), and to optimize the use of system resources.
A representative solution of the predictable query processing technology, IBM BLINK, pre-joins and compresses the dimension tables and the fact table through denormalization, so that a star-join operation in OLAP is converted into bit operation-based filtering and aggregate processing on row-compressed data. Each record has the same filtering cost, so the query processing performance is close to constant. This technical solution is applicable to a data warehouse in a completely read-only mode. However, for the increasingly common operational OLAP processing, the storage cost of the materialized data and the cost of fully reconstructing the data whenever a dimension table is updated affect the feasibility of the solution. In addition, the referential integrity constraints between fact table records and dimension table records cause a large amount of repeated data when the dimension tables are materialized, and the many duplicated values corresponding to the same dimension table primary key require a lot of duplicated predicate calculation in the materialized table, thereby reducing central processing unit (CPU) efficiency.
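The conversion of a star-join into bit operation-based filtering can be sketched as follows. This is a simplified illustration and not the actual BLINK storage format: the two-field layout, the 4-bit code widths, and the helper names `pack`, `make_equality_filter`, and `filter_and_sum` are assumptions made for the example. Each denormalized record carries its dimension attribute codes packed into one integer, and an equality predicate becomes a mask-and-compare with the same cost for every record.

```python
# Hypothetical packed layout: [ region code (4 bits) | year code (4 bits) ]
REGION_BITS = 4
YEAR_BITS = 4


def pack(region_code: int, year_code: int) -> int:
    """Pack two dictionary-encoded dimension attributes into one integer."""
    return (region_code << YEAR_BITS) | year_code


def make_equality_filter(region_code=None, year_code=None):
    """Build a (mask, pattern) pair for the equality predicates supplied."""
    mask = 0
    pattern = 0
    if region_code is not None:
        mask |= ((1 << REGION_BITS) - 1) << YEAR_BITS
        pattern |= region_code << YEAR_BITS
    if year_code is not None:
        mask |= (1 << YEAR_BITS) - 1
        pattern |= year_code
    return mask, pattern


def filter_and_sum(records, mask: int, pattern: int) -> int:
    """Apply one AND and one comparison per record, then aggregate."""
    total = 0
    for packed, measure in records:
        if (packed & mask) == pattern:
            total += measure
    return total


# Pre-joined (denormalized) records: (packed dimension codes, measure)
records = [(pack(1, 3), 10), (pack(2, 3), 20), (pack(1, 4), 5)]
region_1_total = filter_and_sum(records, *make_equality_filter(region_code=1))
```

Note that the region and year codes are stored again in every record, which mirrors the repeated data and duplicated predicate calculation described above.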
Another representative technical solution of the predictable query processing technology is CJOIN, which converts each dimension table into a shared HASH filter and adds a concurrent query predicate result vector to each record in the HASH filter to mark the query predicate expressions satisfied by that record. When a star-join operation in OLAP is performed, each record in the fact table is pushed through each HASH filter in turn, the queries satisfying all predicate conditions are selected through an AND bit operation on the query bit vectors, and the result set is distributed to the aggregator corresponding to each query, so as to complete the group-by aggregate calculation. This technical solution requires generating a public HASH table on each dimension table for a query group. Because each query has a different selectivity and different group-by attributes, the public HASH table contains a large number of dimensional attributes and a large number of records, and may even need to store all dimension table records. Such expansion of the public HASH table raises the cost of HASH filtering (HASH-join), increases the likelihood that the HASH table must be swapped to disk, degrades the average performance of queries, and makes the performance of each HASH filter difficult to predict. When the query selectivity is low, a large amount of data needs to be transferred between the HASH filters in a group query, and data is transferred between the HASH filters even when the final query bit vector is all zeros. In practice, however, only the queries corresponding to non-zero positions in the resulting query bit vector need the data transferred between the HASH filters, so a great deal of memory bandwidth is wasted.
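The shared HASH filter mechanism described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CJOIN implementation: the function names `build_hash_filter` and `cjoin_scan`, the sample tables, and the use of a plain sum in place of a per-query group-by aggregator are all hypothetical simplifications. Bit q of each vector records whether the dimension record satisfies the predicate of query q.

```python
def build_hash_filter(dim_rows, predicates):
    """Map each dimension key to a bit vector of the queries it satisfies.

    predicates[q] is the predicate of query q on this dimension, or None
    when query q places no condition on the dimension.
    """
    hash_filter = {}
    for key, attrs in dim_rows.items():
        bits = 0
        for q, predicate in enumerate(predicates):
            if predicate is None or predicate(attrs):
                bits |= 1 << q
        if bits:  # records satisfying no query are left out of the filter
            hash_filter[key] = bits
    return hash_filter


def cjoin_scan(fact_rows, filters, num_queries):
    """Push each fact record through every filter and AND the bit vectors."""
    totals = [0] * num_queries
    for keys, measure in fact_rows:
        bits = (1 << num_queries) - 1  # start with every query still alive
        for key, hash_filter in zip(keys, filters):
            # As in the description above, the record visits every filter
            # even if the running vector has already become all zeros.
            bits &= hash_filter.get(key, 0)
        for q in range(num_queries):  # distribute to per-query aggregators
            if bits & (1 << q):
                totals[q] += measure
    return totals


customers = {1: {"region": "ASIA"}, 2: {"region": "EUROPE"}}
dates = {10: {"year": 1997}, 11: {"year": 1998}}
# Query 0: customers in ASIA, any date. Query 1: any customer, year 1997.
customer_filter = build_hash_filter(
    customers, [lambda a: a["region"] == "ASIA", None])
date_filter = build_hash_filter(
    dates, [None, lambda a: a["year"] == 1997])
# Fact records: ((customer_key, date_key), measure)
fact_rows = [((1, 10), 100), ((2, 10), 50), ((1, 11), 30), ((2, 11), 7)]
totals = cjoin_scan(fact_rows, [customer_filter, date_filter], num_queries=2)
```

In the sample data the last fact record satisfies neither query, yet it is still probed against both filters, which is the kind of wasted work the paragraph above attributes to this approach.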