On-line analytical processing (OLAP) is designed to satisfy specific query and reporting requirements in a decision support or multi-dimensional environment. A data warehouse generally adopts a multi-dimensional model to store subject-oriented analytical datasets, mainly in a star-schema storage model having multiple dimension tables and a single fact table. The core of an OLAP query is the star-join: the fact table is joined with the multiple dimension tables, and group-by aggregate calculation is then performed on the join results. The join operation between the fact table and the dimension tables mainly adopts hash join technology. The key to hash join lies in improving the storage efficiency of the hash table and the efficiency of hash probing, and in reducing the latency of the hash join. In OLAP, optimizing the order in which the fact table is joined with the multiple hash tables is a key technique for improving the performance of OLAP query processing.
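The star-join described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the method of any particular system: the table contents, the column names (`lo_custkey`, `lo_datekey`, `revenue`) and the helper functions `build_hash_table` and `star_join` are all hypothetical. Each dimension table is filtered and loaded into a hash table, each fact tuple probes the hash tables in a chosen join order, and the surviving tuples are aggregated by group.

```python
from collections import defaultdict


def build_hash_table(dim_rows, key, predicate, group_attr):
    """Build phase: map each qualifying dimension key to its group-by attribute."""
    return {row[key]: row[group_attr] for row in dim_rows if predicate(row)}


def star_join(fact_rows, probes, measure):
    """Probe phase. `probes` is a list of (foreign_key_column, hash_table)
    pairs, listed in the order in which the hash tables are probed."""
    totals = defaultdict(int)
    for fact in fact_rows:
        group = []
        for fk_column, table in probes:
            key = fact[fk_column]
            if key not in table:
                break  # the fact tuple is eliminated by this dimension
            group.append(table[key])
        else:
            # Reached only when every probe succeeded: aggregate the measure.
            totals[tuple(group)] += fact[measure]
    return dict(totals)


# Hypothetical toy data: two dimension tables and one fact table.
customer = [
    {"c_id": 1, "region": "ASIA"},
    {"c_id": 2, "region": "EUROPE"},
    {"c_id": 3, "region": "ASIA"},
]
date_dim = [{"d_id": 10, "year": 1997}, {"d_id": 11, "year": 1998}]
lineorder = [
    {"lo_custkey": 1, "lo_datekey": 10, "revenue": 100},
    {"lo_custkey": 2, "lo_datekey": 10, "revenue": 50},
    {"lo_custkey": 3, "lo_datekey": 11, "revenue": 70},
    {"lo_custkey": 1, "lo_datekey": 10, "revenue": 30},
]

customer_ht = build_hash_table(
    customer, "c_id", lambda r: r["region"] == "ASIA", "region"
)
date_ht = build_hash_table(date_dim, "d_id", lambda r: True, "year")

result = star_join(
    lineorder, [("lo_custkey", customer_ht), ("lo_datekey", date_ht)], "revenue"
)
print(result)
```

Probing the most selective hash table first lets a fact tuple be discarded after fewer probes, which is why the join order matters.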
Because the dimension tables are smaller than the fact table, a typical conventional technique copies the dimension tables in full to every node and horizontally fragments the fact table, so that OLAP processing is performed locally on each fact table fragment and a global reduce is then performed on the local OLAP aggregate results. This technical solution, on the one hand, incurs a large number of redundant dimension table copies and, on the other hand, requires high synchronization overhead when the dimension tables are updated in a real-time OLAP application, which makes it hard to satisfy the requirements of real-time OLAP.
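The conventional scheme above can be sketched as follows. This is a simplified single-process illustration, assuming hypothetical data and helper names (`local_olap`, `global_reduce`); in a real deployment each fragment and each dimension copy would reside on a separate node.

```python
from collections import Counter


def local_olap(fact_fragment, dim_copy):
    """Join one fact fragment with the node's full dimension copy and
    aggregate locally. Each fact row is (customer_key, revenue)."""
    partial = Counter()
    for cust_key, revenue in fact_fragment:
        region = dim_copy.get(cust_key)
        if region is not None:  # drop fact rows with no matching dimension row
            partial[region] += revenue
    return partial


def global_reduce(partials):
    """Merge the local aggregate results of all nodes."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the counts together
    return dict(total)


# Hypothetical data: one dimension table and one fact table.
dimension = {1: "ASIA", 2: "EUROPE"}
fact = [(1, 100), (2, 50), (1, 30), (2, 20), (3, 5)]

num_nodes = 2
# Horizontal fragmentation of the fact table (round-robin here).
fragments = [fact[i::num_nodes] for i in range(num_nodes)]
# Full copy of the dimension table on every node: this is the redundancy
# that must be kept in sync whenever the dimension table is updated.
node_copies = [dict(dimension) for _ in range(num_nodes)]

partials = [local_olap(f, d) for f, d in zip(fragments, node_copies)]
totals = global_reduce(partials)
print(totals)
```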
To reduce the network transmission cost of parallel join operations, some database systems collaboratively partition (by hash or range partitioning) the fact table and the dimension tables on their join key values, so that corresponding primary key and foreign key values in the fact table and the dimension tables joined to it are distributed according to the same partition function. Tuples of the fact table and the dimension tables that join with each other are thus allocated to the same node in advance, which reduces the network transmission cost during the join operation. However, for the multi-dimensional data model of the data warehouse, partitioning according to multiple dimensions is very inefficient, it is difficult to realize collaborative distribution on the star-join structure of the fact table and the multiple dimension tables, and the dimension table partitions distributed on different nodes also incur a huge synchronization cost during updates.
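A minimal sketch of collaborative partitioning on a single join key is given below. The partition function `node_for` (a simple modulo on integer keys), the table contents and the helper names are assumptions chosen for illustration only.

```python
def node_for(key, num_nodes):
    """Hypothetical partition function shared by both tables (integer keys)."""
    return key % num_nodes


def partition(rows, key_index, num_nodes):
    """Distribute rows over nodes according to the value at key_index."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[node_for(row[key_index], num_nodes)].append(row)
    return parts


def local_join(fact_part, dim_part):
    """Join performed entirely on one node; no tuple crosses the network."""
    lookup = dict(dim_part)
    return [(order_id, lookup[cust], amount) for order_id, cust, amount in fact_part]


# Hypothetical data: dimension rows are (customer_key, region),
# fact rows are (order_id, customer_key, amount).
customers = [(1, "ASIA"), (2, "EUROPE"), (3, "ASIA"), (4, "AMERICA")]
orders = [(100, 1, 9.5), (101, 2, 3.0), (102, 3, 7.25), (103, 4, 1.0), (104, 1, 2.5)]

NUM_NODES = 2
dim_parts = partition(customers, 0, NUM_NODES)  # partitioned on the primary key
fact_parts = partition(orders, 1, NUM_NODES)    # partitioned on the foreign key

joined = [local_join(f, d) for f, d in zip(fact_parts, dim_parts)]
print(joined)
```

The fact table can be physically partitioned on only one foreign key at a time, which illustrates why this approach is hard to extend to a star-join over several dimension tables.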
For a small dimension table and low selectivity (that is, when few dimension tuples satisfy the query condition), dynamic data distribution is generally implemented by broadcasting over the network either the sub-tables of the dimension table that satisfy the condition or the corresponding hash tables. However, in an OLAP query load, the selectivity on the dimension table is relatively high, so the network cost of broadcasting is high. Hadoop, on the other hand, is a software platform capable of performing distributed processing on massive data, and HDFS (Hadoop Distributed File System) is its distributed file system. Hadoop defines Map and Reduce tasks for completing the sub-tasks of OLAP query processing. During MapReduce star-join processing, the large volume of materialized data and data distribution occupies a large amount of disk I/O and network bandwidth, which greatly affects the overall performance.
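The dependence of broadcast cost on selectivity can be made concrete with a rough estimate. The formula and all figures below (row count, bytes per row, number of nodes) are assumptions for illustration, not measurements: the filtered sub-table is assumed to be sent once to every other node.

```python
def broadcast_bytes(dim_rows, selectivity, bytes_per_row, num_nodes):
    """Rough volume of data sent when one node broadcasts the qualifying
    dimension tuples to all other nodes (hypothetical cost model)."""
    qualifying_rows = round(dim_rows * selectivity)
    return qualifying_rows * bytes_per_row * (num_nodes - 1)


# Hypothetical scenario: 1,000,000 dimension rows of 32 bytes, 11 nodes.
low = broadcast_bytes(1_000_000, 0.01, 32, 11)   # 1% of the rows qualify
high = broadcast_bytes(1_000_000, 0.50, 32, 11)  # 50% of the rows qualify
print(low, high)
```

Under this model the broadcast volume grows linearly with both the fraction of qualifying tuples and the number of receiving nodes.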
Improving Hadoop performance involves mainly two aspects: one is improving the local data processing performance, and the other is improving the network transmission performance. The local data processing performance includes the I/O performance and the CPU performance during processing. To improve the I/O performance of the Hadoop platform during processing, a column store model is introduced into the Hadoop platform.
Chinese Patent Application No. 201010546473.3 discloses a Hadoop-based massive stream data storage and query method and system. The method includes the following steps: constructing a segment-level column clustered storage structure by storing stream data as column clustered records in turn, compressing the column clustered records from front to back to obtain a compressed data page, writing the compressed data page into a piece of column clustered data, and appending page summary information of the compressed data page to the tail of the column clustered data, so as to obtain a complete data segment; and, while executing a query statement, constructing a scan table according to a filtering condition by using the page summary information stored at the tail of the data segment, so as to filter the data quickly. In terms of the underlying compression algorithm, the data compression technology involved in that application does not differ in essence from the compression technology adopted by a column store database; only the field of application differs.
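The general idea of filtering through page summaries stored at the tail of a segment can be sketched as below. This is not the layout of the cited application, whose details are not given here: the segment structure, the use of minimum and maximum values as the page summary, and the choice of `zlib` and `json` for page compression are all assumptions made for illustration.

```python
import json
import zlib


def build_segment(values, page_size):
    """Compress the values page by page and keep one summary per page
    at the tail of the segment (here: the page's min and max)."""
    pages, summaries = [], []
    for start in range(0, len(values), page_size):
        chunk = values[start:start + page_size]
        pages.append(zlib.compress(json.dumps(chunk).encode("utf-8")))
        summaries.append({"min": min(chunk), "max": max(chunk)})
    return {"pages": pages, "tail": summaries}


def scan(segment, low, high):
    """Build a scan table from the tail summaries, then decompress only
    the pages whose value range can overlap [low, high]."""
    scan_table = [
        i for i, s in enumerate(segment["tail"])
        if s["max"] >= low and s["min"] <= high
    ]
    hits = []
    for i in scan_table:
        chunk = json.loads(zlib.decompress(segment["pages"][i]).decode("utf-8"))
        hits.extend(v for v in chunk if low <= v <= high)
    return scan_table, hits


# Hypothetical column of nine values split into three pages.
segment = build_segment([3, 5, 8, 12, 15, 19, 40, 42, 47], page_size=3)
pages_read, hits = scan(segment, 10, 20)
print(pages_read, hits)
```

Pages whose summary range cannot overlap the filtering condition are never decompressed, which is where the I/O and CPU saving comes from.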
For optimizing the I/O performance, one solution is to port mature column store compression technology to Hadoop, so as to improve the storage efficiency and performance of Hadoop; the other solution is to introduce a column store database, as a complete storage engine, into the Hadoop system to serve as an auxiliary storage engine, so as to improve the I/O performance from the perspective of system integration.
Column store technology adopts an access mode of one column at a time, so OLAP query processing needs to generate a materialized join index or a join result bitmap to indicate the positions of the data in the join column that satisfy the join condition. Row store technology adopts an access mode of one row at a time, and OLAP query processing generally adopts a pipeline mode to eliminate the cost of materializing data between join tables, but needs to transfer the local results of the join operation between pipeline stages. Therefore, multi-dimensional query join optimization on the basis of the storage model needs to better combine the I/O performance of the column store with the query processing efficiency of the row store, so as to further improve the local data processing performance through query optimization technology.
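The two access modes contrasted above can be compared in a small sketch. The data, the column names and the functions `column_at_a_time` and `row_at_a_time` are hypothetical, and the dimension-side filtering is assumed to have already produced a set of qualifying keys per dimension. The column-wise variant materializes one join result bitmap per foreign key column, whereas the row-wise variant pipelines each row through all join conditions without materializing anything.

```python
def join_bitmap(fk_column, qualifying_keys):
    """One pass over a single foreign key column: mark the positions
    whose key satisfies the join condition."""
    return [key in qualifying_keys for key in fk_column]


def column_at_a_time(columns, filters, measure):
    """Column store style: materialize a bitmap per join column, combine
    the bitmaps, then read the measure column only at the set positions."""
    bitmap = [True] * len(columns[measure])
    for fk_name, keys in filters.items():
        column_bitmap = join_bitmap(columns[fk_name], keys)
        bitmap = [a and b for a, b in zip(bitmap, column_bitmap)]
    total = sum(v for v, keep in zip(columns[measure], bitmap) if keep)
    return total, bitmap


def row_at_a_time(rows, filters, measure):
    """Row store style: each row flows through all join conditions in a
    pipeline; no intermediate bitmap is materialized."""
    total = 0
    for row in rows:
        if all(row[fk_name] in keys for fk_name, keys in filters.items()):
            total += row[measure]
    return total


# Hypothetical fact data stored column-wise, plus the same data as rows.
columns = {
    "custkey": [1, 2, 3, 1],
    "datekey": [10, 10, 11, 12],
    "revenue": [100, 50, 70, 30],
}
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]
# Keys that survived the dimension-side predicates (assumed given).
filters = {"custkey": {1, 3}, "datekey": {10, 11}}

col_total, bitmap = column_at_a_time(columns, filters, "revenue")
row_total = row_at_a_time(rows, filters, "revenue")
print(col_total, bitmap, row_total)
```

Both variants compute the same aggregate; they differ in what is read from storage and in what intermediate state must be kept, which is the trade-off the paragraph describes.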