Currently, a representative big data query system (for example, Hive, Shark, or Impala) uses a single query statement as the base unit of parsing and optimization. The basic query processing procedure of such a system is as follows: first, a single read-in query statement is parsed into a logical query plan tree; then, an implementation algorithm is selected for each operator of the logical query plan tree, and an execution sequence of these operators is determined, so as to convert the logical query plan tree into a physical query plan; finally, a query execution engine executes the physical query plan and outputs a query result.
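The three-stage procedure described above can be illustrated with a minimal sketch. It is not the implementation of Hive, Shark, or Impala: the `LogicalOp` class, the `ALGORITHMS` table, and the functions `to_physical_plan` and `execute` are hypothetical names introduced only for illustration, the parsing stage is replaced by a hand-built tree, and the "engine" merely returns a trace of the steps it would run.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LogicalOp:
    """One operator of a logical query plan tree (hypothetical structure)."""
    kind: str                                   # e.g. "scan", "filter", "aggregate"
    arg: str = ""                               # table name, predicate, or expression
    children: List["LogicalOp"] = field(default_factory=list)


# Assumed mapping from a logical operator to one implementation algorithm.
ALGORITHMS = {
    "scan": "table_scan",
    "filter": "row_filter",
    "join": "hash_join",
    "aggregate": "hash_aggregate",
}


def to_physical_plan(root: LogicalOp) -> List[Tuple[str, str]]:
    """Select an algorithm per operator and fix an execution sequence.

    A post-order traversal is used so that every operator is scheduled
    after the operators that produce its input.
    """
    plan: List[Tuple[str, str]] = []

    def visit(node: LogicalOp) -> None:
        for child in node.children:
            visit(child)
        plan.append((ALGORITHMS[node.kind], node.arg))

    visit(root)
    return plan


def execute(plan: List[Tuple[str, str]]) -> List[str]:
    """Stand-in for a query execution engine: returns a trace of the steps."""
    return [f"{algorithm}({arg})" for algorithm, arg in plan]


# Hand-built tree standing in for the parsed form of a statement such as
# "SELECT count(*) FROM orders WHERE price > 10".
tree = LogicalOp("aggregate", "count(*)", [
    LogicalOp("filter", "price > 10", [
        LogicalOp("scan", "orders"),
    ]),
])

plan = to_physical_plan(tree)
trace = execute(plan)
```

In this sketch each statement is converted and executed in isolation, which mirrors the single-statement processing mode described above.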
The key performance indicator of a big data query system is query efficiency. Currently, the most frequently used method for improving the query efficiency of a big data query system is to equivalently transform the logical query plan tree, so as to reduce the quantity of tasks in the physical query plan and to reduce the execution overhead of those tasks (which includes reducing the read/write frequency of the file system, controlling the amount of data transmitted over the network, reducing the calculation amount of the query operation, and the like). However, in a data warehouse (DWH) batch query scenario, the conventional processing mode, in which a single query statement is used as the base unit of parsing and optimization, offers insufficient optimization opportunities. For example, the task flow correlation optimization newly added in Hive version 0.12 accelerates only three of the 22 query cases in the standard test set Transaction Processing Performance Council benchmark H (TPC-H), and it imposes specific requirements on how those cases are written.
The abundant inter-query optimization opportunities present in the data warehouse batch query application scenario are in sharp contrast to the insufficient intra-query optimization opportunities. An inter-query optimization opportunity is an optimization opportunity that exists between multiple query statements. In the batch query application scenario, the probability that similar query statements exist among the submitted query statements is relatively high, and therefore a large quantity of inter-query optimization opportunities exists. However, in the prior art, query optimization is performed only on a single query statement, so these opportunities go unused. As a result, the query efficiency of the big data query system is low.
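To make the notion of an inter-query optimization opportunity concrete, the following sketch looks for sub-plans that appear in more than one query of a batch. It is only an illustration under stated assumptions, not a method taken from the prior art discussed here: each plan node is assumed to be a tuple `(operator, argument, children)`, and `signature` and `shared_subplans` are hypothetical helper names. Two subtrees are treated as similar only when their canonical strings match exactly, including the order of children.

```python
from collections import defaultdict


def signature(node):
    """Canonical string for a subtree; node = (operator, argument, children)."""
    op, arg, children = node
    return f"{op}[{arg}](" + ",".join(signature(c) for c in children) + ")"


def shared_subplans(batch):
    """Return the subtree signatures that occur in at least two queries.

    The result maps each such signature to the set of indices of the
    queries in `batch` that contain it.
    """
    seen = defaultdict(set)

    def visit(node, query_index):
        seen[signature(node)].add(query_index)
        for child in node[2]:
            visit(child, query_index)

    for query_index, root in enumerate(batch):
        visit(root, query_index)

    return {sig: ids for sig, ids in seen.items() if len(ids) > 1}


# Two hypothetical queries of one batch that filter the same table in the
# same way but aggregate differently.
scan_orders = ("scan", "orders", [])
filtered = ("filter", "price > 10", [scan_orders])
query_0 = ("aggregate", "count(*)", [filtered])
query_1 = ("aggregate", "sum(price)", [filtered])

shared = shared_subplans([query_0, query_1])
```

Here the scan and the filter are common to both queries, whereas an optimizer that handles one statement at a time would plan and run them twice.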