The MapReduce framework has attracted wide interest because of its simple programming model and its good scalability across thousands of commodity machines. It allows users to write their own Map and Reduce functions and provides a general framework for parallelizing those functions. To enable fine-grained fault tolerance, all intermediate data passed from Map to Reduce are stored on disk. This characteristic distinguishes MapReduce from traditional database systems, which pipeline intermediate data from one operator to the next. Hadoop is one implementation of the MapReduce model and is widely used by the community.
Because Map and Reduce functions are written in low-level languages, such as Java in Hadoop, they are troublesome for developers who are not familiar with those languages, and the functions are difficult to reuse. To simplify the expression of MapReduce (MR) programs, many systems, such as Hive, Pig, Impala, and Presto, support a high-level language (e.g., SQL queries) over data stored in Hadoop. We call them SQL-on-Hadoop systems. Those that are built on MapReduce, such as Hive and Pig, convert an SQL-like query into a DAG (Directed Acyclic Graph) of MapReduce jobs, which are then executed one by one by a MapReduce system such as Hadoop.
Some SQL-on-Hadoop systems target batch processing of queries. Since different queries often perform similar work, a great deal of redundant computation is carried out when a batch of queries is executed. For example, when multiple MR jobs read the same input file, they can share the data scan to reduce I/O cost. Beyond sharing scans, there are other sharing opportunities among jobs, such as sharing map outputs and map functions. It is therefore useful to apply the idea of multiple-query optimization to such systems, so that redundant computation is avoided when multiple jobs are processed. Here, multi-query optimization means grouping multiple jobs into a single MR job in order to reduce the overall computation time of all jobs.
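The scan-sharing idea can be illustrated with a small in-memory sketch. It is not Hadoop code: the function names `run_separately` and `run_shared_scan`, the job definitions, and the sample records are all hypothetical, and the "scan cost" is simply a count of records read.

```python
from collections import defaultdict

def run_separately(records, jobs):
    """Run each job in its own pass over the input.

    Returns (results, records_read)."""
    results, reads = {}, 0
    for name, (map_fn, reduce_fn) in jobs.items():
        groups = defaultdict(list)
        for rec in records:              # every job scans the input again
            reads += 1
            for key, value in map_fn(rec):
                groups[key].append(value)
        results[name] = {k: reduce_fn(v) for k, v in groups.items()}
    return results, reads

def run_shared_scan(records, jobs):
    """Run all jobs in a single pass; map outputs are tagged with the job name."""
    groups, reads = defaultdict(list), 0
    for rec in records:                  # one scan shared by all jobs
        reads += 1
        for name, (map_fn, _) in jobs.items():
            for key, value in map_fn(rec):
                groups[(name, key)].append(value)
    results = {name: {} for name in jobs}
    for (name, key), values in groups.items():
        results[name][key] = jobs[name][1](values)
    return results, reads

# Hypothetical input and two jobs that read the same file.
records = [("a", 3), ("b", 5), ("a", 7)]
jobs = {
    "sum_by_key": (lambda r: [(r[0], r[1])], sum),
    "count_by_key": (lambda r: [(r[0], 1)], sum),
}

separate, reads_separate = run_separately(records, jobs)
shared, reads_shared = run_shared_scan(records, jobs)
print(reads_separate, reads_shared)
```

Both strategies produce the same per-job results, but the merged version reads each record once instead of once per job. The tagged map output is also larger than any single job's output, which is one reason merging is not automatically a win.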
Several existing works propose sharing techniques for a batch of MR jobs. They build cost models to estimate whether a system can gain from grouping a batch of MR jobs. Almost all of these cost models consider only the sharing of input scans and map outputs, on the assumption that I/O cost dominates the total computation time. They do not take the sharing of map functions into account, which can make the models inaccurate for SQL-on-Hadoop systems, where a map function is composed of a DAG of relational operators, is computationally heavy, and cannot be ignored. In addition, a multiple-query optimizer for Hive has been proposed recently. That work uses a rule-based method to rewrite multiple queries into a single insert query that can be executed by Hive. It takes advantage of shared scans and shared map functions, but its sharing rules are too simple: it considers only the sharing of identical join operations and assumes that such sharing is always beneficial.
Therefore, our invention aims to provide a method for optimizing multiple queries processed in SQL-on-Hadoop systems. Each map function in an SQL-on-Hadoop system is a part of the overall query plan; it is composed of a sequence of relational operators, and the order of some of these operators can be exchanged. Based on this characteristic, the proposed method defines the problem of optimizing multiple MR jobs as finding optimal groups of MR jobs together with an optimal query plan within each group. A cost model for the overall computation time of a batch of MR jobs is created that considers both the I/O cost of reading, writing, and sorting data and the CPU cost of processing the DAGs of relational operators in map functions. To find an optimal integrated query plan within each group, the method derives rules that reduce the search space, based on the property that each individual query plan is locally optimal. A greedy algorithm is then used to group the jobs. By applying a more accurate cost model in this way, more suitable jobs can be merged into a single job, and the overall computation time is reduced.
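As a rough illustration of cost-based greedy grouping, the sketch below merges jobs only while a toy cost model predicts a saving. It is not the cost model of the invention: the cost formula, the sizes in `INPUT_SIZE`, the operator costs in `OP_CPU`, and the three sample jobs are all assumptions made for the example, and the reordering of operators within a group is left out entirely.

```python
import math
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Job:
    name: str
    input_file: str
    operators: frozenset   # relational operators in the map function
    map_output: float      # map output size, arbitrary units

# Hypothetical statistics, for illustration only.
INPUT_SIZE = {"orders": 100.0, "users": 20.0}
OP_CPU = {"filter_2024": 30.0, "join_users": 80.0,
          "project_a": 5.0, "project_b": 5.0, "project_users": 5.0}

def group_cost(group):
    """Toy cost of running a tuple of jobs as one merged MR job."""
    files = {j.input_file for j in group}
    ops = set().union(*(j.operators for j in group))
    scan = sum(INPUT_SIZE[f] for f in files)   # each file is read once
    cpu = sum(OP_CPU[o] for o in ops)          # shared operators run once
    out = sum(j.map_output for j in group)
    sort = out * math.log2(1.0 + out)          # superlinear sort cost (assumed)
    return scan + cpu + sort

def greedy_grouping(jobs):
    """Repeatedly merge the pair of groups with the largest positive saving."""
    groups = [(job,) for job in jobs]
    while len(groups) > 1:
        best_saving, best_pair = 0.0, None
        for a, b in combinations(groups, 2):
            saving = group_cost(a) + group_cost(b) - group_cost(a + b)
            if saving > best_saving:
                best_saving, best_pair = saving, (a, b)
        if best_pair is None:                  # no merge reduces the cost
            break
        a, b = best_pair
        groups = [g for g in groups if g is not a and g is not b] + [a + b]
    return groups

jobs = [
    Job("q1", "orders", frozenset({"filter_2024", "join_users", "project_a"}), 10.0),
    Job("q2", "orders", frozenset({"filter_2024", "join_users", "project_b"}), 10.0),
    Job("q3", "users", frozenset({"project_users"}), 200.0),
]
groups = greedy_grouping(jobs)
print([[j.name for j in g] for g in groups])
```

With these made-up numbers, q1 and q2 are merged because they share the scan of `orders` and two heavy operators, while q3 stays alone because merging it would enlarge the sort without sharing anything. Including a CPU term for operators is what lets the model see the benefit of the shared filter and join, which an I/O-only model would miss.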