Distributed query processing refers to the task of executing a database query across a collection of distinct processing nodes, such as computer systems connected via a computer network. One framework for implementing distributed query processing is known as a parallel data warehouse. In a typical parallel data warehouse, each processing node is a dedicated, high-end server that maintains a local storage component. The local storage component stores a subset, or partition, of the data in the parallel data warehouse. When the parallel data warehouse receives a database query, an optimizer of the parallel data warehouse translates the query into a tree of relational query operators referred to as a query plan. The optimizer then divides the query plan into sections and schedules each section for execution on one or more of the processing nodes. At query runtime, data flows from the database tables (as stored in the server-specific storage components) and between the various processing nodes according to the hierarchical structure of the query plan. The processing node(s) executing the top-most (i.e., root) section of the query plan then output the final result of the query.
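The division of a query plan into sections can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from any particular warehouse product: the `Operator` class, the `is_boundary` flag used to mark where the plan is cut, and the round-robin policy in `schedule`. For simplicity the sketch assigns each section to a single node, although a section may in general be scheduled on several nodes.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Operator:
    name: str
    children: list = field(default_factory=list)
    is_boundary: bool = False  # hypothetical marker: the plan is cut below this operator


def split_into_sections(root):
    """Return {section_id: [operator names]}; section 0 contains the root."""
    sections = {}
    ids = count()

    def visit(op, section_id):
        sections.setdefault(section_id, []).append(op.name)
        for child in op.children:
            # Each subtree below a boundary operator becomes a new section.
            visit(child, next(ids) if op.is_boundary else section_id)

    visit(root, next(ids))
    return sections


def schedule(sections, nodes):
    """Assign each section to one node, round-robin, before execution starts."""
    return {sid: nodes[i % len(nodes)] for i, sid in enumerate(sorted(sections))}


# Example plan: Aggregate <- Join <- two scans, each behind a boundary operator.
plan = Operator("Aggregate", [
    Operator("Join", [
        Operator("Exchange", [Operator("Scan(orders)")], is_boundary=True),
        Operator("Exchange", [Operator("Scan(customers)")], is_boundary=True),
    ]),
])

sections = split_into_sections(plan)
assignment = schedule(sections, ["node-a", "node-b"])
```

In this example the root section (section 0) holds the aggregate, the join, and both boundary operators, while each scan forms its own section. Data would flow from sections 1 and 2 up to section 0, whose node outputs the final result.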
Parallel data warehouses generally provide better performance than other existing distributed query processing solutions, and thus are widely used for data-intensive and complex query workloads. However, parallel data warehouses also suffer from a number of limitations. For example, since a typical parallel data warehouse is composed of high-end servers, the hardware costs for deploying and maintaining such a warehouse are relatively high. Further, because data flows directly from one processing node to another, parallel data warehouses are not robust against processing node failures; if any processing node fails during a long-running query, the entire query must be restarted from the beginning. Yet further, the optimizer of a parallel data warehouse schedules the execution of query plan sections on processing nodes only before a query starts; once query execution begins, this scheduling cannot be changed. As a result, there is no way to dynamically reallocate work among the scheduled processing nodes (or to new processing nodes) to account for, e.g., data skew in the input data, fluctuating compute/memory resources, and/or other runtime factors that may unexpectedly slow down query processing.
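The effect of data skew under a fixed, pre-execution assignment can be sketched as follows. This is an illustration under stated assumptions, not a model of any real optimizer: the function `static_assign`, the modulo partitioning rule, and the premise that each row costs one unit of work are all hypothetical.

```python
from collections import Counter


def static_assign(keys, num_nodes):
    """Map each row to a node by key, with the mapping fixed before execution."""
    load = Counter()
    for key in keys:
        load[key % num_nodes] += 1  # assumed partitioning rule: key modulo node count
    return [load[n] for n in range(num_nodes)]


# Skewed input: customer 4 accounts for 90 rows, customers 0..9 add one row each.
keys = [4] * 90 + list(range(10))

loads = static_assign(keys, num_nodes=4)
finish_time = max(loads)           # the query finishes when the slowest node does
balanced_time = len(keys) / 4      # what an even split of the same rows would cost
```

Here `loads` is `[93, 3, 2, 2]`: one node carries 93 of the 100 rows, so the query takes roughly 93 units of time rather than the 25 an even split would allow, and a scheduler that cannot move work at runtime has no way to close that gap.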
Another framework for implementing distributed query processing is known as the MapReduce (MR) model. Unlike a parallel data warehouse, an MR system generally utilizes a large number of heterogeneous, relatively inexpensive processing nodes (e.g., desktop computers) that share access to a single, distributed cloud of data (e.g., a distributed file system). When an MR system receives a database query, the system decomposes the query into a series of jobs referred to as “map” or “reduce” jobs. Each job is allocated to one or more of the processing nodes. To enable data passing, each processing node writes the results of its job as a set of files to the cloud. These files are then read from the cloud by other processing nodes whose jobs are dependent on the previous job. This process continues until all of the jobs have completed, at which point a final query result is available.
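The file-based data passing described above can be sketched in a few lines. This is a minimal single-machine illustration, not the API of any actual MR framework: a temporary directory stands in for the shared cloud, and the job functions `map_job` and `reduce_job`, the file naming scheme, and the sample rows are all assumptions.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def map_job(job_id, rows, shared_dir):
    """Emit (region, amount) pairs for one input partition and save them as a file."""
    pairs = [[row["region"], row["amount"]] for row in rows]
    out = shared_dir / f"map-{job_id}.json"
    out.write_text(json.dumps(pairs))
    return out


def reduce_job(shared_dir):
    """Read every saved map output, sum the amounts per region, and save the totals."""
    totals = defaultdict(int)
    for path in sorted(shared_dir.glob("map-*.json")):
        for region, amount in json.loads(path.read_text()):
            totals[region] += amount
    out = shared_dir / "reduce-0.json"
    out.write_text(json.dumps(totals, sort_keys=True))
    return out


# Two input partitions, roughly: SELECT region, SUM(amount) ... GROUP BY region
partitions = [
    [{"region": "east", "amount": 10}, {"region": "west", "amount": 5}],
    [{"region": "east", "amount": 7}],
]

with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp)  # stands in for the distributed file system
    for i, rows in enumerate(partitions):
        map_job(i, rows, shared)
    result = json.loads(reduce_job(shared).read_text())
```

The reduce job never receives data directly from a map job; it only reads the files the map jobs left behind, which is the property that distinguishes this model from the node-to-node data flow of a parallel data warehouse.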
Since MR systems can be implemented with commodity machines, such systems are attractive from a cost perspective and can leverage existing computing infrastructures. In addition, MR systems handle processing node failures more gracefully than parallel data warehouses. For example, if a particular processing node in an MR cluster fails, there is no need to roll back the jobs completed by other processing nodes in the cluster because the results generated by those processing nodes are saved in the cloud; the only work that needs to be restarted is the specific job assigned to the failed processing node. On the other hand, one disadvantage of the MR model is that the process of saving job results imposes a performance penalty due to disk I/O; this penalty can be very significant if the result size for a particular job is large. Moreover, MR systems cannot dynamically reschedule jobs on different processing nodes once query execution begins, and thus suffer from the same limitations as parallel data warehouses when encountering data skew, fluctuating compute/memory resources, and other similar conditions at query runtime.
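The recovery behavior described above, in which only the failed job is rerun, can be sketched as follows. The helper `run_pending`, the `failing` parameter used to simulate a node failure, and the `.out` file naming are hypothetical; a temporary directory again stands in for the cloud.

```python
import tempfile
from pathlib import Path


def run_pending(jobs, shared_dir, failing=frozenset()):
    """Run every job that has no saved output; return (executed, failed) job names."""
    executed, failed = [], []
    for name, work in jobs.items():
        out = shared_dir / f"{name}.out"
        if out.exists():
            continue  # result was saved by an earlier attempt, so nothing to redo
        if name in failing:
            failed.append(name)  # simulated node failure: no output is written
            continue
        out.write_text(str(work()))
        executed.append(name)
    return executed, failed


jobs = {"map-0": lambda: 10, "map-1": lambda: 20, "map-2": lambda: 30}

with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp)
    first_run = run_pending(jobs, shared, failing={"map-1"})
    second_run = run_pending(jobs, shared)  # retry after the failure
    saved = sorted(p.name for p in shared.iterdir())
```

On the retry, only `map-1` is executed, because the outputs of `map-0` and `map-2` are already saved. The same sketch also shows where the cost lies: every job pays for a write to shared storage whether or not a failure ever occurs.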