The amount of data that needs to be stored and processed is exploding. This is partly due to the increased automation with which data can be produced (more business processes are becoming digitized), the proliferation of sensors and data-producing devices, Web-scale interactions with customers, and government compliance demands along with strategic corporate initiatives requiring more historical data to be kept online for analysis. Today, many organizations need to load more than a terabyte of structured data per day into a data management system and have data warehouses more than a petabyte in size. This phenomenon is often referred to as “Big Data.”
Today, the most scalable database management systems typically used to handle the “Big Data” flood use a “shared-nothing” architecture, in which data is partitioned across a potentially large cluster of cheap, independent commodity servers connected over a network. Queries and data processing jobs are sent to a single server in the cluster (depending on the particular implementation, either a “master node” or an arbitrary server in the cluster), where they are optimized (queries expressed in declarative languages like SQL allow more flexibility in the optimization process) and a plan is generated for how the different machines in the cluster can execute the query or job in parallel. This plan is then distributed to the participating machines in the cluster, each of which processes part of the query/job according to the plan.
The systems described in the above paragraph are often called “parallel database systems” or “massively parallel processing (MPP) database systems.” They achieve good performance by having the servers involved in query processing run in parallel. This is typically achieved by leveraging the fact that data has been divided across the servers in advance of the processing, so that each server can process the subset of the data that is stored locally. In some cases (such as filtering and local transformation operations), this processing can be done completely independently, without the servers having to communicate. In other cases (such as aggregation and join operations), the servers need to communicate with each other over the network in order to complete the operation. In such cases, the optimizer that runs in advance of query/job execution attempts to create a plan that minimizes this network communication, since such communication can often become a bottleneck and limit the scalability of execution.
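The distinction between independent local operations and operations that require communication can be illustrated with a minimal, single-process sketch. It is not the implementation of any particular parallel database system: the function names (`partition`, `local_filter`, `local_partial_sum`, `merge_partials`, `run_query`) are hypothetical, the “nodes” are plain Python lists, and round-robin partitioning is assumed purely for simplicity.

```python
from collections import Counter


def partition(rows, num_nodes):
    """Divide rows across nodes in advance of any query (round-robin, an assumption)."""
    nodes = [[] for _ in range(num_nodes)]
    for i, row in enumerate(rows):
        nodes[i % num_nodes].append(row)
    return nodes


def local_filter(node_rows, min_value):
    """Filtering needs no communication: each node inspects only its own rows."""
    return [(key, value) for key, value in node_rows if value >= min_value]


def local_partial_sum(node_rows):
    """Each node pre-aggregates locally so that less data crosses the network."""
    partial = Counter()
    for key, value in node_rows:
        partial[key] += value
    return partial


def merge_partials(partials):
    """Aggregation needs communication: partial sums from all nodes are combined."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)


def run_query(rows, num_nodes, min_value):
    """Simulate: SELECT key, SUM(value) WHERE value >= min_value GROUP BY key."""
    nodes = partition(rows, num_nodes)
    partials = [local_partial_sum(local_filter(n, min_value)) for n in nodes]
    return merge_partials(partials)


if __name__ == "__main__":
    rows = [("a", 5), ("b", 1), ("a", 7), ("b", 10), ("c", 3)]
    print(run_query(rows, num_nodes=2, min_value=3))
```

In this sketch only the small per-node partial sums are exchanged in the merge step, which mirrors the optimizer's goal of minimizing network communication.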
One important characteristic of the systems described above is that planning and optimization are completed entirely in advance of query/job execution. This reduces the ability of such systems to adjust to unexpected events such as a server failure or server slow-down in the middle of query execution. In essence, these systems assume that failure is a rare enough event that queries and jobs can simply be restarted upon a server failure. Furthermore, they assume that it is possible to get homogeneous or at least predictable performance across servers in the cluster, so that dynamic adjustment to “slowdown events” is unnecessary.
Unfortunately, while it is possible to get reasonably homogeneous or predictable performance across small numbers of machines (or “nodes”), it is nearly impossible to achieve this across hundreds or thousands of machines, even if each node runs on identical hardware or on an identically configured virtual machine. Part failures that do not cause complete node failure, but that result in degraded hardware performance, become more common at scale. Individual disk fragmentation and software configuration errors can also cause degraded performance on some nodes. Concurrent queries (or, in some cases, concurrent processes) further reduce the homogeneity of cluster performance. Furthermore, wild fluctuations in performance are common when running on virtual machines in cloud environments like Amazon EC2.
By doing query planning in advance, parallel database systems assign each node an amount of work to do based on the expected performance of that node. When running on small numbers of nodes, extreme outliers from expected performance are rare, and the lack of runtime task scheduling is not costly. At the scale of hundreds or thousands of nodes, extreme outliers are far more common, and query latency ends up being approximately equal to the time it takes these slow outliers to finish processing. Similarly, failures become statistically more common as the number of machines in the cluster increases, so the assumption that failures are rare no longer holds.
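The scaling argument can be made concrete with a small sketch. It assumes, purely for illustration, that nodes fail or slow down independently with a fixed per-node probability; the function names and the parameters (`p_node`, `slow_prob`, `slow_factor`) are hypothetical and are not taken from any measured system.

```python
import random


def prob_any_failure(p_node, num_nodes):
    """Probability that at least one of num_nodes independent nodes fails
    during a query, given a per-node failure probability p_node."""
    return 1.0 - (1.0 - p_node) ** num_nodes


def simulated_query_latency(num_nodes, base=1.0, slow_prob=0.01,
                            slow_factor=5.0, rng=None):
    """With a static plan, the query finishes only when the slowest node does,
    so the latency is the maximum of the per-node completion times."""
    rng = rng or random.Random(0)
    times = [
        base * (slow_factor if rng.random() < slow_prob else 1.0)
        for _ in range(num_nodes)
    ]
    return max(times)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, prob_any_failure(0.001, n), simulated_query_latency(n))
```

Under these assumptions, a per-node failure probability of 0.001 per query yields roughly a 63% chance of at least one failure across 1,000 nodes, and a single slow node is enough to set the latency of the whole query.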
Hence, as the size of the “Big Data” grows and more machines are needed in the shared-nothing cluster to process it, parallel (MPP) database systems, using their current architecture, become increasingly poorly suited to handle these large-scale processing tasks. One alternative is to use batch-processing systems such as Hadoop. While these systems are more fault tolerant and more adaptive than parallel database systems, their dynamic scheduling algorithms come at high overhead, leading to high job latencies (and hence the reputation of being “batch” systems rather than “real-time” systems).
Accordingly, what is needed is a parallel data processing system that combines fault tolerance, query adaptivity, and interactive query performance, and that overcomes the deficiencies of existing approaches.