Today, the amount of data that needs to be stored and processed by analytical database systems is exploding. This is partly due to the increased automation with which data can be produced (more business processes are becoming digitized), the proliferation of sensors and data-producing devices, Web-scale interactions with customers, and government compliance demands along with strategic corporate initiatives requiring more historical data to be kept online for analysis. It is no longer uncommon to hear of companies claiming to load more than a terabyte of structured data per day into their analytical database system and claiming data warehouses of size more than a petabyte.
To cope with this exploding data problem, conventional analytical database systems deploy database management software (“DBMS”) on a shared-nothing architecture (a collection of independent, possibly virtual, machines, each with local disk and local main memory, connected together on a high-speed network). This architecture scales the best, especially once hardware cost is taken into account. Furthermore, data analysis workloads tend to consist of many large scan operations, multidimensional aggregations, and star schema joins, all of which are easy to parallelize across nodes in a shared-nothing network.
Parallel databases have been proven to scale well into the tens of nodes (near linear scalability is not uncommon). However, there are few known parallel database deployments consisting of more than one hundred nodes and there does not appear to exist a parallel database with nodes numbering into the thousands. There are a variety of reasons why parallel databases generally do not scale well into the hundreds of nodes. First, failures become increasingly common as one adds more nodes to a system, yet parallel databases tend to be designed with the assumption that failures are a rare event. Second, parallel databases generally assume a homogeneous array of machines, yet it is nearly impossible to achieve pure homogeneity at such scale. Third, until recently, there have only been a handful of applications that required deployment on more than a few dozen nodes for reasonable performance, so parallel databases have not been tested at larger scales, and unforeseen engineering hurdles may await.
As the data that needs to be analyzed continues to grow, the number of applications that require more than one hundred nodes is multiplying. Conventional MapReduce-based systems used by Google, Inc., for example, perform data analysis at this scale, since they were designed from the beginning to scale to thousands of nodes in a shared-nothing architecture. MapReduce (or one of its publicly available versions, such as open source Hadoop (see, e.g., hadoop.apache.org/core)) is used to process structured data, and does so at a tremendous scale. For example, Hadoop is being used to manage Facebook's 2.5 petabyte (10^15 bytes) data warehouse.
MapReduce processes data that may be distributed (and replicated) across many nodes in a shared-nothing cluster via three basic operations. The MapReduce processing is performed as follows. First, a set of Map tasks are processed in parallel by each node in the cluster without communicating with other nodes. Next, data is repartitioned across all nodes of the cluster. Finally, a set of Reduce tasks are executed in parallel by each node on the partition it receives. This can be followed by an arbitrary number of additional Map-repartition-Reduce cycles as necessary. MapReduce does not create a detailed query execution plan that specifies which nodes will run which tasks in advance; instead, this is determined at runtime. This allows MapReduce to adjust to node failures and slow nodes on the fly by assigning more tasks to faster nodes and reassigning tasks from failed nodes. MapReduce also checkpoints the output of each Map task to local disk in order to minimize the amount of work that has to be redone upon a failure.
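The Map-repartition-Reduce cycle described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not any particular MapReduce implementation: the word-count job and the function names are assumptions chosen for clarity, and in a real cluster the Map and Reduce phases would run in parallel on separate nodes.

```python
from collections import defaultdict

# Illustrative Map function: emit (word, 1) pairs from one input split.
def map_task(line):
    return [(word, 1) for word in line.split()]

# Illustrative Reduce function: combine all values for one key.
def reduce_task(key, values):
    return (key, sum(values))

def mapreduce(inputs):
    # Map phase: each input split is processed independently (on a real
    # cluster, in parallel on separate nodes, without communication).
    intermediate = []
    for line in inputs:
        intermediate.extend(map_task(line))
    # Repartition (shuffle) phase: group intermediate values by key;
    # on a cluster this is the point where data crosses the network.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: each node reduces the partition it receives.
    return dict(reduce_task(k, v) for k, v in groups.items())

counts = mapreduce(["a b a", "b c"])  # → {"a": 2, "b": 2, "c": 1}
```

Additional Map-repartition-Reduce cycles would simply feed the output of one such pass into the next.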
MapReduce has good fault tolerance and the ability to operate in a heterogeneous environment. It achieves fault tolerance by detecting failures and reassigning Map tasks of failed nodes to other nodes in the cluster (preferably, nodes with replicas of the input Map data). It achieves the ability to operate in a heterogeneous environment via redundant task execution: tasks that are taking a long time to complete on slow nodes are redundantly executed on other nodes that have already completed their assigned tasks, and the time to complete the task becomes the time for the fastest node to finish the redundantly executed copy. By breaking jobs into small, granular tasks, the effect of faults and “straggler” nodes can be minimized.
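The effect of redundant (“speculative”) task execution can be stated very simply: the effective runtime of a straggling task is the minimum over all executing copies. The sketch below is illustrative, with hypothetical runtimes rather than measurements from any real system.

```python
def effective_runtime(copy_runtimes):
    # With redundant execution, the job proceeds as soon as the fastest
    # copy of the task finishes; slower copies are simply discarded.
    return min(copy_runtimes)

# Hypothetical runtimes (seconds) for the same task: the straggler node
# would take 120 s, but a fast node re-executing the task takes 15 s,
# so the task effectively completes in 15 s.
runtime = effective_runtime([120.0, 15.0])  # → 15.0
```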
MapReduce has a flexible query interface: Map and Reduce functions are arbitrary computations written in a general-purpose language. Therefore, each task can do anything with its input, as long as its output follows the conventions defined by the model. In general, most MapReduce-based systems (such as Hadoop, which directly implements the systems-level details of MapReduce) do not accept declarative SQL.
One of the issues with MapReduce is performance. By not requiring the user to first model and load data before processing, MapReduce forgoes many of the performance-enhancing tools, listed below, that are used by database systems. Traditional business data analytical processing, which involves standard reports and many repeated queries, is particularly poorly suited to the one-time query processing model of MapReduce. MapReduce lacks many of the features that have proven invaluable for structured data analysis workloads, and its immediate-gratification paradigm precludes some of the long-term benefits of first modeling and loading data before processing. These shortcomings can make MapReduce an order of magnitude slower than parallel databases.
Parallel database systems support standard relational tables and SQL, and implement many performance-enhancing techniques, including indexing, compression (and direct operation on compressed data), materialized views, result caching, and I/O sharing. Most (or even all) tables are partitioned over multiple nodes in a shared-nothing cluster; however, the mechanism by which data is partitioned is transparent to the end user. Parallel databases use an optimizer tailored for distributed workloads that turns SQL commands into a query plan whose execution is divided equally among multiple nodes. Parallel database systems can achieve high performance when administered by a highly skilled database administrator (“DBA”), who can carefully design, deploy, tune, and maintain the system; recent advances in automating these tasks and bundling the software into appliance (pre-tuned and pre-configured) offerings have reduced this administrative burden. Many conventional parallel database systems allow user-defined functions (“UDFs”), although the ability of the query planner and optimizer to parallelize UDFs well over a shared-nothing cluster varies across implementations.
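Transparent partitioning of the kind described above is commonly realized by hashing a table's partitioning key to pick a node. The sketch below is an illustrative assumption, not the scheme of any particular parallel DBMS; the key value and cluster size are hypothetical.

```python
import hashlib

def node_for(key, num_nodes):
    # Hash the row's partitioning key and map it to one of the nodes.
    # The mapping is deterministic, so rows sharing a key always land on
    # the same node and equi-joins on that key can run locally, without
    # repartitioning data across the network.
    digest = hashlib.sha1(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# The end user issues plain SQL; placement like this stays invisible.
node = node_for("customer_42", 8)
assert 0 <= node < 8
```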
Parallel database systems generally have lower fault tolerance and do not operate well in a heterogeneous environment. Although particular details of parallel database system implementations vary, their historical assumptions that failures are rare events and that “large” clusters mean dozens of nodes (instead of hundreds or thousands) have resulted in engineering decisions that make it difficult to achieve these properties. Further, in some cases, there is a clear tradeoff between fault tolerance and performance, and parallel database systems tend to choose the performance extreme of these tradeoffs. For example, frequent check-pointing of completed sub-tasks increases the fault tolerance of long-running queries, yet this check-pointing reduces performance. In addition, pipelining intermediate results between query operators can improve performance, but can result in a large amount of work being lost upon a failure. Hence, conventional parallel database systems present an inefficient solution to large data analysis.
Thus, there is a need to combine the scalability characteristics of MapReduce with the performance and efficiency of parallel databases to achieve a system that can effectively handle data-intensive applications. There is further a need for a data processing system capable of efficiently processing large amounts of data by directing query processing into higher-performing single-node databases.