Large-scale data processing involves extracting data of interest from a set of raw data located in one or more databases and then processing the extracted data into a desired form. Extraction is accomplished through requests, or queries, which are executed on the stored data. Because significant portions of data are extracted from large amounts of stored data and subsequently formatted, such queries are typically complex and require development of a particular execution plan as well as considerable processing time. Many conventional techniques are known for processing such queries. One of these is the MapReduce technique.
In order to execute a complex data warehousing query in a MapReduce-based system, the query needs to be translated into a series of MapReduce (“MR”) jobs. Since each MR job typically involves many input/output (“I/O”) operations and network transfers, an efficient query execution plan is typically composed of as few MR jobs as possible. Moreover, each job attempts to minimize the amount of data written to disk or sent over the network.
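As an illustration of how one query step maps onto a single MR job, the following minimal sketch simulates, in memory and on one machine, a job that computes the equivalent of a grouped sum (total amount per region). It is not Hadoop code; the function names (map_phase, shuffle, reduce_phase, run_job) and the sample data are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Emit (key, value) pairs, like the map side of a grouped sum."""
    for region, amount in records:
        yield region, amount


def shuffle(pairs):
    """Group emitted values by key; stands in for the network shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records):
    """Run one simulated MR job: map, then shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(records)))


# Hypothetical input rows of the form (region, amount).
sales = [("east", 10), ("west", 5), ("east", 7)]
totals = run_job(sales)
print(totals)
```

In a real deployment the shuffle step moves data between nodes, which is why a plan that needs fewer jobs, and less data per job, tends to be cheaper.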
Some conventional systems employ MapReduce, an extremely popular framework for performing scalable parallel advanced analytics and data mining. Despite the fact that there is nothing fundamentally new about the technology, the availability of a free and open source implementation (see, e.g., http://hadoop.apache.org (“Hadoop”)), along with its heavy utilization and evangelization by two of the largest Web companies in the world (Google and Yahoo), stellar performance on extreme-scale benchmarks, and impressive ease of use, has led to its rapid adoption for many different kinds of data analysis and data processing.
Historically, the main applications of the MapReduce framework were in Web indexing, text analytics, and graph data mining. However, as MapReduce continues its steady progression towards becoming the de facto data analysis standard, it has started to be used for structured data analysis tasks traditionally dominated by relational databases in data warehouse deployments. Even though many argue that MapReduce is not optimal for structured data analysis tasks, it is nonetheless being used increasingly frequently for these tasks due to the desire to unify the data management platform. Thus, standard structured data analysis can proceed side by side with the complex analytics for which MapReduce is well suited, while benefiting from the superior scalability and lower price of MapReduce. For example, Facebook famously ran a proof of concept comparing multiple parallel relational database vendors before deciding to run its 2.5 petabyte clickstream data warehouse using Hadoop instead.
Consequently, there has been a significant amount of research and commercial activity in recent years with the goal of integrating MapReduce and relational database technology. This activity can be divided into two main directions: (1) starting with a parallel database system and adding MapReduce technology (or at least a MapReduce interface), and (2) starting with MapReduce (typically the Hadoop implementation) and adding database system technology.
However, there have been many performance problems with Hadoop systems when applied to structured data because of a suboptimal storage layer. The default Hadoop system's storage layer is the Hadoop distributed file system (“HDFS”). An open-source data warehousing infrastructure has been built on top of Hadoop (see, e.g., http://hadoop.apache.org/hive (“Hive”)). Facebook, which was the creator and main user of Hive, currently manages a dataset of over 700 TB (before replication), with 5 TB added daily. Over 7500 requests (or jobs) are submitted each day to analyze more than 75 TB of compressed data. Hive provides tools that enable data summarization, ad hoc querying and analysis of detail data, as well as a mechanism to impose structure on the data. In addition, it provides a simple query language called QL or HiveQL, which is based on SQL and enables users familiar with SQL to do ad hoc querying, summarization and data analysis. At the same time, this language also allows traditional MapReduce programmers to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language. Hive accepts queries expressed in HiveQL and executes them against data stored in HDFS. The relational mapping over the data is maintained in a system catalog called Metastore.
Hive's query compiler processes HiveQL statements in a series of steps. First, the query is parsed and validated against metadata (table definitions and data types). Next, the resulting operator directed acyclic graph (“DAG”) is transformed by the optimizer. Hive supports the following rule-based transformations: column pruning, predicate pushdown, join reordering, and partition pruning. After optimization, the logical query plan is translated into a physical plan, namely a series of MapReduce jobs and HDFS tasks. Hive's query executor coordinates the execution of each stage of the query plan. Custom operations, such as map-side joins, hash-based partial aggregations, and repartitioned group-by to handle skew, are applied at runtime when appropriate. Intermediate data are stored in HDFS as temporary tables. A major limitation of the Hive data warehouse is its data storage layer. Because it employs a distributed file system, Hive is unable to utilize hash-partitioning and collocation of related tables, a typical strategy that parallel databases exploit to minimize data movement across nodes. Moreover, Hive workloads are very I/O heavy due to the lack of indexing. Furthermore, the system catalog lacks statistics on data distribution, and Hive's optimizer is quite unsophisticated because no cost-based or adaptive algorithms are applied.
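The benefit of hash-partitioning and collocation referred to above can be sketched as follows. This is a single-process illustration only, assuming four hypothetical nodes and two hypothetical tables (customers and orders) partitioned on the same join key with the same hash function; it does not describe how any particular parallel database is implemented.

```python
import zlib

NUM_NODES = 4  # hypothetical cluster size, chosen for illustration


def node_for(key):
    """Map a join key to a node using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


def partition(rows, key_index):
    """Place each row on the node chosen by hashing its join key."""
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[node_for(row[key_index])].append(row)
    return nodes


def local_join(customer_rows, order_rows):
    """Join orders to customers using only rows stored on one node."""
    names = {cust_id: name for cust_id, name in customer_rows}
    return [
        (order_id, names[cust_id], amount)
        for order_id, cust_id, amount in order_rows
        if cust_id in names
    ]


# Hypothetical tables: (customer id, name) and (order id, customer id, amount).
customers = [("c1", "Ann"), ("c2", "Bob"), ("c3", "Cy")]
orders = [("o1", "c1", 30), ("o2", "c3", 12), ("o3", "c1", 8)]

# Both tables are partitioned on the customer id with the same function,
# so matching rows always land on the same node.
customer_parts = partition(customers, 0)
order_parts = partition(orders, 1)

joined = [
    row
    for n in range(NUM_NODES)
    for row in local_join(customer_parts[n], order_parts[n])
]
print(sorted(joined))
```

Because every order is stored on the same node as its customer, each node completes its part of the join without receiving rows from other nodes. When the files are placed without regard to the join key, as in a plain distributed file system, the matching rows generally have to be moved across the network first.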
Thus, there is a need for more efficient data processing systems and methods for obtaining large-size data from even bigger data sets stored in databases through execution of requests or queries.