The APACHE HADOOP™ project (hereinafter “HADOOP™”) is an open-source software framework for developing software for reliable, scalable and distributed processing of large data sets across clusters of commodity machines. HADOOP™ includes a distributed file system, known as HADOOP DISTRIBUTED FILE SYSTEM (HDFS™). HDFS™ links together the file systems on local nodes to form a unified file system that spans the entire HADOOP™ cluster. HADOOP™ also includes HADOOP™ YARN, which provides a framework for job scheduling and cluster resource management and is utilized by a programming framework known as MapReduce. HADOOP™ is also supplemented by other Apache projects including APACHE HIVE™ (hereinafter “HIVE™”) and APACHE HBASE™ (hereinafter “HBASE™”). HIVE™ is a data warehouse infrastructure that provides data summarization and ad hoc querying. HBASE™ is a scalable, distributed NoSQL (No Structured Query Language) database or data store that supports structured data storage for large tables.
MapReduce processes data in parallel by dividing a job into smaller sub-problems (the “map” step) and assigning them to worker nodes in a cluster. The worker nodes process the sub-problems and return the results, which are then combined (the “reduce” step) into an output that is passed on as the solution. MapReduce is a batch processing framework and is optimized for processing large amounts of data in parallel by distributing the workload across different machines. MapReduce offers advantages including fault tolerance, but also suffers from severe disadvantages such as high latency.
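By way of illustration only, the map and reduce steps described above can be sketched in a few lines of Python. This is a single-process sketch and not the HADOOP™ API: the function names (map_chunk, shuffle, reduce_counts, word_count) are hypothetical, a thread pool stands in for the worker nodes of a cluster, and word counting is assumed as the example job.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group the emitted values by key so that each key is reduced once."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Each chunk is an independent sub-problem handed to a worker.
    # Assumption: threads stand in for the worker nodes of a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_chunk, chunks))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "the fox"]))
```

In this sketch the per-chunk map calls are independent of one another, which is what allows the workload to be distributed, while the reduce step needs every value for a given key before it can produce that key's result.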
The latency in MapReduce is a result of its batch-oriented map/reduce model. During execution, the output of the “map” phase serves as the input for the “reduce” phase, so the “reduce” phase cannot complete before the “map” phase is complete. Furthermore, all the intermediate data is stored on disk before being transferred to the reducer. For these reasons, MapReduce adds latency that can cause even a simple query started through MapReduce to take a long time to execute.
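The two sources of latency noted above, namely the dependency of the “reduce” phase on the “map” phase and the writing of intermediate data to disk, can be illustrated with the following simplified Python sketch. The names run_map_phase, run_reduce_phase and run_job are hypothetical, JSON files in a temporary directory are an assumed stand-in for the intermediate files, and the sketch is stricter than the description above in that the reducer does not even start until every map task has finished.

```python
import json
import os
import tempfile


def run_map_phase(chunks, spill_dir):
    """Run every map task and write its output to a file on disk."""
    spill_files = []
    for index, chunk in enumerate(chunks):
        pairs = [[word, 1] for word in chunk.split()]
        path = os.path.join(spill_dir, f"map-{index}.json")
        with open(path, "w", encoding="utf-8") as handle:
            json.dump(pairs, handle)  # intermediate data goes to disk
        spill_files.append(path)
    return spill_files


def run_reduce_phase(spill_files):
    """Read the intermediate files back from disk and combine them."""
    counts = {}
    for path in spill_files:
        with open(path, encoding="utf-8") as handle:
            for word, value in json.load(handle):
                counts[word] = counts.get(word, 0) + value
    return counts


def run_job(chunks):
    with tempfile.TemporaryDirectory() as spill_dir:
        # Barrier: run_map_phase returns only after all map tasks finish,
        # so the reduce phase cannot begin any earlier than that.
        spill_files = run_map_phase(chunks, spill_dir)
        return run_reduce_phase(spill_files)


if __name__ == "__main__":
    print(run_job(["a b a", "b c"]))
```

Even for a very small input, the job in this sketch pays for a full write to disk and a full read from disk between the two phases, which corresponds to the fixed overhead that makes simple queries slow.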
HIVE™ is a framework that sits on top of MapReduce. HIVE™ translates a query language resembling Structured Query Language (SQL) into MapReduce code, making data access in a HADOOP™ cluster much easier for users. However, HIVE™ still uses MapReduce as its execution engine under the covers and therefore inherits all the disadvantages of MapReduce. As a result, even simple HIVE™ queries can take a long time to execute.
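As a rough illustration of what such a translation amounts to, the following Python sketch expresses one SQL-like aggregation as a map function and a reduce function. It is a hand-written example and does not represent the code that HIVE™ actually generates; the query, the table name employees, the column name dept and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical query, shown only for illustration:
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept


def map_row(row):
    """Map: emit the GROUP BY key together with a count of 1."""
    return row["dept"], 1


def reduce_group(dept, ones):
    """Reduce: COUNT(*) for one group is the sum of its emitted 1s."""
    return dept, sum(ones)


def run_group_by_count(rows):
    # Group the mapped pairs by key, then reduce each group once.
    grouped = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        grouped[key].append(value)
    return dict(reduce_group(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    employees = [{"dept": "eng"}, {"dept": "ops"}, {"dept": "eng"}]
    print(run_group_by_count(employees))
```

Because the query is ultimately executed as map and reduce steps of this kind, it is subject to the same batch-oriented latency as any other MapReduce job.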