Developments in computer and networking technology have given rise to applications that require massive amounts of data storage. For example, tens of millions of users can create web pages and upload images and text to a social media website. Consequently, a social media website can accumulate massive amounts of data each day and therefore needs a highly scalable system for storing and processing data. Various tools exist to facilitate such mass data storage.
Frameworks exist that support large-scale data-intensive distributed applications, by enabling applications to interact with a cluster of thousands of computers (also referred to as nodes) and petabytes of data. For instance, a framework called Hadoop utilizes a distributed, scalable, portable file system, called Hadoop Distributed File System (HDFS), to distribute a massive amount of data among data nodes (also referred to as slave nodes) in a Hadoop cluster. In order to reduce the adverse impact of a data node power outage or network failure (including switch failure), data in an HDFS is typically replicated on different data nodes.
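The effect of replicating data on different data nodes can be illustrated with a minimal sketch. The round-robin placement below, along with the names `place_replicas`, `readable_blocks`, and the node and block identifiers, are assumptions made for illustration only; this is not the placement policy that HDFS itself uses.

```python
from typing import Dict, List, Set


def place_replicas(block_ids: List[str], nodes: List[str],
                   replication: int = 3) -> Dict[str, List[str]]:
    """Assign each block to `replication` distinct nodes (hypothetical round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive nodes modulo the cluster size are always distinct
        # because replication <= len(nodes).
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement: Dict[str, List[str]], failed: Set[str]) -> Set[str]:
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


# Hypothetical four-node cluster holding three blocks.
NODES = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(["blk-0", "blk-1", "blk-2"], NODES, replication=3)

# A single node failure leaves every block readable from another replica.
survivors = readable_blocks(placement, failed={"node-b"})
```

With three replicas per block, a block becomes unreadable only when all three of its holders fail at the same time, which is why replication reduces the impact of a single power outage or switch failure.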
Hive, an open source data warehouse system, was developed to run on top of Hadoop clusters. Hive supports data queries expressed in a structured query language (SQL)-like declarative language called HiveQL. The Hive system compiles the queries expressed in HiveQL into map-reduce jobs, organized as a directed acyclic graph, that can be executed on the Hadoop cluster. The HiveQL language includes a type system that supports tables containing primitive types, collections such as arrays and maps, and nested compositions of types. In addition, the Hive system includes a system catalog, called Hive Metastore, containing schemas and statistics, which is useful in data exploration and query optimization.
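To make the idea of compiling a declarative query into map-reduce stages more concrete, the following sketch shows how a grouped count (conceptually, counting rows per page) could be expressed as a map phase, a shuffle, and a reduce phase. It is a single-process illustration under stated assumptions, not the Hive compiler or Hadoop's execution engine; the table contents and the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_query` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical rows of a page-view table: (user_id, page).
ROWS = [
    ("u1", "home"),
    ("u2", "home"),
    ("u1", "profile"),
    ("u3", "home"),
]


def map_phase(rows: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """Emit one (page, 1) pair per input row."""
    for _user, page in rows:
        yield page, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as happens between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values of each key to obtain the per-page count."""
    return {key: sum(values) for key, values in groups.items()}


def run_query(rows: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Chain the stages; a more complex query would chain several such jobs."""
    return reduce_phase(shuffle(map_phase(rows)))


result = run_query(ROWS)
```

A query with joins or nested aggregations would require several such jobs, with the output of one job feeding the next, which is where the directed acyclic graph of jobs arises.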
Coupled with the Hadoop cluster, the Hive system can store and analyze large amounts of data for a social networking system. For example, the Hive system can analyze the degree of connection between users to rank stories that users follow on the social networking system. The Hive system can analyze activity logs to gain insights into how services of the social networking system are being used to help application developers, page administrators and advertisers make development and business decisions. The Hive system can run complex data mining programs to optimize the advertisements shown to the users of the social networking system. The Hive system can further analyze the usage logs to identify spam and abuse of the social networking system.
The Hive system includes web-based tools for people without programming ability to author and execute Hive queries, for authoring, debugging and scheduling complex data pipelines, and for generating reports based on data stored in the Hive system and other relational databases like MySQL and Oracle.
However, query latency for the Hive system is usually high. Due to the large amount of data and the map-reduce scheme of the Hadoop cluster, even the simplest query can take from several seconds to minutes to complete. This is particularly a problem for interactive analyses, in which an operator needs the result of the current query to decide the next query in a series of queries. The latency problem significantly affects the productivity of analysts, since they cannot determine the next query while waiting for the result of the current query.
One possible workaround is to create data pipelines that load aggregate data from Hive into other types of relational database management systems (RDBMSs) such as MySQL and Oracle. The operator then performs interactive analysis and builds reports using these RDBMSs. However, each RDBMS needs a separate data pipeline, and it takes time for each data pipeline to transfer the aggregate data from Hive to the RDBMS. Thus, this workaround is still cumbersome and inconvenient.
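The workaround described above can be sketched as follows. This is an illustration only: an in-memory SQLite database from Python's standard library stands in for an RDBMS such as MySQL or Oracle, and the aggregate rows, the table name `daily_views`, and the function `load_aggregates` are hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple

# Hypothetical pre-aggregated output of a Hive job: (day, page, views).
AGGREGATES = [
    ("2024-01-01", "home", 120),
    ("2024-01-01", "profile", 45),
    ("2024-01-02", "home", 150),
]


def load_aggregates(conn: sqlite3.Connection,
                    rows: Iterable[Tuple[str, str, int]]) -> None:
    """Pipeline step: copy aggregate rows into the target relational database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_views (day TEXT, page TEXT, views INTEGER)"
    )
    conn.executemany("INSERT INTO daily_views VALUES (?, ?, ?)", rows)
    conn.commit()


# SQLite is used here only as a stand-in for the target RDBMS.
conn = sqlite3.connect(":memory:")
load_aggregates(conn, AGGREGATES)

# Once loaded, interactive queries run against the small aggregate table
# instead of the full data set stored in the cluster.
home_total = conn.execute(
    "SELECT SUM(views) FROM daily_views WHERE page = ?", ("home",)
).fetchone()[0]
```

The sketch also shows the cost noted above: the load step must be written and run separately for every target database, and the analyst can only query what the pipeline has already transferred.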