Developments in computer and networking technology have given rise to applications that require massive amounts of data storage. For example, tens of millions of users can create web pages and upload images and text to a social media website. Consequently, a social media website can accumulate massive amounts of data each day and therefore needs a highly scalable system for storing and processing data. Various tools exist to facilitate such mass data storage.
Front end clusters of such social media websites monitor user activities and produce log data based on the activities of social media users. The front end clusters transmit the log data to a centralized storage filer or a data warehouse. The centralized storage filer or data warehouse organizes the received log data and responds to requests from data processing applications. In order to accommodate the massive amounts of log data, large-scale data warehouses are commonly used to store the log data and service the data-intensive inquiries from the data processing applications.
Frameworks exist that support large-scale data-intensive distributed applications, by enabling applications to interact with a cluster of thousands of computers (also referred to as nodes) and petabytes of data. For instance, a framework called Hadoop utilizes a distributed, scalable, portable file system, called Hadoop Distributed File System (HDFS), to distribute a massive amount of data among data nodes (also referred to as slave nodes) in a Hadoop cluster. In order to reduce the adverse impact of a data node power outage or network failure (including switch failure), data in an HDFS is typically replicated on different data nodes.
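The benefit of replicating data on different data nodes can be illustrated with a minimal sketch. The function names `place_replicas` and `surviving_blocks`, the round-robin placement, and the default replication factor of 3 are assumptions made for illustration only; they do not reproduce the placement policy that HDFS itself applies.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct data nodes (round-robin).

    Hypothetical helper for illustration; not the HDFS placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of data nodes")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive offsets modulo len(nodes) are distinct because
        # replication <= len(nodes).
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def surviving_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    failed = set(failed_nodes)
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed for node in replicas)
    }
```

Under these assumptions, a block stored on three distinct data nodes remains readable after any two of those nodes fail, which is the property that replication is intended to provide.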
Hive, an open source data warehouse system, was developed to run on top of Hadoop clusters. Hive supports data queries expressed in a structured query language (SQL)-like declarative language called HiveQL. The Hive system compiles the queries expressed in HiveQL into map-reduce jobs, organized as a directed acyclic graph, that can be executed on the Hadoop cluster. The HiveQL language includes a type system that supports tables containing primitive types, collections such as arrays and maps, and nested compositions of types. In addition, the Hive system includes a system catalog, called Hive-Metastore, containing schemas and statistics, which is useful in data exploration and query optimization.
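The map-reduce pattern into which such a declarative query is compiled can be sketched in a few lines. The example below is only an in-memory illustration of the pattern for a grouped count; the table name `activity_log`, the column `user_id`, and all function names are hypothetical, and the code does not represent the output of the Hive compiler.

```python
from collections import defaultdict

# Illustrative query (hypothetical table and column names):
#   SELECT user_id, COUNT(*) FROM activity_log GROUP BY user_id


def map_phase(rows):
    """Emit one (key, 1) pair per input row."""
    for row in rows:
        yield row["user_id"], 1


def shuffle(pairs):
    """Group the emitted values by key, as done between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_user(rows):
    """Run the three phases in sequence on an in-memory list of rows."""
    return reduce_phase(shuffle(map_phase(rows)))
```

On a real cluster the map and reduce phases would run in parallel on many data nodes; the sketch runs them sequentially only to make the data flow visible.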
Coupled with the Hadoop cluster, the Hive system can store and analyze large amounts of data for a social networking system. For example, the Hive system can analyze the degree of connection between users to rank stories that users follow on the social networking system. The Hive system can analyze activity logs to gain insights into how services of the social networking system are being used, which helps application developers, page administrators and advertisers make development and business decisions. The Hive system can run complex data mining programs to optimize the advertisements shown to the users of the social networking system. The Hive system can further analyze the usage logs to identify spam and abuse of the social networking system.
The Hive system includes web-based tools for people without programming ability to author and execute Hive queries, for authoring, debugging and scheduling complex data pipelines, and for generating reports based on data stored in the Hive system and other relational databases like MySQL and Oracle.
However, the front end clusters send the captured log data to the centralized data warehouse periodically, instead of in real time. Furthermore, it takes time for the data warehouse to organize the received log data before the data warehouse is able to respond to data inquiries for this log data. Therefore, the log data in the data warehouse becomes available only some time after the log data was captured. This delay can be an hour or even a day. As a result, the data processing and consuming applications can access the log data only with significant latency.
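The two sources of delay described above, periodic transmission and organizing time, can be combined in a small worked sketch. The function `available_at` and its parameters are hypothetical; the sketch assumes that uploads occur at whole multiples of a fixed interval and that organizing takes a fixed amount of time, with all times given in integer seconds.

```python
def available_at(captured_at, batch_interval, organize_time):
    """Return the time at which a log entry becomes queryable.

    Assumes uploads happen at multiples of `batch_interval` and that the
    warehouse needs `organize_time` after each upload (integer seconds).
    """
    # Round captured_at up to the next upload boundary (ceiling division).
    next_upload = -(-captured_at // batch_interval) * batch_interval
    return next_upload + organize_time


def latency(captured_at, batch_interval, organize_time):
    """Delay between capture and availability, under the same assumptions."""
    return available_at(captured_at, batch_interval, organize_time) - captured_at
```

With an hourly upload and ten minutes of organizing time, an entry captured just after an upload waits nearly the full interval plus the organizing time before it can be queried.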
Furthermore, the centralized data warehouse needs to maintain connections with the front end servers in order to continuously receive the log data. In a modern social network, the front end servers can number in the thousands or more. The data warehouse therefore carries a significant burden of maintaining these connections, and this burden degrades the overall performance of the data warehouse.