Data processing frameworks are available for distributed storage and processing of large sets of data on commodity hardware. One such data processing framework is Apache Hadoop™ which is an open source framework for distributed storage and processing of large sets of data on commodity hardware. The Hadoop framework can be used to process large amounts of data in a massively parallel or other distributed manner thereby enabling businesses to quickly gain insight from massive amounts of both structured as well as unstructured data.
The Hadoop framework is enterprise ready in part because it provides for the storage and processing of vast quantities of data in a storage layer that scales linearly. In this regard, Hadoop Distributed File System (HDFS) is a technology providing for efficient scale out of a storage layer. Environments such as HDFS provide a fault-tolerant environment which is designed to be deployed within a distributed computing infrastructure, using relatively low-cost commodity hardware. Such an environment provides high throughput access to application data, and is particularly suitable for applications that have very large data sets (e.g., machine learning).
Many specialized engines are available for enabling interaction with the data in a wide variety of ways including batch access, real-time access, and combinations of batch and real-time access. Apache Hive™ is the most widely adopted technology for accessing massive amounts of data such as might be organized and stored in Hadoop and is, essentially, a data warehouse having tables similar to tables in a relational database. Engines such as Hive enable easy data summarization and ad-hoc/interactive queries via a structured query language (SQL) like interface for large datasets (e.g., petabytes of data) stored in HDFS.
Table and storage management interface layers such as the HCatalog interface enable users with different data processing tools to more easily read and write data relative to the Hadoop environment. HCatalog uses Hive's command line interface for issuing commands to define data and to explore metadata.
Latency for Hive queries, however, is generally very high even for relatively small data sets owing in part to the batch processing of Hadoop jobs which can at times incur substantial overheads in job submission and scheduling. In addition, tools such as HIVE that use a database abstraction layer such as HCatalog can divide a query into multiple pieces and execute them separately against a database. However, these queries are not executed atomically but instead are executed at different points in time. As such, the results of each query when combined could violate the read-consistent rule relative to database retrieval protocol rules.