The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the background description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Traditional database systems feature a query execution engine that is tightly integrated with the underlying storage back-end, which typically consists of block-addressable persistent storage devices with no compute capabilities. These devices (hard disk drives and/or solid state drives) are characterized by (a) access times that differ significantly depending on whether the data is accessed sequentially or randomly, (b) access units that have a fixed minimum size, set at the granularity of a block, and (c) access times that are orders of magnitude slower than those of main memory. These characteristics, along with the assumption that the storage back-end does not have any non-trivial compute capabilities, have had an important impact on the design of database systems, from storage management to query execution to query optimization.
Databases originally served as operational stores managing the day-to-day activities of businesses. As database technology improved both in performance and cost, businesses saw a need to keep an increasing amount of operational history and business state for later analysis. Such analyses help businesses gain insight into their processes and optimize them, thereby providing a competitive advantage and increasing profit.
Data warehousing arose out of this need. Business data is often well-structured, fitting easily into relational tables. Data warehouses are essentially scalable relational database systems offering a structured query language (SQL) for offline analysis of this business data, and optimized for read-mostly workloads. For example, data warehouses include traditional systems like Teradata and newer vendors such as Vertica, Greenplum, and Aster Data. They provide a SQL interface, indexes, and fast columnar access.
Typically, data warehouses are loaded periodically, e.g., nightly or weekly, with data ingested from various sources and operational systems. The process of cleaning, curating, and unifying this data into a single schema and loading it into a warehouse is known as extract-transform-load (ETL). As the variety of sources and data increases, the complexity of the ETL process also increases. Successfully implementing ETL, including defining appropriate schemas and matching input data to the predetermined schemas, can take professionals weeks to months, and changes can be hard or impossible to implement. There are a number of tools on the market, such as Ab Initio, Informatica, and Pentaho, to assist with the ETL process. However, the ETL process generally remains cumbersome, brittle, and expensive.
The data analytics market has exploded with a number of business intelligence and visualization tools that make it easy for business users to perform ad hoc, iterative analyses of data in warehouses. Business intelligence tools build multidimensional aggregates of warehouse data and allow users to navigate through and view various slices and projections of this data. For example, a business user might want to see total monthly sales by product category, region, and store. Then, the user might want to drill down to weekly sales for specific categories or roll up to see sales for the entire country. Multidimensional aggregates may also be referred to as online analytical processing (OLAP) cubes. A number of business intelligence (BI) tools, such as Business Objects and Cognos, enable such analyses, and support a language called Multidimensional Expressions (MDX) for querying cubes. There are also a number of visualization tools, such as MicroStrategy, Tableau, and Spotfire, that allow business users to intuitively navigate these cubes and data warehouses.
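As an illustration only, the slicing, drill-down, and roll-up operations described above can be sketched in a few lines of Python. The sales records, the field names, and the `aggregate` helper are hypothetical and are not taken from any particular BI tool; the sketch simply sums a measure over a chosen set of dimensions.

```python
from collections import defaultdict

# Hypothetical fact rows: (month, week, category, region, store, amount).
FIELDS = ("month", "week", "category", "region", "store", "amount")
SALES = [
    ("2023-01", 1, "toys", "west", "store-1", 100.0),
    ("2023-01", 2, "toys", "west", "store-1", 50.0),
    ("2023-01", 2, "toys", "east", "store-2", 70.0),
    ("2023-01", 3, "books", "west", "store-1", 30.0),
    ("2023-02", 5, "toys", "east", "store-2", 20.0),
]


def aggregate(rows, dims, where=None):
    """Sum `amount` grouped by the named dimensions, optionally filtered."""
    totals = defaultdict(float)
    for row in rows:
        rec = dict(zip(FIELDS, row))
        if where is not None and not where(rec):
            continue
        totals[tuple(rec[d] for d in dims)] += rec["amount"]
    return dict(totals)


# Monthly sales by product category, region, and store.
monthly = aggregate(SALES, ("month", "category", "region", "store"))

# Drill down: weekly sales for one specific category.
weekly_toys = aggregate(SALES, ("week",), where=lambda r: r["category"] == "toys")

# Roll up: monthly sales for the entire country (all regions and stores).
country = aggregate(SALES, ("month",))
```

A cube engine would typically precompute and index such aggregates rather than rescanning the fact rows for each request, which is what this sketch does.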
More recently, the type of data that businesses want to analyze has changed. As traditional brick and mortar businesses go online and new online businesses form, these businesses need to analyze the types of data that leading companies, such as Google and Yahoo, are inundated with. These include data types such as web pages, logs of page views, click streams, RSS (Rich Site Summary) feeds, application logs, application server logs, system logs, transaction logs, sensor data, social network feeds, news feeds, and blog posts.
These semi-structured data do not fit well into traditional warehouses. They have some inherent structure, but the structure may be inconsistent. The structure can change quickly over time and may vary across different sources. They are not naturally tabular, and the analyses that users want to run over these data—clustering, classification, prediction, and so on—are not easily expressed with SQL. The existing tools for making effective use of these data are cumbersome and insufficient.
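To make the point about inconsistent structure concrete, the following Python sketch scans a handful of hypothetical JSON log records and reports which types were observed at each field path. The sample records and the `observed_types` helper are assumptions made for illustration; they do not describe any existing tool.

```python
import json
from collections import defaultdict

# Hypothetical log records whose structure varies from record to record.
RAW = [
    '{"user": "a", "ts": 1, "geo": {"lat": 1.5, "lon": 2.5}}',
    '{"user": "b", "ts": "2013-01-01", "tags": ["x", "y"]}',
    '{"user": "c", "ts": 3}',
]


def observed_types(lines):
    """Map each dotted field path to the set of Python type names seen there."""
    seen = defaultdict(set)

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into nested objects, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            seen[path].add(type(value).__name__)

    for line in lines:
        walk(json.loads(line), "")
    return dict(seen)


types = observed_types(RAW)
```

In this sample, `ts` appears both as an integer and as a string, while `geo` and `tags` each appear in only one record, so no single fixed relational schema describes all three records without nulls or type coercion.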
As a result, a new highly scalable storage and analysis platform arose, Hadoop, inspired by the technologies implemented at Google for managing web crawls and searches. At its core, Hadoop offers a clustered file system for reliably storing its data, HDFS (Hadoop Distributed File System), and a rudimentary parallel analysis engine, MapReduce, to support more complex analyses. Starting with these pieces, the Hadoop ecosystem has grown to include an indexed, operational store, HBase, and new query interfaces, Pig and Hive, that rely on MapReduce.
Hive is an Apache project that adds a query layer on top of Hadoop, without the query optimization, caching, and indexing found in traditional warehouses. Instead, Hive simply turns queries in a SQL-like language (called Hive-QL) into MapReduce jobs to be run against the Hadoop cluster. There are three main problems with Hive for traditional business users. First, Hive does not support standard SQL. Second, Hive does not have a dynamic schema. Third, Hive is not fast enough to allow interactive queries, since each Hive query requires a MapReduce job that re-parses all the source data, and often requires multiple passes through the source data.
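The cost of re-parsing the source data on every query can be illustrated with a toy, single-process imitation of the map, shuffle, and reduce phases. This is a sketch under stated assumptions, not Hive's or Hadoop's actual implementation: the log format and the function names are hypothetical, and the query being imitated is a simple count of page views grouped by page.

```python
from collections import defaultdict

# Hypothetical raw log lines: "<date> <page> <user>".
LOG_LINES = [
    "2013-01-01 /home alice",
    "2013-01-01 /cart bob",
    "2013-01-02 /home carol",
]


def map_phase(line):
    """Parse one raw line and emit a (page, 1) pair."""
    _date, page, _user = line.split()
    yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_query(lines):
    """Run the full job; every call re-parses every input line."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = run_query(LOG_LINES)
```

Because nothing is indexed or cached between calls, issuing a second query over the same data repeats the full parse, which is the behavior that makes interactive use impractical at scale.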
Impala is a real-time engine for Hive-QL queries on Cloudera's Hadoop implementation. It provides analysis over Hive's sequence files and may eventually support nested models. However, it does not have a dynamic schema, instead requiring that a user still provide a schema upfront for the data to be queried.
Pig is another Apache project and offers a schema-free scripting language for processing log files in Hadoop. Pig, like Hive, translates everything into MapReduce jobs. Likewise, it does not leverage any indexes, and is not fast enough for interactive use.
Jaql is a schema-free declarative language (in contrast to schema-bound declarative languages, like SQL) for analyzing JavaScript Object Notation (JSON) logs. Like Pig, it compiles into MapReduce programs on Hadoop, and shares many of the same drawbacks, including non-interactive speed.
Hadoop itself is being adopted quickly, and is readily available in the cloud. Amazon offers Elastic MapReduce, which may be effectively equivalent to Hadoop's MapReduce implementation running in the cloud. It works on data stored in Amazon's cloud-based S3 (Simple Storage Service) and outputs results to S3.
The advantages of the Hadoop ecosystem are threefold. First, the system scales to extreme sizes and can store any data type. Second, it is extremely low cost compared to traditional warehouses (as much as twenty times less expensive). Third, it is open-source, which avoids lock-in with a single vendor. Users want the ability to pick the right tool for the right job and avoid moving data between systems to get their job done. Although Hadoop is more flexible, using Hadoop requires specially skilled administrators and programmers with deep knowledge, who are usually hard to find. Moreover, Hadoop is too slow to be interactive. Even the simplest queries take minutes to hours to execute.
Dremel is a tool developed internally at Google, which provides SQL-based analysis queries over nested-relational or semi-structured data. The original version handled data in Protocol Buffers (ProtoBuf) format. Dremel requires users to define the schema upfront for all records. BigQuery is a cloud-based commercialization of Dremel and is extended to handle CSV and JSON formats. Drill is an open-source version of Dremel.
Asterix is a system for managing and analyzing semi-structured data using the Asterix Data Model (ADM), which is a generalization of JSON, and the Asterix Query Language (AQL). Asterix does not support standard SQL, nor does it have the fast access afforded by the present disclosure.