Handling huge volumes of data daily is a task that most organizations face. Many have been storing such data for decades, but now that new techniques for analyzing these data sets are available, they seek to use them to improve operational efficiency. Today's data sets are not merely larger than older ones but also significantly more complex: they include unstructured and semi-structured data generated by sensors, web logs, social media, mobile communication, and customer service records.
Many software frameworks exist to store and analyze large volumes of data at massively parallel scale. Apache Hadoop is one example, often cited in journals, publications, blogs, and other technical articles as a massively parallel processing system. It is now widely regarded as the de facto technology platform for storing and processing massive amounts of heterogeneous data.
Hadoop pairs the Hadoop Distributed File System (HDFS) for data storage with a specialized distributed programming model, ‘MapReduce’, for data processing. Running across relatively inexpensive commodity hardware, the two can be used to combine data from many disparate sources and reveal meaningful insights.
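To make the MapReduce model more concrete, the following is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It is an illustration only and does not use Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job`, as well as the sample log lines, are assumptions made for this example. In a real cluster the framework would run the map and reduce steps in parallel over data blocks stored in HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run the three stages in sequence in one process (hypothetical driver)."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical log lines standing in for records read from storage.
    logs = ["error disk full", "warning disk slow", "error network down"]
    print(run_job(logs))
```

Because each call to `map_phase` depends only on its own record and each call to `reduce_phase` depends only on its own key, both stages can in principle be spread across many machines, which is the property the model relies on.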
However, Hadoop has several limitations as a technology. First, organizations are interested in ‘interactive analytics’, which requires a faster time-to-insight than a MapReduce job can deliver. Second, analysts and data scientists need to interact directly with any data stored in Hadoop, using their existing business intelligence (BI) tools and skills through a well-accepted SQL interface. Apache Hive does support querying the data in an SQL-like language called HiveQL, but it is much slower than the industry demands for interactive querying.
Several massively parallel query processing (MPQP) tools on the market, known as SQL-on-Hadoop tools, enable organizations to run interactive SQL-like queries on massive data sets on the Hadoop platform. However, each of these tools is optimized for only a certain class of queries, operating on a known data type and format on a well-defined hardware and software configuration. The data model and the storage model have to be optimized significantly to obtain faster query response times.
To add to the problem, the landscape of massively parallel query processing frameworks is large, and it is increasingly difficult for organizations to evaluate each tool against the different kinds of queries they need to run on varying data sets (for example, queries from marketing, analysts, engineers, and senior management).