Big data analytics is receiving significant interest from research and engineering organizations, and the efforts of big data practitioners have advanced the state of the art in recent years. MapReduce (MR), with Hadoop as its open-source implementation, has emerged as the platform of choice for many of these practitioners because it scales to large data sets, easily incorporates new data sources, and supports querying without a lengthy schema-definition process. Such platforms usually provide SQL or SQL-like query interfaces (e.g., Hive) for easier programmability.
Analytical queries on big data tend to evolve over time. For example, suppose an analyst wishes to identify the top-1000 wine lovers in Northern California, who could be a good target customer set for a promotion campaign. Many aspects of this request are vague, e.g., how to define a top customer, or which data sets are most appropriate for the analysis. The analyst therefore typically starts with a best-effort query and issues a sequence of refined queries before obtaining a satisfactory result. This pattern will become increasingly common as the cost of querying big data continues to fall, enabling more people, including non-database experts, to perform analytics on big data.

Naturally, a data analytics system that selectively reuses the results and intermediate results produced by previous queries, referred to as opportunistic views, could deliver much greater performance for such workloads. In Hadoop, each MR job materializes its intermediate results for the purpose of failure recovery, and a typical Hive query spawns a multi-stage job involving several such materializations. These materialized results are artifacts of query execution, generated automatically as a by-product of query processing, which is why they are called opportunistic views.
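The reuse idea can be sketched as a cache that maps a canonical signature of the sub-plan producing each materialized result to that result, so a later refined query sharing a sub-plan can skip re-execution. This is a minimal illustration, not Hive's actual mechanism; all class and function names below are hypothetical.

```python
# Sketch: a cache of "opportunistic views" keyed by a canonical signature
# of the sub-plan that produced each materialized intermediate result.
# Hypothetical names; not a Hive/Hadoop API.
import hashlib
import json

class OpportunisticViewCache:
    def __init__(self):
        self._views = {}  # sub-plan signature -> materialized result

    @staticmethod
    def signature(subplan):
        # Canonicalize the sub-plan (represented here as a dict of
        # operators and parameters) so equivalent sub-plans hash alike.
        blob = json.dumps(subplan, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def register(self, subplan, result):
        # Record an intermediate result the framework materialized anyway.
        self._views[self.signature(subplan)] = result

    def lookup(self, subplan):
        # Return a matching materialized result, or None on a miss.
        return self._views.get(self.signature(subplan))

# First (best-effort) query: its filter stage's output is materialized
# by the MR framework for failure recovery, so we register it for reuse.
cache = OpportunisticViewCache()
filter_stage = {"op": "filter", "table": "purchases",
                "pred": "region = 'NorCal'"}
cache.register(filter_stage, ["...filtered rows..."])

# A refined follow-up query that shares the same filter sub-plan can
# start from the materialized intermediate instead of rescanning.
reused = cache.lookup({"op": "filter", "table": "purchases",
                       "pred": "region = 'NorCal'"})
```

In a real system the signature would come from a normalized logical plan rather than a dict, and matching would also handle subsumption (a view computed under a stricter predicate), but the lookup-by-plan-signature pattern is the core of the reuse.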