The design and implementation of modern data storage environments are driven by the increasing volume, velocity, and variety of information assets (e.g., data). Although all three of these characteristics are growing, variety often has the most influence on data storage investment and/or implementation decisions. As an example, an enterprise might desire to have access to 100 TB or more of data that comprises some datasets stored in a variety of modern heterogeneous data storage environments (e.g., Hadoop distributed file system or HDFS), as well as some other datasets stored in a variety of legacy data storage environments (e.g., relational database management systems or RDBMS). Another aspect of variety pertains to the structure of the data (e.g., data type) comprising the datasets. Datasets are represented in various structures or formats ranging from schema-free JSON datasets, to delimited flat file datasets, to non-flat datasets (e.g., Avro, Parquet, XML), to nested data types within other databases (e.g., relational databases, NoSQL databases). The variety of data types is continually increasing.
The existence of such a wide range of data organization and/or storage implementations has given rise to specialized query engines, each developed to serve a particular data type and/or data storage environment. These query engines are architected to efficiently manipulate data (and associated metadata) of a particular representation, and/or to efficiently store and retrieve data within a particular data storage environment. Such query engines can support distinctly different functional capabilities and distinctly different performance characteristics. In some cases, multiple query engines are available for a particular data storage environment and data type combination. In some cases, specialized query engines are tuned for a particular commercial use. As an example, multiple query engines (e.g., Impala, Hive, Spark, Presto, Drill, Pig) might be available to query datasets in a “big data” environment such as HDFS.
Unfortunately, given the panoply of available query engines, identifying which query engine to use for certain data statements (e.g., comprising a data query) is fraught with challenges. Further, developing data statements that are formatted for each identified query engine so as to take advantage of that query engine's capabilities can also present challenges. One legacy approach to addressing such challenges is to determine a priori a target query engine for a particular set of data statements. The data statements are then structured for the target query engine to efficiently operate over a subject dataset. With this approach, however, the data statements might not perform efficiently (or at all) on query engines other than the query engine for which the data statements had been structured.
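The legacy approach described above can be illustrated with a minimal sketch. The names used here (`DataStatement`, `EngineUnavailable`, `dispatch`), the engine identifiers, and the example statement text are all hypothetical and serve only to show how a statement bound a priori to one target query engine has no path to execution when that engine cannot be used.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataStatement:
    """A statement written in advance for one specific engine (hypothetical type)."""
    target_engine: str  # engine chosen a priori, e.g. "impala"
    text: str           # statement text structured for that engine


class EngineUnavailable(Exception):
    """Raised when the a priori target engine cannot accept the statement."""


def dispatch(statement: DataStatement, available_engines: set[str]) -> str:
    """Submit the statement only to its pre-selected engine.

    There is deliberately no fallback: the statement text was structured
    for one engine, so no other engine is assumed to accept it.
    """
    if statement.target_engine not in available_engines:
        raise EngineUnavailable(
            f"{statement.target_engine} is not available; the statement "
            "was structured for that engine only"
        )
    return f"submitted to {statement.target_engine}: {statement.text}"


# Illustrative statement bound to a single engine at authoring time.
stmt = DataStatement(
    target_engine="impala",
    text="SELECT region, SUM(sales) FROM orders GROUP BY region",
)
```

In this sketch, calling `dispatch(stmt, {"hive", "presto"})` raises `EngineUnavailable` even though other engines are reachable, which mirrors the limitation of statements that are structured for a single target query engine.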
Conditions that might demand consideration and selection of an alternate query engine can arise from a wide range of causes. For example, a need to select an alternate query engine can arise from a temporary outage of the target query engine (e.g., the query engine server is down), from a migration of the underlying dataset to another environment served by a different query engine, from the availability of a new query engine with enhanced capabilities that are accessed with new syntax (e.g., instructions, statements, hints, etc.), and so on. In any such case, the original data statements that were formulated for the original target query engine might not perform as intended, or might not perform at all. Further, the user (e.g., business intelligence analyst) and/or system issuing the data statements might not be aware of some or all of the alternate query engines available at the moment in time the original data statements are being formulated. Legacy approaches in which alternate query engines are considered or reconsidered each time data statements are invoked waste a significant amount of human effort as well as a significant amount of computing, storage, networking, and other resources. What is needed is a technological solution that facilitates efficient identification and use of query engines that are available to process data operations on datasets stored in multiple heterogeneous environments.
Therefore, what is needed is a technique or techniques that improve over legacy techniques and/or over other considered approaches for efficiently identifying and using query engines for data operations on a variety of datasets stored in heterogeneous data storage environments. Some of the approaches described in this background section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.