With the continued proliferation of information sensing devices (e.g., mobile phones, online computers, RFID tags, sensors, etc.), increasingly large volumes of data are collected for various business intelligence (BI) purposes. For example, the web browsing activities of online users are captured in various datasets (e.g., cookies, log files, etc.) that are used by online advertisers in targeted advertising campaigns. Data from operational sources (e.g., point of sale systems, accounting systems, CRM systems, etc.) can be combined with data from online sources. With such large volumes of data from varying sources and with varying structures (e.g., relational, multidimensional, delimited flat file, document, etc.), the use of data warehouses, distributed file systems (e.g., Hadoop distributed file system or HDFS), and/or other data storage environments to store and access data has increased. For example, an HDFS environment can be implemented for datasets having a flat file structure with predetermined delimiters and associated metadata.
Syntax and semantics for such metadata can be defined so as to accommodate a broad range of data types and structures. For example, metadata might describe keys that are used to access the delimited data values that comprise the datasets. The users (e.g., BI analysts) of such large and dynamic datasets desire to query the datasets in certain ways, using familiar capabilities that derive from a relatively small set of BI tools (e.g., Excel, Tableau, Qlik, etc.). In many cases, an enterprise makes capital and organizational investments in a particular BI tool (e.g., to obtain licenses, train staff, etc.). Commensurate with such capital and organizational investments, the enterprise would desire to use that tool to access any and all current and future data storage environments that may comprise data of interest to the enterprise.
Unfortunately, accessing data across multiple data storage environments from a particular BI tool is fraught with challenges. For example, BI tools often employ standard protocols (e.g., XMLA, HS2, HTTP, etc.) to securely and reliably interact with (e.g., submit queries to) the various query engines (e.g., Impala, Spark SQL, Hive, Drill, Presto, etc.) associated with respective data storage environments (e.g., an HDFS environment, a relational database management system or RDBMS environment, a SQL data warehouse environment, etc.). However, such standard protocols are limited in their functionality, at least in that certain of such standard protocols (e.g., XMLA) might merely facilitate sending a query (e.g., issuing data statements from a BI tool to a storage environment) and receiving the query results (e.g., at the BI tool). During the performance of the protocol, user activity is suspended (e.g., blocked) between the time of sending the query and the time of receiving the results. The duration of the suspension while the user is waiting can be several seconds, or minutes or longer for successively larger datasets. Wait times of this magnitude detract from the user experience. Moreover, a user might not know how long a wait to expect, resulting in further degradation of the user experience.
Other protocols (e.g., HS2) might facilitate polling for a high-order query status (e.g., in process, complete, failed, etc.) and informing the user of such status, but still leave the user waiting for an unknown period of time until query completion (or failure). To provide more information to the user of the BI tool, some approaches rely on a custom protocol that extends the capabilities of the standard protocols. However, such custom solutions involve implementation of certain components (e.g., custom application programming interfaces or APIs and/or custom user interfaces or UIs, etc.), the implementation of which often covers the entire software stack to facilitate operation of such custom solutions.
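The contrast between the two protocol styles described above can be illustrated with a minimal sketch. Everything in it is hypothetical: `SimulatedEngine`, `run_blocking`, and `run_with_polling` are illustrative names, the engine merely counts down a fixed number of steps, and none of it reflects the actual XMLA or HS2 wire protocols or any real client library.

```python
from enum import Enum


class QueryStatus(Enum):
    """Coarse, high-order statuses of the kind a polling protocol might expose."""
    IN_PROCESS = "in process"
    COMPLETE = "complete"
    FAILED = "failed"


class SimulatedEngine:
    """Hypothetical stand-in for a query engine; not a real XMLA or HS2 client."""

    def __init__(self, steps_to_finish):
        self._steps = steps_to_finish
        self._remaining = {}
        self._next_id = 0

    def submit(self, statement):
        """Register a data statement and return a handle immediately."""
        handle = self._next_id
        self._next_id += 1
        self._remaining[handle] = self._steps
        return handle

    def poll(self, handle):
        """Advance the simulated work by one step and report a coarse status."""
        if self._remaining[handle] > 0:
            self._remaining[handle] -= 1
        if self._remaining[handle] > 0:
            return QueryStatus.IN_PROCESS
        return QueryStatus.COMPLETE

    def fetch(self, handle):
        """Return placeholder rows once the simulated work has finished."""
        if self._remaining[handle] != 0:
            raise RuntimeError("results requested before completion")
        return [("row", 1)]


def run_blocking(engine, statement):
    """Send/receive style: the caller observes nothing until results arrive."""
    handle = engine.submit(statement)
    while engine.poll(handle) is QueryStatus.IN_PROCESS:
        pass  # the user is simply blocked here
    return engine.fetch(handle)


def run_with_polling(engine, statement, on_status):
    """Polling style: coarse status is surfaced, but the wait is still unbounded."""
    handle = engine.submit(statement)
    while True:
        status = engine.poll(handle)
        on_status(status)  # e.g., update a status indicator for the user
        if status is not QueryStatus.IN_PROCESS:
            break
    if status is QueryStatus.FAILED:
        raise RuntimeError("query failed")
    return engine.fetch(handle)
```

In this sketch the polling variant reports "in process" repeatedly and then "complete", which conveys that the statement is still running but gives no indication of progress or remaining time.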
For example, a custom solution from each of the many query engines across multiple data storage environments would involve implementing many respective APIs and UIs in the BI tools. Such custom implementations in the BI tools and/or other points in the software stack might be difficult or impossible to obtain. What is needed is an environment-independent and tool-independent technological solution that facilitates enhanced data statement (e.g., query) management (e.g., monitoring, control, etc.) by users of BI tools. More specifically, what is needed is fine-grained management of data statements issued from BI tools to multiple heterogeneous data storage environments without modifying the BI tools.
What is needed is a technique or techniques to improve over legacy techniques and/or over other considered approaches. Some of the approaches described in this background section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.