With the continued proliferation of information sensing devices (e.g., mobile phones, online computers, RFID tags, sensors, etc.), increasingly large volumes of data are collected for various business intelligence purposes. For example, the web browsing activities of online users are captured in various datasets (e.g., cookies, log files, etc.) for use by online advertisers in targeted advertising campaigns. Data from operational sources (e.g., point of sale systems, accounting systems, CRM systems, etc.) can also be combined with the data from online sources. Relying on traditional database structures (e.g., relational) to store such large volumes of data can result in database statements (e.g., queries) that are complex, resource-intensive, and time-consuming. Deploying multidimensional database structures enables more complex database statements to be interpreted (e.g., executed) with substantially less overhead. Some such multidimensional models and analysis techniques (e.g., online analytical processing or OLAP) allow a user (e.g., business intelligence analyst) to view the data in “cubes” comprising multiple dimensions (e.g., product name, order month, etc.) and associated cells (e.g., defined by a combination of dimensions), each holding a value that represents a measure (e.g., sale price, quantity, etc.). Further, with such large volumes of data from varying sources and with varying structures (e.g., relational, multidimensional, delimited flat file, document, etc.), the use of data warehouses and distributed file systems (e.g., Hadoop distributed file system or HDFS) to store and access data has increased. For example, an HDFS can be implemented for databases having a flat file structure with predetermined delimiters, and associated metadata (e.g., describing the keys for the respective delimited data values), to accommodate a broad range of data types and structures. Various query languages and query engines (e.g., Impala, SparkSQL, Tez, Drill, Presto, etc.) are available to users for querying data stored in data warehouses and/or distributed file systems.
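As a purely illustrative aid, the cube structure described above can be sketched in a few lines of Python. The fact rows, the function names (`build_cube`, `roll_up`), and the derived "revenue" measure are hypothetical and are chosen only to show how a cell is defined by a combination of dimension values (here, product name and order month) and holds aggregated measures; the sketch is not intended to represent any particular OLAP engine.

```python
from collections import defaultdict

# Hypothetical fact rows: (product_name, order_month, sale_price, quantity).
FACT_ROWS = [
    ("widget", "2024-01", 10.0, 2),
    ("widget", "2024-01", 10.0, 1),
    ("widget", "2024-02", 12.0, 4),
    ("gadget", "2024-01", 25.0, 1),
]


def build_cube(rows):
    """Aggregate fact rows into cells keyed by (product_name, order_month)."""
    cube = defaultdict(lambda: {"revenue": 0.0, "quantity": 0})
    for product, month, price, qty in rows:
        cell = cube[(product, month)]
        cell["revenue"] += price * qty  # assumed measure: price times quantity
        cell["quantity"] += qty
    return dict(cube)


def roll_up(cube, axis):
    """Collapse the cube onto one dimension (0 = product, 1 = month)."""
    totals = defaultdict(float)
    for key, cell in cube.items():
        totals[key[axis]] += cell["revenue"]
    return dict(totals)


cube = build_cube(FACT_ROWS)
revenue_by_product = roll_up(cube, 0)
revenue_by_month = roll_up(cube, 1)
```

In this sketch, a query against a single cell or a rolled-up dimension reads a handful of precomputed values rather than rescanning every fact row, which is the source of the reduced overhead attributed above to multidimensional structures.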
Such distributed file systems, however, can be “append only” stores and can comprise fact tables with over a billion rows. Further, these stores are continually being modified (e.g., new rows appended) with new data, raising challenges related to data quality (e.g., “freshness”, accuracy, etc.). The users of such large datasets therefore desire to query the datasets with a high level of performance, characterized by fast query response times and accurate query results, across various query engines and data storage environments. Some legacy approaches for querying such large datasets can directly query the full dataset with available query languages and query engines. However, such queries can take minutes and sometimes hours to execute, which not only fails to deliver the desired fast query response times but also expends costly computing resources and human resources. Other legacy approaches can store historical query results (e.g., multidimensional cell results) in a query cache for later use. While this approach can improve query performance when new queries are matched to the cached query results, it can be limited in query result quality (e.g., the most recent or fresh data are not in the cached results) and also limited in query response time (e.g., the new query does not match the cached results, the cached results have become large and time-consuming to query, etc.). The aforementioned legacy approaches can further be limited in the ability to operate across a variety of query languages and query engines.
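The freshness limitation of the query cache approach can be illustrated with a toy Python model. The classes `AppendOnlyStore` and `NaiveQueryCache` and their methods are hypothetical and exist only for this illustration; the sketch assumes a cache that is keyed on the query and is never invalidated, and it does not represent any specific legacy system.

```python
class AppendOnlyStore:
    """Toy append-only fact table holding (product_name, quantity) rows."""

    def __init__(self):
        self.rows = []

    def append(self, product, quantity):
        self.rows.append((product, quantity))

    def total_quantity(self, product):
        # Full scan: always fresh, but cost grows with the number of rows.
        return sum(q for p, q in self.rows if p == product)


class NaiveQueryCache:
    """Toy cache that stores historical results and never invalidates them."""

    def __init__(self, store):
        self.store = store
        self._results = {}

    def total_quantity(self, product):
        if product not in self._results:  # cache miss: fall back to a full scan
            self._results[product] = self.store.total_quantity(product)
        return self._results[product]


store = AppendOnlyStore()
store.append("widget", 2)
store.append("widget", 3)

cache = NaiveQueryCache(store)
first_answer = cache.total_quantity("widget")   # computed by full scan, then cached

store.append("widget", 4)                       # new data arrives after caching
cached_answer = cache.total_quantity("widget")  # served from the stale cache entry
fresh_answer = store.total_quantity("widget")   # full scan sees the appended row
```

Here the cached answer no longer agrees with the full scan once a row is appended, while a query for any product not yet cached still incurs the full scan cost, mirroring the quality and response time limitations noted above.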
The problem to be solved is rooted in technological limitations of the legacy approaches. Improved techniques, and in particular, improved application of technology, are needed to address the problem of fast and high-quality querying of large datasets across a variety of data storage environments. More specifically, the technologies applied in the aforementioned legacy approaches fail to achieve the sought-after capabilities of the herein disclosed techniques for dynamic aggregate generation and updating for high performance querying of large datasets. Techniques are therefore needed to improve the application and efficacy of various technologies as compared with the legacy approaches.