With the continued proliferation of information sensing devices (e.g., mobile phones, online computers, RFID tags, sensors, etc.), increasingly large volumes of data are collected for various business intelligence purposes. For example, the web browsing activities of online users are captured in various datasets (e.g., cookies, log files, etc.) for use by online advertisers in targeted advertising campaigns. Data from operational sources (e.g., point of sale systems, accounting systems, CRM systems, etc.) can also be combined with the data from online sources. Relying on traditional database structures (e.g., relational) to store such large volumes of data can result in database statements (e.g., queries) that are complex, resource-intensive, and time-consuming. Deploying multidimensional database structures enables more complex database statements to be interpreted (e.g., executed) with substantially less overhead. Some such multidimensional models and/or analysis techniques (e.g., online analytical processing or OLAP) can enable a user (e.g., business intelligence analyst) to view the data in “cubes” comprising multiple dimensions (e.g., product name, order month, etc.) and associated cells (e.g., defined by a combination of dimensions), each holding a value that represents a measure (e.g., sale price, quantity, etc.). Further, with such large volumes of data from varying sources and with varying structures (e.g., relational, multidimensional, delimited flat file, document, etc.), the use of data warehouses and distributed file systems (e.g., Hadoop distributed file system or HDFS) to store and access data has increased. For example, an HDFS can be implemented for databases having a flat file structure with predetermined delimiters, and associated metadata (e.g., describing the keys for the respective delimited data values), to accommodate a broad range of data types and structures.
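The cube model described above can be illustrated with a minimal sketch. The fact rows, the dimension names (product name, order month), and the helper functions `build_cube` and `roll_up_by_product` are hypothetical and chosen only for illustration; the sketch assumes a single summed measure (sale price) and is not a description of any particular OLAP engine.

```python
from collections import defaultdict

# Hypothetical fact rows: (product_name, order_month, sale_price).
FACT_ROWS = [
    ("widget", "2015-01", 10.0),
    ("widget", "2015-01", 12.5),
    ("widget", "2015-02", 11.0),
    ("gadget", "2015-01", 30.0),
]


def build_cube(rows):
    """Sum the sale_price measure into one cell per (product_name, order_month)."""
    cube = defaultdict(float)
    for product_name, order_month, sale_price in rows:
        cube[(product_name, order_month)] += sale_price
    return dict(cube)


def roll_up_by_product(cube):
    """Collapse the order_month dimension, leaving one total per product."""
    totals = defaultdict(float)
    for (product_name, _order_month), value in cube.items():
        totals[product_name] += value
    return dict(totals)


cube = build_cube(FACT_ROWS)
totals = roll_up_by_product(cube)
```

Here each key of `cube` is a combination of dimension values and each value is the measure held by that cell. Rolling up over a dimension is one simple form of the aggregation that a query against such a cube might request.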
In many cases, such distributed file systems can be “append only” data stores and can comprise fact tables with over a billion rows. Further, these data stores are continually being modified (e.g., new rows appended) with new data, precipitating challenges related to data quality (e.g., “freshness”, accuracy, etc.). The users of such large and dynamic datasets desire to query the datasets with a high level of performance, characterized by fast query response times and accurate query results, across various query engines (e.g., Impala, Spark SQL, Hive, Drill, Presto, etc.) and data storage environments (e.g., HDFS). One approach for providing such high performance querying might alter certain database structures to reduce access latency. Specifically, an aggregate of a certain portion of a dataset can be generated to facilitate faster access to that portion of the dataset. In some cases, the aggregate might be generated dynamically based at least in part on a query or queries issued by the user. A database structure can also be altered by creating logical and/or physical dataset partitions (e.g., shards) to enable high performance querying. For example, a portion of a dataset that is accessed often might be partitioned to a cache memory and/or other low latency location (e.g., geographically closer data storage facility) to reduce access latency. In certain cases, such database structure alteration operations and/or other functions (e.g., query translation, query planning, etc.) can be implemented by a third party application in one or more layers between the business intelligence (BI) tools of the resource owner (e.g., data owner, user) and the computing and/or storage devices managing the access to the resource (e.g., data). In such cases, the third party application can facilitate a delegated authorization approach (e.g., using LDAP, Kerberos, SAML, OAuth, OpenID, etc.) to receive an authorization from the data owners to access their data using a set of credentials different from those of the resource owners. Such delegated authorization and/or authentication techniques can improve security and/or efficiency in the earlier described data analysis environments.
Unfortunately, legacy techniques for applying delegated data access authorization to altered database structures can be limited, at least as pertaining to database structures that might be dynamically generated. As an example, database structure alterations pertaining to aggregates can inherently lose not only data information (e.g., underlying data details), but also security information (e.g., underlying data access authorizations, permissions, etc.). Some legacy approaches might address such security information loss by inspecting the authorization attributes (e.g., permissions, etc.) associated with the underlying data of the aggregate to recreate permissions for the aggregate structures (e.g., aggregate tables, views, partitions, etc.). For example, a data warehousing environment might implement such an approach when building certain data warehouses for BI tool access. Such approaches, however, can be limited in environments that dynamically perform certain database alterations at query time. In such environments, for example, extracting and recreating authorization attributes for an aggregate might negate any efficiency improvements facilitated by the aggregate. Further, the extraction and/or re-creation methods implemented by various third party applications can differ substantially, resulting in various inefficiencies relating to the interaction of the numerous applications (e.g., tools) in the data analysis ecosystem. Moreover, the multiple database structures (e.g., relational, multidimensional, delimited flat file, document, etc.) comprising the foregoing distributed file systems can precipitate a more complex permissions extraction problem. Other legacy approaches might require each third party application to manage a respective set of authorization attributes to facilitate various database structures (e.g., aggregates, partitions, local caches, etc.) that might be accessed using the third party application. Such approaches can place a significant resource (e.g., computing, storage, human, etc.) burden on the third party application, third party application provider, resource management applications, and/or resource owner to manage multiple copies of authorization attributes across numerous enterprise users and/or third party applications in the ecosystem. In some cases, certain legacy approaches might merely bypass any authorization associated with an aggregate and/or other altered database structure. Such approaches can be particularly limited in high security data environments such as those related to healthcare or financial services.
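The legacy approach described above, in which permissions for an aggregate are recreated by inspecting the authorization attributes of its underlying data, might be sketched as follows. The table names, user names, and the function `derive_aggregate_readers` are hypothetical. The rule used here (a user can read the aggregate only if that user can read every underlying table) is an assumption made for illustration, since no specific derivation rule is stated above.

```python
# Hypothetical read permissions: table name -> set of users allowed to read it.
SOURCE_PERMISSIONS = {
    "sales_facts": {"alice", "bob", "carol"},
    "customer_dim": {"alice", "carol"},
}


def derive_aggregate_readers(source_tables, permissions):
    """Return the users allowed to read every source table of an aggregate.

    Assumed rule (not from the source text): intersect the reader sets of all
    underlying tables. A table with no recorded permissions has no readers.
    """
    if not source_tables:
        return set()
    reader_sets = [permissions.get(table, set()) for table in source_tables]
    return set.intersection(*reader_sets)


readers = derive_aggregate_readers(["sales_facts", "customer_dim"], SOURCE_PERMISSIONS)
```

Even in this simplified form, the derivation must visit the permissions of every underlying table each time an aggregate is created. This hints at why repeating such extraction at query time can offset the efficiency gained from the aggregate.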
What is needed is a technique or techniques to improve over legacy approaches and/or other considered approaches. Some of the approaches described in this background section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.