Storing Multidimensional Data
Database systems are often designed to maintain huge amounts of information about a variety of entities, events, or occurrences (referred to generally as occurrences), and these occurrences may be described by a variety of characteristics. Even database systems that do not yet contain huge amounts of information are often designed to be scalable such that the database systems can be adapted to accommodate huge amounts of information. Huge tables, which may include every occurrence and every characteristic of every occurrence, may be impossible to analyze if there are not enough resources to store and process significant portions of these tables. Even if sufficient resources are available, storing and processing significant portions of these huge tables can be quite costly. As a result, when occurrences have many characteristics or are otherwise related to a variety of information, many database systems separate such information about the occurrences into multiple tables.
Database systems often group information into tables based on categories of characteristics. Much of the information may be descriptive information about entities, categories, or classes of information (referred to generally as categories) involved in the occurrences. The description of these underlying categories may change infrequently compared to the other tables that record or measure the occurrences themselves. Dimension tables are tables that contain descriptive information about occurrences that are referenced by or may be referenced by other table(s). The other table(s) include column(s) that reference row(s) of the dimension table(s), and each referencing column identifies what is referred to as a dimension of column(s) that occur in dimension table(s). Data that is organized into two or more dimensions is referred to herein as multidimensional data.
Fact tables are the other tables that measure the occurrences related to the categories. In other words, fact tables store facts or measurable quantitative data, and this measurable data may be involved with or otherwise fall under the categories. By referencing the dimension tables, the fact tables do not need to duplicate all of the information contained in the dimension tables. Generally, because fact tables may include multiple occurrence(s) that reference the same category, fact tables are usually larger than dimension tables. Also, because fact tables measure the occurrences rather than recording the definitions, the fact tables are usually updated more frequently than dimension tables. An organization of multidimensional data into fact table(s) and dimension table(s) is referred to as a star schema.
In various examples, dimensions may range from five, to tens of thousands, to millions of rows, and the fact table could be from millions to billions of rows.
In one example, a product may be defined and described in one table, and that product as well as other products may appear in sales records of a sales table. Customers may also be described in one table and referenced in the sales table. The sales records in the sales table may correspond to occurrences of sales by a business, and, in particular, may identify which products and customers were involved in the respective sales. The sales records might not provide much if any other description of the products or customers that were involved in the respective sales. Instead, the sales records may refer to the customers and products tables using keys.
A foreign key is a key that is used by a fact table to refer to a candidate key of a dimension table, and the candidate key uniquely identifies records in the dimension table. The foreign key and the candidate key may be the same key or may be mapped to each other according to a mapping.
Tables 102, 104, and 106 of FIG. 1 illustrate an exemplary star schema with two dimensions. As shown, storage device(s) 100 store(s) tables 102, 104 and 106. Table 102 is named “SALES” and contains information about sales that occurred for a business. Each row in sales table 102 contains a unique transaction id (TID) and information about how many units were sold (QTY). Table 104 is named “PROD” and contains information about each type of product that may be sold in any of the stores. Each row in products table 104 contains a unique product-id (PID) and information about the particular product. Table 106 is named “GEOG” and contains information about locations where products may be sold. Each row in geography table 106 contains a unique geography-id (GID) and information about the location.
Sales table 102 also includes the PID and GID columns, which map to corresponding rows in the product table 104 and geography table 106, respectively. For each row in sales table 102, the PID column indicates the type of product that was sold in a particular transaction, and the GID column indicates where the product was sold. The number of transactions may be vastly greater than both the number of locations and the number of products carried by the stores. Detailed information about the product and location involved in a sale transaction does not have to be stored in the rows of table 102 because such detailed information is available in tables 104 and 106, respectively. Instead, the rows of table 102 simply contain values (GIDs and PIDs) that reference information stored in the other tables 104 and 106. Therefore, tables 102, 104 and 106 constitute a star schema in which table 102 is the fact table and tables 104 and 106 are dimension tables.
The data stored in fact table 102 has only two dimensions, and therefore fact table 102 has only two columns dedicated to storing foreign key values for those dimensions. In general, a fact table must dedicate one column for storing foreign key values for each of the dimensions associated with the multidimensional data stored in the fact table. Thus, a fact table that stores data associated with twenty dimensions would dedicate twenty columns to the storage of foreign key values.
Data stored in relational database systems (“ROLAP” systems), for example, according to a star schema, is accessible to all applications that support interaction with such relational systems. Such database applications communicate with the relational database system by submitting commands that conform to the database language supported by the relational database system, the most common of which is the Structured Query Language (SQL). However, relational queries are not designed to handle multidimensional data because relational queries join two tables at a time using local join operations.
Multidimensional data may alternatively be stored in specialized, multidimensional database systems (“MOLAP” systems). Multidimensional database systems provide structures and access techniques specifically designed for multidimensional data. When data is stored in specialized multidimensional database systems, only applications that are specially built to interact with those multidimensional database systems are able to access and manipulate the data. Also, multidimensional star queries describe an entire cube of data before further processing even if the queries request data from only a small portion of the cube.
An alternative approach to managing multidimensional data in a relational database involves storing the data in relational files but maintaining all multidimensional structure, metadata, administration, and access control using multidimensional database system techniques. Accessing relationally-stored data using multidimensional techniques poses numerous difficulties. For example, when all administration and access to the multidimensional data are controlled exclusively through the multidimensional database system engine, two database management systems must be administered. Further, database applications that access data using conventional relational commands (e.g. SQL commands) are unable to access the multidimensional data.
Querying Multidimensional Data
A query is an expression, command, or set of commands that, when executed, causes a server to perform an operation on a set of data. A query may be processed in a predictable order by query processors. For example, the order of precedence may be FROM, WHERE, GROUP BY, HAVING, SELECT, and ORDER BY. Query processors may vary the order if varying the order is predicted to be more efficient as long as the varying the order would not change a result of the query.
In query optimization, a query optimizer or other query processor may transform queries from one form to another form as long as the two forms are semantically equivalent to each other. As used herein, a query is “transformed” when the query is (a) rewritten from a first form to a second form, (b) received in a manner that specifies a first set of operations, such as a first form or a first execution plan, and executed or planned for execution using a second set of operations, such as the operations specified by a second form or second execution plan. An execution plan is a set of directives, such as a query tree structure, that is prepared for an execution engine. Two queries or execution plans are semantically equivalent to each other when the two queries or execution plans, if executed, would produce equivalent result sets, even if the result sets are assembled in different manners by the two queries or execution plans. Execution of a query is semantically equivalent to a query or execution plan if the query execution produces a result set that is equivalent to the one that would be produced by the query or execution plan, if executed.
Queries that operate on data stored in tables that belong to a star schema are referred to as star queries. Star queries often request information from a fact table with constraints that are based on characteristics listed in the dimension tables. For example, a star query may request all sales records that involved customers between the ages of 25 and 30. Although the fact table may include all sales records and identify the customers involved in those sales, the fact table likely does not list the respective ages of those customers. Therefore, evaluation of the star query requires a determination of which fact table records identify customers that fall within the requested ages. Such a determination may consume significant amounts of resources for large fact tables and multiple constraints.
In one example, the following SQL query, Query 1, may be evaluated to determine the dates of all sales of products that cost more than $1,000 from stores in San Jose.
QUERY 1select sales.date from sales, stores, productswhere sales.store-id = stores.store-idand sales.product-id = products.product-idand stores.city = San Joseand products.cost > $1,000
To speed up query evaluation, a query may be executed using indexes, materialized views, or solved cubes that are built using data from the dimension tables and the fact table. For example, Query 1 may be executed by (1) using an index built on the city column of the stores table 102 to determine the set of all of the stores that are in San Jose (“SET1”), (2) using an index built on the cost column of the products table 104 to determine the set of all of the products that cost more than $1,000 (“SET2”), and (3) using an index built on the store-id, product-id columns of the sales table 106 to determine all rows that include both a store-id from SET1 and a product-id from SET 2.
Unfortunately, these methods of executing star queries have significant drawbacks. Specifically, building indexes, materialized views, or solved cubes for each possible combination of fact table columns may be cost prohibitive, especially when there are many dimensions and the fact table contains numerous foreign key columns. If indexes or materialized views are not built for all combinations of columns, then an index or materialized view built on the exact columns of the fact table that are constrained by any given query may not exist.
Additionally, indexes or materialized views built on a superset of the columns constrained by a query are not always helpful. For example, an index built on key1, key2, key3 is useless for a query that does not specify a constraint for key1. Use of indexes built on a subset of the columns constrained by a query may not be efficient either. For example, if an index exists for the “store-id” column of sales table 106, but not for the store-id, product-id columns, then all of the rows in the sales table 106 that contain store-ids from SET1 would have to be retrieved and inspected to determine if they also contain product-ids from SET2. Since each constraint alone may not significantly constrain the query, conventional join techniques typically require scanning a large portion of the fact table when only a very small number of the rows in the fact table actually satisfy all of the constraints of the query.
Another problem with the method of executing star queries described above is that the number of value combinations that satisfy the specified criteria may be vastly larger than the actual number of rows that satisfy the specified criteria. For example, there may be 50,000 products that cost more than $1,000 and there may be 50 stores in San Jose. However, products that cost more than $1,000 may have been sold only three times in San Jose. In the example, the store-id, product-id index would be accessed for 250,000 combinations of store-id, product-id in order to find the three rows that actually satisfy the criteria specified in the query. Further, the number of combinations that must be tested increases greatly as the number of dimension tables referenced by the query increases.
An alternative approach to executing star queries is described in Graefe, G. O'Neil, P. E. “Multiple-table Joins Through Bitmapped Join Indices”, SIGMOD Record, Vol. 24, No. 3, September, 1995. According to this approach, a “join index” is used to access the fact table. A join index assumes a join on a particular set of columns between the fact table and one or more dimension tables. If the join is between the fact table and a dimension table on the normal join columns, it is possible to create a join index for other dimension table columns and the fact table.
A join index has index entries that may have the form <key, bitmap>, where “key” is a value from the dimension table column on which the index is built, and “bitmap” is a bitmap of the fact table indicating those rows in the fact table that would match the key according to the assumed join. Alternatively, the join index can have the format <rowid, bitmap>, where “rowid” identifies a particular row in a dimension table and “bitmap” represents the fact table rows that join with the particular row in the dimension table. Join indexes may also employ an actual list of rowids instead of bitmaps in the index entries.
Creating a join index is tantamount to precomputing a join. Join indexes may greatly increase the cost and/or complexity of doing updates and loading new data, as the join indexes would also need to be updated.
Another approach for processing star queries is to transform the star query to add subqueries that are based on join predicates and constraints on dimension tables that are contained in the original query. The subqueries are executed, and the values returned by the subqueries are used to access one or more bitmap indexes built on columns of the fact table. The bitmaps retrieved for the values returned by each subquery are merged to create one subquery bitmap per subquery. An AND operation is performed on the subquery bitmaps, and the resulting bitmap is used to determine which data to retrieve from the fact table.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.