Storing Multidimensional Data
Database systems are often designed to maintain huge amounts of information about a variety of entities, events, or occurrences (referred to generally as occurrences), and these occurrences may be described by a variety of characteristics. Even database systems that do not yet contain huge amounts of information are often designed to be scalable such that the database systems can be adapted to accommodate huge amounts of information. Huge tables, which may include every occurrence and every characteristic of every occurrence, may be impossible to analyze if there are not enough resources to store and process significant portions of these tables. Even if sufficient resources are available, storing and processing significant portions of these huge tables can be quite costly. As a result, when occurrences have many characteristics or are otherwise related to a variety of information, many database systems separate such information about the occurrences into multiple tables.
Database systems often group tables based on categories of characteristics. Much of the information may be descriptive information about entities, categories, or classes of information (referred to generally as categories) involved in the occurrences. The description of these underlying categories may change infrequently compared to the other tables that record or measure the occurrences themselves. Dimension tables are tables that contain descriptive information about occurrences that are referenced by or may be referenced by other table(s). The other table(s) include column(s) that reference row(s) of the dimension table(s), and each referencing column identifies what is referred to as a dimension of column(s) that occur in dimension table(s). Data that is organized into two or more dimensions is referred to herein as multidimensional data.
Fact tables are the other tables that measure the occurrences related to the categories. In other words, fact tables store facts or measurable quantitative data, and this measurable data may be involved with or otherwise fall under the categories. By referencing the dimension tables, the fact tables do not need to duplicate all of the information contained in the dimension tables. Generally, because fact tables may include multiple occurrence(s) that reference the same category, fact tables are usually larger than dimension tables. Also, because fact tables measure the occurrences rather than recording the definitions, the fact tables are usually updated more frequently than dimension tables. An organization of multidimensional data into fact table(s) and dimension table(s) is referred to as a star schema.
In various examples, dimensions may have millions of rows, and the fact table could have billions of rows. However, the techniques described herein are not limited to any particular ranges for fact tables or dimension tables.
In one example, a product may be defined and described in one table, and that product as well as other products may appear in sales records of a sales table. Customers may also be described in one table and referenced in the sales table. The sales records in the sales table may correspond to occurrences of sales by a business, and, in particular, may identify which products and customers were involved in the respective sales. The sales records might not provide much if any other description of the products or customers that were involved in the respective sales. Instead, the sales records may refer to the customers and products tables using keys.
A foreign key is a key that is used by a fact table to refer to a candidate key of a dimension table, and the candidate key uniquely identifies records in the dimension table. The foreign key and the candidate key may be the same key or may be mapped to each other according to a mapping.
Tables 102, 104, and 106 of FIG. 1 illustrate an exemplary star schema with two dimensions. As shown, storage device(s) 100 store tables 102, 104 and 106. Table 102 is named “SALES” and contains information about sales that occurred for a business. Each row in sales table 102 contains a unique transaction id (TID) and information about how many units were sold (QTY). Table 104 is named “PROD” and contains information about each type of product that may be sold in any of the stores. Each row in products table 104 contains a unique product-id (PID) and information about the particular product. Table 106 is named “GEOG” and contains information locations about where products may be sold. Each row in geography table 106 contains a unique geography-id (GID) and information about the location.
Sales table 102 also includes the PID and GID columns, which map to corresponding columns in the product table 104 and geography table 106, respectively. For each row in sales table 102, the PID column indicates the type of product that was sold in a particular transaction, and the GID column indicates where the product was sold. The number of transactions may be vastly greater than both the number of locations and the number of products carried by the stores. Detailed information about the product and location involved in a sale transaction does not have to be stored in the rows of table 102 because such detailed information is available in tables 104 and 106, respectively. Instead, the rows of table 102 simply contain values (GIDs and PIDs) that reference information stored in the other tables 104 and 106. Therefore, tables 102, 104 and 106 constitute a star schema in which table 102 is the fact table and tables 104 and 106 are dimension tables.
The data stored in fact table 102 only has two dimensions, and therefore fact table 102 only has two columns dedicated to storing foreign key values for those dimensions. In general, a fact table must dedicate one column for storing foreign key values for each of the dimensions associated with the multidimensional data stored in the fact table. Thus, a fact table that stores data associated with twenty dimensions would dedicate twenty columns to the storage of foreign key values.
Data stored in relational database systems (“ROLAP” systems), for example, according to a star schema, is accessible to all applications that support interaction with such relational systems. Such database applications communicate with the relational database system by submitting commands that conform to the database language supported by the relational database system, the most common of which is the Structured Query Language (SQL).
Multidimensional data may alternatively be stored in specialized, multidimensional database systems (“MOLAP” systems). Multidimensional database systems provide structures and access techniques specifically designed for multidimensional data. When data is stored in specialized multidimensional database systems, only applications that are specially built to interact with those multidimensional database systems are able to access and manipulate the data.
An alternative approach to managing multidimensional data in a relational database involves storing the data in relational files but maintaining all multidimensional structure, metadata, administration, and access control using multidimensional database system techniques. Accessing relationally-stored data using multidimensional techniques poses numerous difficulties. For example, when all administration and access to the multidimensional data are controlled exclusively through the multidimensional database system engine, two database management systems must be administered. Further, database applications that access data using conventional relational commands (e.g. SQL commands) are unable to access the multidimensional data.
In one approach, multiple foreign key values of each row in the fact table are mapped to and replaced by a “replacement” value. The mapping allows the database server to derive a replacement value from any given combination of foreign key values. The database server may use an inverse mapping function to derive the keys from the fact table, where they are compressed, but inverse the mapping function does not provide any information beyond what was already stored in the fact table. This mapping function may help reduce the number of foreign key values represented in the fact table, but the mapping function is not helpful during query evaluation for queries that are constrained by several non-key values.
In another approach, the fact table may be modified to be sorted by or otherwise include information about additional characteristics that are useful for filtering purposes. However, maintaining this information in the fact table is cumbersome, and including such information in the fact table also increases the amount of resources required to load rows of the fact table during query evaluation.
Querying Multidimensional Data
Queries that operate on data stored in tables that belong to a star schema are referred to as star queries. Star queries often request information from a fact table with filters that are based on characteristics listed in the dimension tables. For example, a star query may request all sales records that involved customers between the ages of 25 and 30. Although the fact table may include all sales records and identify the customers involved in those sales, the fact table likely does not list the respective ages of those customers. Therefore, evaluation of the star query requires a determination of which fact table records identify customers that fall within the requested ages. Such a determination may consume significant amounts of resources for large fact tables and multiple filters.
In one example, the following SQL query, Query 1, may be evaluated to determine the dates of all sales of products that cost more than $1,000 from stores in San Jose.
QUERY 1select sales.date from sales, stores, productswhere sales.store-id = stores.store-idand sales.product-id = products.product-idand stores.city = San Joseand products.cost > $1,000
To speed up query evaluation, a query may be executed using indices, materialized views, or solved cubes that are built using data from the dimension tables and the fact table. For example, Query 1 may be executed by (1) using an index built on the city column of the stores table 102 to determine the set of all of the stores that are in San Jose (“SET1”), (2) using an index built on the cost column of the products table 104 to determine the set of all of the products that cost more than $1,000 (“SET2”), and (3) using an index built on the store-id, product-id columns of the sales table 106 to determine all rows that include both a store-id from SET1 and a product-id from SET 2.
Unfortunately, these method of executing star queries have significant drawbacks. Specifically, building indexes, materialized views, or solved cubes for each possible combination of fact table columns may be cost prohibitive, especially when there are many dimensions and the fact table contains numerous foreign key columns. If indexes or materialized views are not built for all combinations of columns, then an index or materialized view built on the exact columns of the fact table that are constrained by any given query may not exist.
Additionally, indexes or materialized views built on a superset of the columns constrained by a query are not always helpful. For example, an index built on key1, key2, key3 is useless for a query that does not specify a filter for key1. Use of indexes built on a subset of the columns constrained by a query may not be efficient either. For example, if an index exists for the “store-id” column of a sales table, but not for the store-id, product-id columns, then all of the rows in the sales table that contain store-ids from SET1 would have to be retrieved and inspected to determine if they also contain product-ids from SET2. Since each filter alone may not significantly constrain the query, conventional join techniques typically require scanning a large portion of the fact table when only a very small number of the rows in the fact table actually satisfy all of the filters of the query.
Another problem with the method of executing star queries described above is that the number of value combinations that satisfy the specified criteria may be vastly larger than the actual number of rows that satisfy the specified criteria. For example, there may be 50,000 products that cost more than $1,000 and there may be 50 stores in San Jose. However, products that cost more than $1,000 may have been sold only three times in San Jose. In the example, the store-id, product-id index would be accessed for 250,000 combinations of store-id, product-id in order to find the three rows that actually satisfy the criteria specified in the query. Further, the number of combinations that must be tested increases greatly as the number of dimension tables referenced by the query increases.
An alternative approach to executing star queries is described in Graefe, G. O'Neil, P. E. “Multiple-table Joins Through Bitmapped Join Indices”, SIGMOD Record, Vol. 24, No. 3, September, 1995. According to this approach, a “join index” is used to access the fact table. A join index assumes a join on a particular set of columns between the fact table and one or more dimension tables. If the join is between the fact table and a dimension table on the normal join columns, it is possible to create a join index for other dimension table columns and the fact table.
A join index has index entries that may have the form <key, bitmap>, where “key” is a value from the dimension table column on which the index is built, and “bitmap” is a bitmap of the fact table indicating those rows in the fact table that would match the key according to the assumed join. Alternatively, the join index can have the format <rowid, bitmap>, where “rowid” identifies a particular row in a dimension table and “bitmap” represents the fact table rows that join with the particular row in the dimension table. Join indexes may also employ an actual list of rowids instead of bitmaps in the index entries.
Creating a join index is tantamount to precomputing a join. Join indexes may greatly increase the cost and/or complexity of doing updates and loading new data, as the join indexes would also need to be updated.
Another approach for processing star queries is to transform the star query to add subqueries that are based on join predicates and filters on dimension tables that are contained in the original query. The subqueries are executed, and the values returned by the subqueries are used to access one or more bitmap indexes built on columns of the fact table. The bitmaps retrieved for the values returned by each subquery are merged to create one subquery bitmap per subquery. An AND operation is performed on the subquery bitmaps, and the resulting bitmap is used to determine which data to retrieve from the fact table.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.