The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
In the context of database systems, a “dimension” is a list of values that provide categories for data. A dimension acts as an index for identifying values of a variable. For example, if sales data has a separate sales figure for each month, then the data has a MONTH dimension. That is, the data is organized by month. A dimension is similar to a key in a relational database. Data that is organized by two or more dimensions is referred to as “multidimensional data”.
Any item of data within a multidimensional variable can be uniquely and completely selected by specifying one member for each of the variable's dimensions. For example, if a sales variable is dimensioned by MONTH, PRODUCT, and MARKET, specifying “January” for the MONTH dimension, “Stereos” for the PRODUCT dimension, and “Eastern Region” for the MARKET dimension uniquely specifies a single value for the variable. Thus, dimensions offer a concise and intuitive way of organizing and selecting data for retrieval, updating, and performing calculations.
Multidimensional data may be stored in relational database systems. When multidimensional data is stored in a relational database system, applications access the data by submitting commands that conform to the database language supported by the relational database system, the most common of which is the Structured Query Language (SQL).
Relational database systems store data in the form of related tables, where each table has one or more columns and zero or more rows. The conventional mechanism for storing multidimensional data in a relational database system is to store the data in tables arranged in what is referred to as a star schema. In relational database systems, a star schema is distinguished by the presence of one or more relatively large tables and several relatively smaller tables. Rather than duplicating the information contained in the smaller tables, the larger tables refer to rows in the smaller tables. Within a star schema, the larger tables are referred to as “fact tables”, while the smaller tables are referred to as “dimension tables”. FIG. 1 illustrates an exemplary star schema with two dimensions.
Referring to FIG. 1, it illustrates a database 100 that includes tables 102, 104, and 106. Table 102 is named “stores” and contains information about each of the stores in which sales may occur. Each row in the stores table 102 contains a unique store-id and information and the particular store that corresponds to the store-id. Table 104 is named “products” and contains information about each type of product that may be sold in any of the stores. Each row in the products table 104 contains a unique product-id and information about the particular product.
Table 106 is named “sales” and contains information about each sale in each of the stores represented in the stores table 102. Each row in the sales table 106 includes a dollar amount, a store-id to indicate the store at which the sale was made, a product-id to indicate the product sold in the sale, and the date of the sale. Typically, the number of sales will be vastly greater than both the number of stores at which the sales are made and the number of products carried by the stores. Detailed information about the store and product involved in a sale does not have to be stored in the rows of the sales table 106 because such detailed information is available in the stores table 102 and the products table 104, respectively. Instead, the rows of table 106 contain values (store-ids and product-ids) that reference information stored in the other tables 102 and 104. Therefore, tables 102, 104, and 106 constitute a star schema in which table 106 is a fact table and tables 102 and 104 are dimension tables.
The data stored in fact table 106 only has two dimensions, and fact table 106 has two columns (STORE-ID and PRODUCT-ID) dedicated to storing foreign key values for those dimensions. Typically, a fact table dedicates one column for storing foreign key values for each of the dimensions associated with the multidimensional data stored in the fact table. By storing foreign key values in the fact table that refer to rows in the dimension tables, the rows of the fact table can be kept to a relatively small size and the number of columns of the fact table can be kept to a relatively small number. For example, instead of the sales table 106 containing the values in the MANAGER, CITY, and STATE columns of the stores table 102 and the values in the SOURCE, PARTS, and COST columns of the products table 104, the sales table 106 contains foreign key values in two columns, one column referring to rows in the stores table 102 and the other column referring to rows in the products table 104. The number of rows in a typical fact table can be billions or more. In contrast, the number of rows in a dimension table is typically much lower (e.g., in the tens, hundreds, or thousands). Accordingly, a typical star schema is constructed to minimize the amount of data stored in each row of the fact table.
A query on multidimensional data may retrieve aggregates of fact table “measures” constrained by “dimension key values”. For example, a query issued against the tables illustrated in FIG. 1 may retrieve the sum of all “amount” values from sales table 106 that occurred in San Jose. In this example, execution of the query involves joining rows from the sales table 106 with the rows, from stores table 102, where the “city” column has the value “San Jose”.
This type of query is also referred to as a “star query”. The “measures” of a fact table are the values in columns of the fact table that do not contain foreign key values. For example, the values in the AMOUNT and DATE columns of the sales tables 106 are the sale amount measures and the date of sale measures, respectively. The “dimension key values” are the values associated with a particular dimension. For example, the dimension key values for a “region” dimension may be “Northern Region”, “Southern Region”, “Eastern Region” and “Western Region”.
In a star schema, the dimension key values of a dimension are typically stored in the dimension key column of the dimension table associated with the dimension. For example, the dimension key values of stores table 102 are stored in the “store-id” column of stores table 102. Similarly, the dimension key values of products table 104 are stored in the “product-id” column of products table 104.
Star queries often contain filter predicates on the dimension tables. The following is an example of a star query that joins the sales fact table 106 with qualified stores 102 and products 104 dimension tables. The stores dimension table 102 is qualified by the filter predicate st.state=“CA” and the products dimension table 104 is qualified by the filter predicate pr.cost>10. The example star query requests the sum of sale amounts, by store city and product source, for the state of California, for products that cost more than $10.
SELECT st.city, pr.source, SUM(sa.amount)FROM sales sa, stores st, products prWHERE sa.store-id = st.store-id AND sa.product-id = pr.product-id ANDst.state = “CA” AND pr.cost > 10GROUP BY st.city, pr.source
One approach for improving performance of a star query in a relational database management system is to organize related data in the fact table in contiguous disk blocks on disk. A “disk block” is a logical unit of data storage used by a relational database management system for storing database data. A disk block has a block size (e.g., 4 KB) and may encompass one or more underlying file system or operating system blocks. Among other information, a disk block may include the data of one or more rows of a table or tables, or the data of a row may span multiple disk blocks.
In the context of storing multidimensional data in relational database management systems, the organization of related fact table data in contiguous disk blocks on disk is referred to as “clustering”. Clustering can improve performance of star queries in relational database management systems by facilitating compression, indexing, and Input/Output (I/O) pruning. In the context of answering a star query in a relational database management system, I/O pruning refers to avoiding disk scans of certain disk blocks of the fact table that are known not to contain data relevant to the star query when scanning the fact table for data that is relative to the star query. Recent technological improvements in data storage devices have improved database access to the point that database systems sometimes perform table scans in lieu of indexed table access when executing a query, even where indexes are available. Consequently, techniques for clustering to facilitate I/O pruning have recently gained considerable attention in the industry.
There are a variety of different approaches to clustering a fact table in a relational database management system. In one approach, rows of the fact table are stored on-disk in a linear order based on the values in one or more columns of the rows of the fact table. For example, rows of the fact table can first be ordered by a query with an ORDER BY clause specifying one or more columns of the fact table, and then storing the rows in the sorted order on-disk in one or more contiguous disk blocks. In another approach, a fact table is clustered along multiple dimensions based on values in columns of the fact table according to a space filing curve such as a Z-order curve or a Hilbert space filing curve. In both approaches, the fact table is clustered based only on the columns of the fact table. These approaches facilitate I/O pruning when the fact table is clustered based on the columns of the fact table when the star query contains filter predicates on those columns.
Unfortunately, the performance bottleneck of executing a star query in relational database management systems is typically the join between the relatively large fact table and the dimension tables, referred to as a “star join”. The dimension tables are often qualified in a star query by one or more, often highly selective, filter predicates on the dimension tables. In many cases, a star query does not contain any filter predicates on the columns of the fact table. For example, the example star query above contains the filter predicate st.state=“CA” that qualifies the stores dimension table 102, and contains the filter predicate pr.cost>10 that qualifies the products dimension table 104, but does not contain any filter predicate on the AMOUNT column or the DATE column of the sales table 106.
A relational database management system that clusters a fact table based only on the columns of the fact table may perform wasteful disk scanning of the fact table when performing a star join. The disk scan may be wasteful because, in the absence of any filter on the fact table columns, the clustering of the fact table does not help, and all disk blocks of the fact table may need to be scanned. Specifically, even if the sales table 106 is clustered based on “date”, the rows of the sales table 106 that join with the rows of stores table 102 where st.state=“CA”, and with the rows of products table 104 where pr.cost>10, may be randomly distributed within the sales table 106.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.