The term “dimension” refers to a related set of distinct values. For example, a TIMES dimension can include all dates from January 1998 to December 2003. Similarly, a PRODUCTS dimension can include values representing all possible products of a company.
A set of data items is “dimensional” if each data item in the set is associated with a value from a particular dimension. For example, assume that each row of a table includes data about a particular event, including the date of the event. In this example, the “event data” is “dimensional” relative to the TIMES dimension.
A set of data items is “multi-dimensional” if the data items are dimensional relative to more than one dimension. For example, assume that each row of a SALES table includes data for a particular sale, including (1) the date of the sale, (2) the product sold, and (3) the region in which the sale was made. In this example, the “sales data” is multidimensional, since the sales data is dimensional relative to the TIMES, PRODUCTS and REGION dimensions.
A table that stores multi-dimensional data is often referred to as a “fact table”. Tables that store the dimension values of a particular dimension are referred to as “dimension tables”. Thus, the same database that would have the SALES table described above would typically also include a TIMES table, a PRODUCTS table, and a REGION table.
Each row of a fact table will correspond to a dimensional value combination that includes one value for each of the dimensions. For example, in the SALES table described above, each row will typically correspond to a combination of TIMES, PRODUCTS and REGION dimension values. The set of dimension values associated with a given SALES table row may be represented as (t, p, r), where t is the value for the TIMES dimension, p is the value for the PRODUCTS dimension, and r is the value for the REGION dimension.
Typically, not all dimension value combinations will have a corresponding row in the fact table. Thus, the set of dimension value combinations that are associated with rows in the fact table is a subset of the cross-product of the dimension values from each of the dimensions.
A fact table is referred to as “dense” along a dimension ‘D’ if the fact table contains all possible values of ‘D’ for any given combination of the other dimensional values in the fact table. For example, assume that the REGION dimension has only three possible values RGN1, RGN2 and RGN3. The SALES table is dense relative to the REGION dimension if, for every combination (t, p) reflected in the SALES table, the SALES table includes rows for the dimension value combinations (t, p, RGN1), (t, p, RGN2) and (t, p, RGN3).
“Densification” is the process of making a set of data denser than it originally was along a dimension of interest. A set of rows may be densified, for example, by creating dummy rows for missing combinations of dimensional values. Data whose density has been increased will be referred to as having been “densified, ” and the dimension with respect to which it has been or is being increased will be referred to as the “densifying” dimension.
Densification is useful for a variety of situations. For example, queries in some multidimensional database systems (e.g., On-Line Analytical Processing (OLAP)) require data to be densified along the time dimension. Also, some users, such as OLAP users, are accustomed to seeing the data in a densified format, especially when window functions are computed and presented. For example, if there are no sales for a particular day, some users still want to see a display showing the running total of sales, the day, and a blank space in the sales column (because the running total is a window function in OLAP, which usually displays densified data).
Using Structured Query Language (SQL), densification may be performed by a series of operations that include the DISTINCT, CROSS JOIN and OUTER JOIN operations. As an example, consider a database schema that includes the following tables:                SALES (time_id, prod_id, amount_sold)        TIMES (time_id, calendar_month_desc, calendar_quarter_desc, Calendar_year)        PRODUCTS (prod_id, prod_subcategory, prod_category)        
In the above set of tables, the SALES fact table stores a measure (amount sold) that is dimensional relative to the TIMES and PRODUCTS dimensions. For each row in the SALES table, the “time_id, ” column stores the time dimension value for the row, and the “product_id, ” column stores the PRODUCTS dimension value for the row. Thus, for a given combination of a product value and a time value included in the SALES fact table, the amount_sold column stores the measure value “amount sold”.
The TIMES dimension table stores all time_id values, and details about the time that corresponds to each time_id value. Similarly, the PRODUCTS dimension table stores all product_id values, and details about the product that correspond to each product_id value.
Suppose, the data in SALES is sparse along the TIMES dimension. A query, Q1, that will densify the SALES data along the TIMES dimension is Q1:
SELECT V2.prod_id, V2.time_id, V1.amount_soldFROM SALES V1 RIGHT OUTER JOIN(  (SELECT DISTINCT prod_id    FROM SALES) CROSS JOIN  (SELECT time_id    FROM TIMES)) V2ON (V1.prod_id = V2.prod_id AND V1.time_id = V2.time_id);                In Q1, V2 includes all combinations of (1) prod_id values that are actually in the SALES table, and (2) all time_id values in the TIMES table. In particular, the “SELECT DISTINCT prod_id FROM SALES” clause finds all of the unique values of prod_id found in the SALES table. Similarly, the “SELECT time_id FROM TIMES” clause finds all values of time_id in the TIMES table. The “CROSS JOIN” construct causes a cross product to be taken between the unique prod_ids found and all time_ids.        
The RIGHT OUTER JOIN operation between V1 (the SALES table) and V2 produces a result set that includes (1) all of the rows in the SALES table, and (2) empty rows that correspond to the time_id and prod_id combinations, from V2, which are not found in the SALES table.
FIG. 1 is a flowchart showing method 100 for performing the densification, which is used in Q1. In step 102, a sort is performed to obtain all DISTINCT values of prod_id in the SALES table. In step 104, a CROSS JOIN of all prod_id values with time_id values in the TIMES table is performed, thereby obtaining a set of (prod_id, time_id) dimension value combinations that is dense in the Time dimension, but includes only those values of prod_id found in the SALES table. In step 106, an OUTER JOIN of the SALES fact table with the results of the CROSS JOIN of step 104 is performed, thereby adding blank rows to the SALES table for any (prod_id, time_id) dimension value combination of the CROSS JOIN not found in the original SALES table.
As another example using the same fact table and dimension tables, suppose the user is interested in a running value of year-to-date (YTD) sales for each product on each day. Assuming that the data is sparse, a query, Q2, expressed in SQL that produces the YTD sales data is
Q2:
SELECT V2.prod_id, V2.time_id, SUM(sales) OVER  (PARTITION BY V2.prod_id, V2.year ORDER BY  V2.time_id) YTD_salesFROM SALES V1 RIGHT OUTER JOIN  (    (SELECT DISTINCT prod_id      FROM SALES) CROSS JOIN    (SELECT time_id, calendar_year year      FROM TIMES)  ) V2ON (V1.prod_id = V2.prod_id AND V1.time_id = V2.time_id);In Q2, the “FROM SALES . . . ” statement is identical to Q1, and causes the same sequence of operations to occur. The “SELECT V2.prod_id . . . ” statement, sums the sales values. The sum is returned in the YTD_sales variable. The data returned is partitioned according to product and year by the standard SQL PARTITION BY construct, and then ordered by time_id using the ORDER BY construct.
Similar to the example of Q1, calculating the year-to-date sales requires: a sort to get all DISTINCT values of prod_id in SALES table, (2) a cross join of all prod_id values with time_id values in the TIMES table, (3) an OUTER JOIN of the SALES table with the result of the CROSS JOIN in (2). Additionally, this query performs (4) a sort of the OUTER JOIN results in (3) on columns (prod_id, year, time_id) to compute the window function associated with the YTD.
The present inventors have recognized that the sort of step 102 (or step (1)) is an extra computation that is not needed for the final result, but cannot be avoided in the prior art. Additionally, the present inventors have recognized that Q1 is not intuitive. Especially with more complicated densification queries, it can be very difficult to decipher the intent of the user by inspecting the statement, because of the unintuitive nature of the series of operations used to perform the densification. Thus, the structure of the current way of performing densification within query is complex, hard to understand, and inefficient to compute.