1. Field of the Invention
The present invention relates to a method, system, and program for performing a join operation on a multi-column table and satellite tables with a join condition and, in particular, joining multiple dimension tables with a fact table in a star join.
2. Description of the Related Art
Data records in a computerized relational database management system (RDBMS) are maintained in tables, which are a collection of rows all having the same columns. Each column maintains information on a particular type of data for the data records which comprise the rows. One or more indexes may be associated with each table. An index is an ordered set of pointers to data records in the table based on the data in one or more columns of the table. In some cases, all the information needed by a query may be found in the index, making it unnecessary to search the actual table. An index is comprised of rows or index entries which include an index key and a pointer to a database record in the table having the key column values of the index entry key. An index key is comprised of key columns that provide an ordering to records in a table. The index key columns are comprised of the columns of the table, and may include any of the values that are possible for that particular column. Columns that are used frequently to access a table may be used as key columns.
Organizations may archive data in a data warehouse, which is a collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. One data warehouse design implementation is known as star schema or multidimensional modeling. The basic premise of star schemas is that information is classified into two groups, facts and dimensions. A fact table comprises the main data base records concerning the organization""s key transactions, such as sales data, purchase data, investment returns, etc. Dimensions are tables that maintain attributes about the data in the fact table. Each dimension table has a primary key column that corresponds to a foreign key column in the fact table. Typically, the fact table is much larger than the related dimension tables.
The fact table typically comprises numerical facts, such as the date of a sale, cost, type of product sold, location, site of sale, etc. The dimension table usually provides descriptive textual information providing attributes on one of the fact table columns. For instance, a time dimension table can provide attributes on the date column in the fact table describing the date of sale. The time dimension table may provide various weather conditions or events that occurred on particular dates. Thus, the time dimension table provides attributes on the time, i.e., weather, important events, etc., about data columns in the fact table.
The star schema provides a view of the database on dimension attributes that are useful for analysis purposes. This allows users to query on attributes in the dimension tables to locate records in the fact table. A query would qualify rows in the dimension tables that satisfy certain attributes or join conditions. The qualifying rows of the dimension tables have primary keys that correspond to foreign keys in the fact table. A join operation, such as an equijoin or natural join, is then performed on the qualifying rows of the joined dimension tables and the fact table. This join results in returning fact table entries that match the rows of the joined dimension tables, i.e., fact table entries that satisfy the search criteria on the dimension tables. Thus, join operations are used to query a fact table on dimension table attributes.
A join operation combines or concatenates the rows from the different dimension tables according to a condition or predicate to determine values to apply against the fact table. This is distinguished from a Cartesian product which concatenates every row from one table with every other row from another table without regard to a condition or predicate to exclude rows from the result. Rows from the tables involved in a join operation that do not satisfy the predicate or condition are excluded from the join result. Because the dimension tables are unrelated, the rows of the dimension tables that satisfy the join condition are then concatenated in every possible combination.
The Cartesian product of the rows of the dimension tables provides a data view of the entire space, i.e., Cartesian space, of every possible combination of the possible dimension table values. The join result, on the other hand, is a subset of the Cartesian space that is limited to those Cartesian space points that satisfy the join or search condition. One common type of join operation is an equijoin. An equijoin combines two rows from different tables that are equal according to some attribute. Once the combination of all dimension table rows that satisfy the search criteria is produced, the resulting rows are then applied to the Fact table in an equijoin operation to locate those rows in the fact table that have the same values as the rows resulting from the join on the dimension tables. Typically, the primary key columns of the dimension tables in the join result are compared against the corresponding foreign key columns in the Fact table to produce the equijoin results.
In multi-dimensional analysis, it is often desirable to form a query on the attributes specified in the dimension tables and then locate all records in the fact table that satisfy the criteria on the dimension table attributes. To perform such a query, the query engine joins the dimension tables on the conditions specified in the search criteria. The query engine then equijoins the dimension tables with the fact table to produce join results that satisfy the join condition.
The above query technique using join operations is very inefficient because the results of the join operation on the dimension tables may produce numerous concatenations that do not exist in the fact table. In fact, it has been found that on average only 1% or less of the concatenated results of the join operation on the dimension tables have corresponding matching entries in the Fact table that would concatenate in an equijoin operation. Nonetheless, prior art techniques would attempt to join all of the join results from the dimension tables to the fact table even though many of these attempted joins would not produce results as less than 1% of the concatenated results of the dimension table joins have corresponding matches in the fact table. Thus, numerous join operations are performed for which there will be no join result, thereby needlessly consuming I/O operations to perform the non-matching join operations.
FIG. 1 illustrates an example of a star schema 2 with multiple dimension tables 4, 6, and 8 and a fact table 10. The fact table 10 includes sales data, wherein each record includes information on the amount sold in the AMOUNT column 12; the time of sale in the TID column 14, which includes a time identifier; the product sold in the PID column 16 which is a product identifier; and the location of the sale, e.g., store location, in the GID column 18, which is a geographic identifier. The dimension tables 4, 6, and 8 provide attributes on the TID 14, PID 16, and GID 18 columns in the fact table.
The primary key columns of each of the dimension tables 4, 6, 8 are the TID column 20, PID column 28, and GID column 36, respectively. The columns 14, 16, and 18 in the fact table 10 are foreign keys that correspond to primary keys 20, 28, and 36 of the dimension tables 4,6, 8 that provide attributes on the data in the fact table 10. For instance dimension table 4 provides attributes for each possible TID value, including month information in column 22, quarter of the TID in the quarter column 24, and the year of the TID in the year column 26. Dimension table 6 provides product attributes for each PID value, including the product item in item column 30, the class of the product in the class column 32, and the inventory location of the product in inventory column 34. The dimension table 8 provides attributes for each possible GID value, including the city of the GID in the city column 38, the geographical region in the region column 40, and the country in the country column 42.
To locate records in the fact table 10 with a query on the attributes of multiple dimension tables 4, 6, and 8, the query engine would first join the dimension tables 4, 6, and 8 according to conditions specified on the search criteria of the query. The results of the join of the dimension tables would be equijoined with the Fact table to find rows in the fact table 10 that match the attributes in the rows of the joined dimension table on time 4, product 6, and geographic location 8. The number of comparisons of the rows formed from the joined dimension tables could require a vast magnitude of calculations. For instance, if the dimension table values that satisfied the search criteria included 60 time values, 50,000 product values, and 1,000 geographical locations, then the concatenation of these rows in a join operation would produce 3 billion possible values to apply against the fact table 10 in an equijoin, even though likely only 1% of the entries in the fact table 10 would be concatenated in an equijoin.
There is thus a need in the art for an improved method for performing star join queries on multiple dimension tables.
To overcome the limitations in the prior art described above, preferred embodiments disclose a method, system, and program for performing a join operation on a multi-column table and at least two satellite tables having a join condition. Each satellite table is comprised of multiple rows and at least one join column. The multi-column table is comprised of multiple rows and at least one column corresponding to the join column in each satellite table. A join operation is performed on the rows of the satellite tables to generate concatenated rows of the satellite tables. One of the concatenated rows is joined to the multi-column table and a returned entry from the multi-column table is received. A determination is then made as to whether the returned entry matches the search criteria. If so, a determination is made as to whether one of the satellite tables has duplicates of values in the join column of the returned matching entry or the multi-column table has duplicate entries in the join columns. Returned matching entries are generated for each duplicate value in the satellite tables and duplicate entry in the multi-column table.
In further embodiments, determining whether the satellite table and multi-column table have duplicate values and entries and generating the returned matching entries comprises determining the satellite tables including join columns and a number of matching values for the join column in the returned matching entry. A number of instances of the returned matching entry in the multi-column table is then determined. A further determination is made of all possible combinations of the number of matching entries in the multi-column table and the matching values in each satellite table. The matching entry is returned for each possible combination of the matching entries and matching values.
In still further embodiments, determining all the possible combinations comprises generating at least one nested loop to iterate through the matching values in each determined satellite table and all the matching entries in the multi-column table. The matching entries are returned for each iteration of each of the nested loops.
Still further, the multi-column table may be the outermost table in the nested loop with respect to all the matching values of the satellite tables. Alternatively, the multi-column table may be an inner table with respect to at least one satellite table in the nested loops.
With preferred embodiments, reposition logic may be used to skip values in the space defined by the concatenation of the dimension tables even in the case where the dimension tables include duplicate values. In the case where the dimension tables include duplicate values, additional instances of the matching table entry are returned for the additional instances of the value in the dimension table. Without this logic to specifically process duplicate values in the dimension tables, matching table entries would not be returned for the duplicate values.