1. Field of the Invention
The present invention relates to a method, system, and program for performing a join operation on a multi-column table and satellite tables with a join condition and, in particular, joining multiple dimension tables with a fact table in a star join.
2. Description of the Related Art
Data records in a computer database are maintained in tables, which are a collection of rows all having the same columns. Each column maintains information on a particular type of data for the data records which comprise the rows. One or more indexes may be associated with each table. An index is an ordered set of pointers to data records in the table based on the data in one or more columns of the table. In some cases, all the information needed by a query may be found in the index, making it unnecessary to search the actual table. An index is comprised of rows or index entries which include an index key and a pointer to a database record in the table having the key column values of the index entry key. An index key is comprised of key columns that provide an ordering to records in a table. The index key columns are comprised of the columns of the table, and may include any of the values that are possible for that particular column. Columns that are used frequently to access a table may be used as key columns.
Organizations may archive data in a data warehouse, which is a collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. One data warehouse design implementation is known as star schema or multidimensional modeling. The basic premise of star schemas is that information is classified into two groups, facts and dimensions. A fact table comprises the main data base records concerning the organization""s key transactions, such as sales data, purchase data, investment returns, etc. Dimensions are tables that maintain attributes about the data in the fact table. Each dimension table has a primary key column that corresponds to a foreign key column in the fact table. Typically, the fact table is much larger than the related dimension tables.
The fact table typically comprises numerical facts, such as the date of a sale, cost, type of product sold, location, site of sale, etc. The dimension table usually provides descriptive textual information providing attributes on one of the fact table columns. For instance, a time dimension table can provide attributes on the date column in the fact table describing the date of sale. The time dimension table may provide various weather conditions or events that occurred on particular dates. Thus, the time dimension table provides attributes on the time, i.e., weather, important events, etc., about data columns in the fact table.
The star schema provides a view of the database on dimension attributes that are useful for analysis purposes. This allows users to query on attributes in the dimension tables to locate records in the fact table. A query would qualify rows in the dimension tables that satisfy certain attributes or join conditions. The qualifying rows of the dimension tables have primary keys that correspond to foreign keys in the fact table. A join operation, such as an equijoin or natural join, is then performed on the qualifying rows of the joined dimension tables and the fact table. This join results in returning fact table entries that match the rows of the joined dimension tables, i.e., fact table entries that satisfy the search criteria on the dimension tables. Thus, join operations are used to query a fact table on dimension table attributes.
A join operation combines or concatenates the rows from the different dimension tables according to a condition or predicate to determine values to apply against the fact table. This is distinguished from a Cartesian product which concatenates every row from one table with every other row from another table without regard to a condition or predicate to exclude rows from the result. Rows from the tables involved in a join operation that do not satisfy the predicate or condition are excluded from the join result. Because the dimension tables are unrelated, the rows of the dimension tables that satisfy the join condition are then concatenated in every possible combination.
The Cartesian product of the rows of the dimension tables provides a data view of the entire space, i.e., Cartesian space, of every possible combination of the possible dimension table values. The join result, on the other hand, is a subset of the Cartesian space that is limited to those Cartesian space points that satisfy the join or search condition. One common type of join operation is an equijoin. An equijoin combines two rows from different tables that are equal according to some attribute. Once the combination of all dimension table rows that satisfy the search criteria is produced, the resulting rows are then applied to the Fact table in an equijoin operation to locate those rows in the fact table that have the same values as the rows resulting from the join on the dimension tables. Typically, the primary key columns of the dimension tables in the join result are compared against the corresponding foreign key columns in the Fact table to produce the equijoin results.
In multi-dimensional analysis, it is often desirable to form a query on the attributes specified in the dimension tables and then locate all records in the fact table that satisfy the criteria on the dimension table attributes. To perform such a query, the query engine joins the dimension tables on the conditions specified in the search criteria. The query engine then equijoins the dimension tables with the fact table to produce join results that satisfy the join condition.
The above query technique using join operations is very inefficient because the results of the join operation on the dimension tables may produce numerous concatenations that do not exist in the fact table. In fact, it has been found that on average only 1% or less of the concatenated results of the join operation on the dimension tables have corresponding matching entries in the Fact table that would concatenate in an equijoin operation. Nonetheless, prior art techniques would attempt to join all of the join results from the dimension tables to the fact table even though many of these attempted joins would not produce results as less than 1% of the concatenated results of the dimension table joins have corresponding matches in the fact table. Thus, numerous join operations are performed for which there will be no join result, thereby needlessly consuming I/O operations to perform the non-matching join operations.
FIG. 1 illustrates an example of a star schema 2 with multiple dimension tables 4, 6, and 8 and a fact table 10. The fact table 10 includes sales data, wherein each record includes information on the amount sold in the AMOUNT column 12; the time of sale in the TID column 14, which includes a time identifier; the product sold in the PID column 16 which is a product identifier; and the location of the sale, e.g., store location, in the GID column 18, which is a geographic identifier. The dimension tables 4, 6, and 8 provide attributes on the TID 14, PID 16, and GID 18 columns in the fact table.
The primary key columns of each of the dimension tables 4, 6, 8 are the TID column 20, PID column 28, and GlD column 36, respectively. The columns 14, 16, and 18 in the fact table 10 are foreign keys that correspond to primary keys 20, 28, and 36 of the dimension tables 4,6, 8 that provide attributes on the data in the fact table 10. For instance dimension table 4 provides attributes for each possible TID value, including month information in column 22, quarter of the TID in the quarter column 24, and the year of the TID in the year column 26. Dimension table 6 provides product attributes for each PID value, including the product item in item column 30, the class of the product in the class column 32, and the inventory location of the product in inventory column 34. The dimension table 8 provides attributes for each possible GID value, including the city of the GID in the city column 38, the geographical region in the region column 40, and the country in the country column 42.
To locate records in the fact table 10 with a query on the attributes of multiple dimension tables 4, 6, and 8, the query engine would first join the dimension tables 4, 6, and 8 according to conditions specified on the search criteria of the query. The results of the join of the dimension tables would be equijoined with the Fact table to find rows in the fact table 10 that match the attributes in the rows of the joined dimension table on time 4, product 6, and geographic location 8. The number of comparisons of the rows formed from the joined dimension tables could require a vast magnitude of calculations. For instance, if the dimension table values that satisfied the search criteria included 60 time values, 50,000 product values, and 1,000 geographical locations, then the concatenation of these rows in a join operation would produce 3 billion possible values to apply against the fact table 10 in an equijoin, even though likely only 1% of the entries in the fact table 10 would be concatenated in an equijoin.
There is thus a need in the art for an improved method for performing star join queries on multiple dimension tables.
To overcome the limitations in the prior art described above, preferred embodiments disclose a system, method, and program for performing a join operation on a multi-column table and at least two satellite tables having a join condition. Each satellite table is comprised of multiple rows and at least one join column. Each multi-column table is comprised of multiple rows and at least one column corresponding to the join column in each satellite table. A join operation is performed on the rows of the satellite tables to generate a first join result comprising concatenations of the rows of the satellite tables. Less than all the concatenated rows in the first join result that satisfy the join condition are then selected. The selected concatenated rows in the first join result are then joined to the multi-column table.
In further embodiments, the satellite tables or multi-column table may comprise an index ordered on the join columns. If the tables are not an index ordered on join columns, then the tables may be sorted to provide ordering on the join columns.
In still further embodiments, the operation of selecting less than all the first join results to join to the primary table comprises receiving a feed back entry from the fact table in response to the join operation of one of the selected concatenated rows to the fact table. The feed back entry is then processed to select the next concatenated row from the first join results. The next selected concatenated row is then joined to the primary table.
In yet further embodiments, the feedback entry is the next ordered entry in the multi-column table following the concatenated row used in the join operation to the primary table according to the ordering of the join columns in the multi-column table. In such case, processing the feedback entry to select the next concatenated row of the satellite tables from the first join results comprises selecting a concatenated row from the first join results that follows the feedback entry according to the ordering of the join columns in the satellite tables.
Preferred embodiments provide an algorithm for searching a fact table to determine fact table rows that satisfy criteria of qualified dimension rows. Preferred embodiments utilize actual feedback values from the columns in the rows of the fact table to select concatenated rows of the dimension table to apply in join operations with the fact table. In this way, the preferred embodiments avoid searching on non-existent rows in the fact table because the search keys are adjusted to reflect actual rows in the fact table being searched.