1. Field of the Invention
The present invention relates to a method, system, and program for joining a multi-column table and at least two satellite tables and, in particular, determining the order of joining the satellite tables and multi-column table.
2. Description of the Related Art
Data records in a relational database management system (RDBMS) in a computer are maintained in tables, each of which is a collection of rows all having the same columns. Each column maintains information on a particular type of data for the data records which comprise the rows. One or more indexes may be associated with each table. An index is an ordered set of pointers to data records in the table based on the data in one or more columns of the table. In some cases, all the information needed by a query may be found in the index, making it unnecessary to search the actual table. An index is comprised of rows or index entries, each of which includes an index key and a pointer to a database record in the table having the key column values of the index entry key. An index key is comprised of key columns that provide an ordering to records in a table. The index key columns are comprised of the columns of the table, and may include any of the values that are possible for that particular column. Columns that are used frequently to access a table may be used as key columns.
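By way of illustration only, the notion of an index as an ordered set of entries, each pairing an index key with a pointer to a record, can be sketched as follows. The table contents, the function names build_index and lookup, and the use of a list position as the record pointer are assumptions made for this sketch and are not taken from any particular RDBMS.

```python
from bisect import bisect_left

# Hypothetical table: each row is (PID, ITEM). The list position of a row
# stands in for the pointer to the database record.
table = [
    (30, "lamp"),
    (10, "desk"),
    (20, "chair"),
    (10, "stool"),
]


def build_index(rows, key_column):
    """Return index entries (key value, row pointer) ordered by key value."""
    return sorted((row[key_column], pointer) for pointer, row in enumerate(rows))


def lookup(index, key):
    """Return the row pointers of every index entry whose key equals `key`."""
    # Row pointers are never negative, so (key, -1) sorts before all
    # entries that carry this key.
    start = bisect_left(index, (key, -1))
    pointers = []
    for entry_key, pointer in index[start:]:
        if entry_key != key:
            break
        pointers.append(pointer)
    return pointers


pid_index = build_index(table, key_column=0)
matching_rows = [table[p] for p in lookup(pid_index, 10)]
```

Because the entries are kept in key order, a binary search locates the first matching entry without scanning the table itself.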
Organizations may archive data in a data warehouse, which is a collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. One data warehouse design implementation is known as star schema or multidimensional modeling. The basic premise of star schemas is that information is classified into two groups, facts and dimensions. A fact table comprises the main database records concerning the organization's key transactions, such as sales data, purchase data, investment returns, etc. Dimensions are tables that maintain attributes about the data in the fact table. Each dimension table has a primary key column that corresponds to a foreign key column in the fact table. Typically, the fact table is much larger than the related dimension tables.
The fact table typically comprises numerical facts, such as the date of a sale, cost, type of product sold, location, site of sale, etc. The dimension table usually provides descriptive textual information providing attributes on one of the fact table columns. For instance, a time dimension table can provide attributes on the date column in the fact table describing the date of sale. The time dimension table may provide various weather conditions or events that occurred on particular dates. Thus, the time dimension table provides attributes on the time, i.e., weather, important events, etc., about data columns in the fact table.
The star schema provides a view of the database on dimension attributes that are useful for analysis purposes. This allows users to query on attributes in the dimension tables to locate records in the fact table. A query would qualify rows in the dimension tables that satisfy certain attributes or join conditions. The qualifying rows of the dimension tables have primary keys that correspond to foreign keys in the fact table. A join operation, such as an equijoin or natural join, is then specified to qualify rows of the fact table. Typically, the primary key columns of the dimension tables in the join result are compared against the corresponding foreign key columns in the fact table to produce the equijoin results.
FIG. 1 illustrates an example of a star schema 2 with multiple dimension tables 4, 6, and 8 and a fact table 10. The fact table 10 includes sales data, wherein each record includes information on the amount sold in the AMOUNT column 12; the time of sale in the TID column 14, which includes a time identifier; the product sold in the PID column 16 which is a product identifier; and the location of the sale, e.g., store location, in the GID column 18, which is a geographic identifier. The dimension tables 4, 6, and 8 provide attributes on the TID 14, PID 16, and GID 18 columns in the fact table.
The primary key columns of the dimension tables 4, 6, 8 are the TID column 20, PID column 28, and GID column 36, respectively. The columns 14, 16, and 18 in the fact table 10 are foreign keys that correspond to primary keys 20, 28, and 36 of the dimension tables 4, 6, 8 that provide attributes on the data in the fact table 10. For instance, dimension table 4 provides attributes for each possible TID value, including month information in column 22, quarter of the TID in the quarter column 24, and the year of the TID in the year column 26. Dimension table 6 provides product attributes for each PID value, including the product item in item column 30, the class of the product in the class column 32, and the inventory location of the product in inventory column 34. The dimension table 8 provides attributes for each possible GID value, including the city of the GID in the city column 38, the geographical region in the region column 40, and the country in the country column 42.
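As a rough sketch only, the star schema of FIG. 1 and an equijoin that qualifies fact table rows through dimension attributes might be expressed as follows using Python's built-in sqlite3 module. The table names (time_dim, product_dim, geo_dim, fact), the column name product_class, and all sample rows are assumptions introduced for illustration; only the column roles (AMOUNT, TID, PID, GID and the dimension attributes) follow the figure as described.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Three dimension tables keyed by TID, PID, and GID, plus a fact table whose
# TID, PID, and GID columns are foreign keys into those dimension tables.
conn.executescript("""
CREATE TABLE time_dim    (tid INTEGER PRIMARY KEY, month INTEGER, quarter INTEGER, year INTEGER);
CREATE TABLE product_dim (pid INTEGER PRIMARY KEY, item TEXT, product_class TEXT, inventory TEXT);
CREATE TABLE geo_dim     (gid INTEGER PRIMARY KEY, city TEXT, region TEXT, country TEXT);
CREATE TABLE fact (
    amount INTEGER,
    tid INTEGER REFERENCES time_dim(tid),
    pid INTEGER REFERENCES product_dim(pid),
    gid INTEGER REFERENCES geo_dim(gid)
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 1998), (2, 4, 2, 1999)])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?)",
                 [(1, "desk", "furniture", "A"), (2, "lamp", "lighting", "B")])
conn.executemany("INSERT INTO geo_dim VALUES (?, ?, ?, ?)",
                 [(1, "San Jose", "West", "USA"), (2, "Boston", "East", "USA")])
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)",
                 [(100, 1, 1, 1), (250, 2, 1, 1), (75, 2, 2, 2), (40, 2, 2, 1)])

# Predicates on dimension attributes qualify dimension rows; the equijoins on
# the key columns then qualify the matching rows of the fact table.
query = """
SELECT SUM(f.amount)
FROM fact f
JOIN time_dim t ON f.tid = t.tid
JOIN geo_dim  g ON f.gid = g.gid
WHERE t.year = 1999 AND g.region = 'West'
"""
total = conn.execute(query).fetchone()[0]
```

Here the query names only dimension attributes (year and region), yet the rows it ultimately aggregates come from the fact table.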
Much effort has been expended in developing optimization techniques to select the best possible join ordering for queries in relational database systems. The order in which the joins are performed has a substantial impact on query performance. Each possible plan for executing an SQL statement is an access plan. The choice of an access plan among the many possible such plans has a substantial effect on performance during execution of the query and joining of tables. The number of possible joins to consider grows exponentially as tables are added to the query. Star schemas involve a large number of dimension tables in the join. Thus, a query optimizer would have to consider perhaps millions of possible permutations from which to select the optimal join order. Further, if a database program has many different join algorithms, then the query optimizer will have to analyze performance not only for every possible join permutation, but also for every possible join algorithm with every possible join permutation.
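To give a sense of the scale involved, the following sketch counts candidate plans under simplifying assumptions that are not part of the description above: only left-deep orderings are counted (one permutation of the tables per ordering), and every join step may independently use any of the available join algorithms. The function names are hypothetical, and real optimizers prune or enlarge this space in ways not modeled here.

```python
import math


def count_join_orders(num_tables: int) -> int:
    """Number of left-deep join orderings, i.e. permutations of the tables."""
    return math.factorial(num_tables)


def count_plans(num_tables: int, num_join_algorithms: int) -> int:
    """Orderings multiplied by the algorithm choices for each of the n-1 joins."""
    return count_join_orders(num_tables) * num_join_algorithms ** (num_tables - 1)


# Ten tables already yield millions of orderings before join algorithms
# are taken into account.
orders_for_ten_tables = count_join_orders(10)
plans_for_ten_tables = count_plans(10, 3)
```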
Most optimizers are cost based because they operate by generating a list of access plans, comparing their costs, and then selecting a least cost plan. Current cost based query evaluation techniques experience significant difficulties when used to evaluate a query involving numerous tables because the number of permutations or orderings to consider expands exponentially as the number of tables involved in the query increases. Many of these query evaluation techniques require significant processing time and memory usage to determine the optimal search plan.
One common query evaluation plan is to use dynamic programming algorithms, which are often difficult or infeasible, or extremely time consuming, to process if many tables, e.g., ten tables or more, are involved in the join operation. The article entitled "Optimization of Large Join Queries: Combining Heuristics and Combinatorial Techniques," by Arun Swami, in the ACM SIGMOD Record Vol. 18, No. 2, pgs. 367-376 by the Association for Computing Machinery (ACM Copyright 1989), discusses problems with dynamic programming query evaluation techniques as the number of tables involved in the query exceeds ten. This article is incorporated herein by reference in its entirety. The commonly assigned U.S. Pat. No. 5,301,317, entitled "System for Adapting Query Optimization Effort to Expected Execution Time," which is incorporated herein by reference in its entirety, includes further discussion of dynamic programming query evaluation plans and their computational complexity and performance problems.
Other query evaluation techniques employ heuristic approaches to limit the search space when selecting an optimal search plan. Further, certain approaches use global optimization strategies to select a strategy that matches certain predefined criteria. Such techniques use substantially fewer processing cycles to select a query plan than the dynamic programming approach, which requires consideration of all or most of the possible access paths. However, heuristic and global optimization techniques do not have the means for dynamically varying the search space and may not select the most desired join order plan.
There is thus a need in the art for an improved system, method, and program for selecting an optimal query plan or ordering of the join tables in a join operation.
To overcome the limitations in the prior art described above, preferred embodiments disclose a system, method, and program for joining a multi-column table and at least two satellite tables. Each satellite table is comprised of multiple rows and at least one join column and each multi-column table is comprised of multiple rows and join columns corresponding to the join columns in the satellite tables. A query including predicates is received. A join predicate column comprises the satellite table and multi-column table join column to which at least one query predicate applies. A determination is then made as to whether there is at least one index on the multi-column table including at least one column for one join predicate column. One index is selected. The ordering of the join predicate columns in the selected index is used to determine the join order of the satellite tables and the multi-column table. The satellite tables and multi-column tables are then joined in the determined join order.
In further embodiments, there are multiple indexes on the multi-column table, each including at least one column corresponding to one join predicate column. In such case, selecting one index comprises determining the join order for an index by using the ordering of the join predicate columns. Then, the cost of performing the join operation is estimated using the join order for each index. The index producing the join order having the lowest cost of the estimated costs for the determined join orders is selected to determine the join order and join processing.
In yet further embodiments, the join order comprises the satellite tables having join predicate columns ordered according to the order of the join predicate columns in the index. The multi-column table then follows the last satellite table in the join order.
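The index-driven ordering and the cost-based choice among several indexes summarized above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function names, the index names IX_TPG and IX_GT, the table names, the row counts, and in particular the toy cost function are all hypothetical. Satellite tables whose join columns do not appear in an index are simply left out of that index's ordering, since the summary above does not say how they are placed.

```python
def join_order_from_index(index_columns, join_predicate_columns,
                          satellite_for_column, fact_table):
    """Order the satellite tables by the position of their join predicate
    columns in the index, then place the multi-column (fact) table last."""
    order = [satellite_for_column[col]
             for col in index_columns
             if col in join_predicate_columns]
    return order + [fact_table]


def choose_index(indexes, join_predicate_columns, satellite_for_column,
                 fact_table, estimate_cost):
    """Derive one join order per usable index and return the
    (index name, join order) pair with the lowest estimated cost."""
    candidates = {}
    for name, columns in indexes.items():
        # An index is usable only if it covers at least one join predicate column.
        if any(col in join_predicate_columns for col in columns):
            candidates[name] = join_order_from_index(
                columns, join_predicate_columns, satellite_for_column, fact_table)
    if not candidates:
        return None
    best = min(candidates, key=lambda name: estimate_cost(candidates[name]))
    return best, candidates[best]


# Hypothetical inputs modeled loosely on FIG. 1.
satellite_for_column = {"TID": "TIME", "PID": "PRODUCT", "GID": "GEO"}
indexes = {"IX_TPG": ["TID", "PID", "GID"], "IX_GT": ["GID", "TID"]}
join_predicate_columns = {"TID", "GID"}   # the query has predicates on TIME and GEO
qualifying_rows = {"TIME": 12, "GEO": 3}  # made-up counts of qualifying dimension rows


def toy_cost(order):
    """Illustrative cost only: sum of the running products of qualifying rows
    over the satellite tables. This is NOT the cost model of the embodiments."""
    cost, running = 0, 1
    for table in order[:-1]:  # the fact table occupies the last position
        running *= qualifying_rows[table]
        cost += running
    return cost


selected = choose_index(indexes, join_predicate_columns,
                        satellite_for_column, "FACT", toy_cost)
```

With these inputs only two join orders are costed, one per index, rather than every permutation of the tables.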
Preferred embodiments provide a heuristic-type program to determine join orders based on the indexes on the fact table. A cost estimate is then made for each of the join orders based on the indexes. The best cost join order is then selected to perform the query join. Preferred embodiments utilize a goal-oriented heuristic algorithm to select different join orders to consider. Prior art methods, on the other hand, often consider the cost of all combinations of join orders. Preferred embodiments conserve substantial processing time by cost analyzing far fewer join orderings than current methods, which consider many combinations of the star joins. With the preferred ordering technique, the fact table or large multi-column table is only accessed once for any given query join. This avoids the need to use exhaustive query evaluation search techniques that consider many possible join orderings in the search space to select a best performing join ordering. Preferred embodiments, on the other hand, ensure a result where the fact table is joined once, or some other minimal number of times, and wherein indexes are used to determine the ordering of the joins.
Because preferred embodiments consider the cost of a limited number of join orders, the preferred embodiments avoid long query compilation time and minimize storage usage during query compilation.