1. Technical Field
This invention relates to relational database systems. More specifically, the invention relates to optimizing snow flake queries.
2. Description of the Prior Art
Relational database systems store large amounts of data, including business data that can be analyzed to support business decisions. Data records in a relational database management system in a computing system are maintained in tables, which are a collection of rows all having the same columns. Each column maintains information on a particular type of data for the data records which comprise the rows. Keys can be defined on a column or set of columns if the column or set of columns can uniquely identify each row of the table. In particular, one primary key can be defined on each table, which represents the primary identifier for each row of the table. A foreign key can be defined on one table which refers to the primary key of another table. A table can have multiple foreign keys defined, each of which refers to the primary key of a different table. Two tables can be joined together via the primary key—foreign key relationship, wherein this join will bring information from these two tables together.
Organizations are known to archive data in a data warehouse, which is a collection of data designed to support management decision making. Data warehouses contain a wide variety of data that present a coherent picture of business conditions at a single point in time. One data warehouse design implementation is known as a “star schema”, also known as multidimensional modeling. The basic premise of star schema is that information is classified into two groups, facts and dimensions. A fact table comprises the main database records concerning the organization's key transactions, such as sales data, purchase data, investment returns, etc. Dimensions are tables that maintain attributes about the data in the fact table. Each dimension table has a primary key column that corresponds to a foreign key column in the fact table.
The star schema provides a view of the database on dimension attributes that are useful for analysis purposes. This allows users to query on attributes in the dimension tables to locate records in the fact table. A query would qualify rows in the dimension tables that satisfy certain attributes or join conditions. The qualifying rows of the dimension tables have primary keys that correspond to foreign keys in the fact table. A join operation is then specified to qualify rows of the fact table. Typically, the primary key columns of the dimension tables are compared against the corresponding foreign key columns in the fact table to produce the results.
FIG. 1 is a prior art diagram (10) of a star schema. The schema is “star like” with a central fact table (12), and peripheral dimension tables (22), (32), and (42). The central fact table (12) includes sales data, wherein each record includes information on the amount sold in the AMOUNT column (20), the time of the sale in the TIME_ID column (14), the product sold in the PROD_ID column (16), and the geographic region of the sale in the NATION_ID column (18). The dimension tables (22), (32), and (42) provide attributes on the TIME_ID column (14), PROD_ID column (16), and NATION_ID column (18) in the fact table. The primary key columns of each of the dimension tables (22), (32), and (42) are the TIME_ID column (24), PROD_ID column (34), and NATION_ID column (44), respectively. The columns (14), (16), and (18) in the fact 25 table (12) are foreign keys that correspond to primary keys (24), (34), and (44) of the dimension tables (22), (32), and (42) that provide attributes on the data in the fact table (12). Accordingly, a join operation of the fact table (12) and the nation dimension table (42) with the NATION_ID as the join key will bring necessary information from the nation dimension table (42) together with information from the fact table (12).
A “snow flake schema” is an extension of the star schema where there can be multiple levels of dimension tables. FIG. 2 is a prior art diagram (60) of a snow flake schema. As in the star schema of FIG. 1, there is a central fact table (12), and peripheral dimension tables (22), (32), and (42). In addition, there is a second level of dimension tables (62) and (72) which provide attributes on the MANU_ID column (36) of dimension table (32) and the REGION_ID column (46) of dimension table (42). A level for a table in a snow flake schema is defined as the number of join operations it needs to reach the fact table. As shown in FIG. 2, the fact table is at level 0, dimension tables (22), (32), and (42) are at level 1, and dimension tables (62) and (72) are at level 2. The primary key columns of the second level dimension tables (62), (72) are the MANU_ID column (64) and REGION_ID column (74), respectively. Column (36) in table (32) and column (46) in table (42) are foreign keys that correspond to primary keys (64) and (74) of dimensional tables (62) and (72), respectively. Accordingly, a join operation of the second level dimension tables (62) and (72), with the first level dimension tables (32) and (42), followed by a join operation with the fact table (12) will bring information from the two levels of dimension tables together with information from the fact table (12).
The fact table in a relational database is the largest table in the structure. As such, processing the fact table is resource intensive when compared to processing other tables.
A query in a star schema or a snow flake schema usually has predicate filters on dimension tables and not on the fact table. A common technique for processing a query is to “push down” filter selectivity from dimension tables to the fact table. In a snow flake schema, filter selectivity from higher level dimension tables can be pushed down to the fact table via intermediate push down to lower level dimension tables. Predicate filters on a dimension table only select a subset of rows from the dimension table, and the technique of push down enables processing of only rows from the fact table corresponding to the qualifying rows from the dimension table. However, one must balance the cost of conducting a push down and the benefit of this push down to determine whether a push down from this dimension table to the fact table will be beneficial. The cost benefit analysis of conducting a push down depends on the selectivity of the predicate filter of the dimension table. A small selectivity indicates that a small number of rows will qualify based on this filter. Accordingly, the smaller the selectivity the dimension table has, the more beneficial it is to push down this filter selectivity to the fact table.
Much effort has been expended in developing optimization techniques for conducting queries in a snow flake schema. FIG. 3 is a prior art flow chart (100) illustrating a known method for a query in a snow flake schema. The initial step in the process is to input a query into the database (102). Following the input of the query into the database, a test is conducted to determine if the query is a snow flake query (104). If the query is not a snow flake query, an alternative optimization technique will be conducted (106). However, if the query is a snow flake query, a level is assigned to each table in the query (108). The initial level for the query optimization process is set to one less than the maximum number of levels of tables in the query (110). Thereafter, for each table at the current level the following steps are followed (112): the current table is assigned as the parent table (114), and all child dimension tables are ordered for the current parent table based on selectivity (116). Following the ordering of the child dimension tables, for each child dimension table of the current parent table the following steps are followed (118): a test is conducted to determine if the child dimension table should be pushed down based on selectivity (120), a positive response to the test at step (120) will result in the marking of the child dimension table for push down (122), followed by a subsequent test to determine if the pushed down child dimension table is the last child dimension table of the current parent table (124). A negative response to the test at step (124) will result in proceeding to the next child dimension table for the current parent table (126) followed by a return to step (120). However, a positive response to the test at step (124) as well as a negative response to the test at step (120) will exit the loop initiated at step (118) and will result in a subsequent test to determine if the parent table acted upon at step (114) is the last table at the specified level in the query (128). A negative response to the test at step (128) will proceed to the next table at the current level (130) followed by a return to step (114). However, a positive response to the test at step (128) will result in proceeding to the next lower level in the query structure (132). Thereafter, a test is conducted to determine if the next lower level in the query structure is less than zero (134), which is an indication that all tables in the query have been exhausted. A negative response to the test at step (134) will return to step (112), whereas a positive response to the test at step (134) will complete the query optimization process (136). Accordingly, the above outlined process for a query optimization in a snow flake schema initiates the query optimization at the peripheral nodes of the query structure and proceeds toward the central node, also known as the fact table.
As shown in FIG. 3, the prior art method for query optimization in a snow flake schema conducts the node selection process for a query at the peripheral nodes. However, this method is not efficient in the selection process in that it requires selection of each child dimension table on an individual basis. Accordingly, what is desirable is a time and resource efficient system and method for optimizing a query in a snow flake schema.