Computer systems are typically used to process data represented internally as one or more data objects. Examples of data objects include database objects in a database management system as well as images, audio tracks, flat files, and documents in various storage formats known in the art. Publicly available formats for document storage known in the art include the Hypertext Markup Language (HTML) and the extensible markup language (XML). Some data processing tasks involve operations that combine or “join” data from several data objects. In some circumstances the join can consume considerable computational resources, such as computer memory, processing cycles on a central processing unit (CPU), storage, and communication channel bandwidths. The use of computational resources can sometimes be reduced if one or more portions of a large data object can be eliminated from the join steps of an operation.
A common context in which the consumption of resources during a join step is severe is in a query for data from multiple database tables that are related according to a star schema. A table is a data object that represents data in rows and columns. Each column represents an attribute or field with a certain kind of data that has a particular form or meaning or both. Each row represents a record with a value for each of the columns, although some values may be undefined values indicated by the symbol NULL. In a star schema, a database table with a large number of rows, called a “fact table,” contains much of the data in a database, and several database tables with many fewer rows, called “dimension tables,” each contain ancillary information about the entries for a corresponding column in the fact table. The entry in the corresponding column of the fact table is the value for a foreign key that uniquely identifies one row in the dimension table associated with that column. The values in other columns of the row in the dimension table provide the ancillary information associated with a key value. A star schema is typically used to avoid repeating information in the dimension tables with every row in the fact table.
For example, a fact table called “Sales” has a row for each sale made by a large retailer of electronic components. The Sales table can be expected to accumulate millions of rows. These many sales are made at relatively few stores. The location, manager, and phone number of the stores are of interest, but it is not desirable to repeat the store street address, city, state, zip code, phone number and manager information of each store for the tens of thousands of sales made at that store. Instead, a unique store identifier (store ID) is defined as a primary key in a column called “stores_id” of a dimension table called “Stores”; and, for each distinct store, a row is added to the Stores table with a unique value in the store_id column. Each row also contains values in other columns that represent street address, city, state, zip code, phone number and manager of the store. The store_id column in the Stores table may be referenced as “Stores.store_id.”
In this example, a single foreign key column in the Sales table is defined that holds the value of the store_id for the store where the sale is made. For simplicity of illustration the column is also called “store_id,” but may have a different name in other circumstances. The column may be referenced as “Sales.store_id.” The data sharing a row in the Stores table with a particular value of the store_id is also associated with every row of the Sales table that has that same value in the Sales.store_id column. The size of the Sales table and Stores table added together is much smaller than the size the Sales table would be if multiple columns describing each store were included in each row of the Sales table; yet the database holds the same amount of information.
A star query joins data from a fact table and one or more of the dimension tables. For example, a query requesting columns from the Sales table and Stores table for sales made in Indiana during July, involves both the information in the Stores table and the information in the Sales table and the information in the Time table. It is a star query that joins tables. A database management system would join the three tables to determine which rows of the Sales table are associated with stores in Indiana and times in July and what values of the Stores table and Time table are associated with each of those rows.
Because there are so many rows in the Sales table, it is desirable to avoid scanning every row of the Sales to respond to this query. It is preferable to identify a subset of the rows of the Sales table and scan only that subset. Because a join step consumes so many resources, it is desirable to avoid joining every row scanned. It is preferable to identify a subset of the rows scanned from the Sales table and involve only that subset in a join with the Stores table.
In one past approach, a join index is generated that identifies rows in the fact table that have a certain value in the dimension table. If a particular join index is generated with the Stores.state column as the index key, the join index can be used to identify rows in the Sales table that have the value of “Indiana.” Then the subset of rows can be joined to the Stores table, greatly reducing the consumption of computational resources. A disadvantage of this approach is that it consumes too many computational resources to form a join index for every column used in all possible star queries, and it is hard to predict which subset of columns are included in the criteria that will be used by future star queries. Consequently, there are very likely many queries that do not benefit from a pre-existing join index. It is not a solution to form the join index in response to receiving the particular query, because the resources consumed to form the join index probably exceed the resources consumed to scan the entire Sales table.
In another past approach, the database management system uses partitioning to avoid scanning every row in a table in some circumstance. With partitioning, a data object, such as a table, is divided up into multiple sub-tables referred to as “partitions” that are stored and indexed separately. A row is assigned to one partition or another based on the value of one or more columns that constitute a partitioning key. The database management system transforms the star query to include additional sub-queries on a single table that allow the database management system to avoid scanning any rows in one or more partitions of the Sales table. This approach is described in more detail in a patent application Ser. No. 09/795,904, filed on Feb. 27, 2001, by Hakan Jakobsson, Mohammed Zait and Benoit Dageville, titled “Method and Mechanism for Partition Pruning.”
For example, if the Sales table is partitioned with Sales.store_id as the partitioning key, rows for sales in the same store are placed in one partition of the Sales table and rows for each different store are stored in a corresponding different partition. Using this past partition pruning approach, the database management system transforms the star query requesting columns from the Sales table and Stores table for sales made in Indiana. The transformed query includes an additional sub-query to take advantage of the partitioning. The additional sub-query uses the information in one dimension table to avoid adding another join. The sub-query takes the form of selecting from the Stores table only values from the Stores.store_id column where the Stores.state column has a value that indicates “Indiana.” This portion of the query can be performed without a join step and produces a list of store_id values for stores in Indiana. A value for the Sales.store_id is then chosen from this list of Stores.store_id values. The database management system recognizes that the Sales table is partitioned by Sales.store_id; and, the database management system forms an execution plan that only scans the partitions corresponding to the chosen values of Sales.store_id. In many cases, one or more partitions of the Sales table can be eliminated from the join step of the query execution.
A disadvantage of this approach is that it is often costly to perform the sub-queries, especially where the table is partitioned on multiple keys for multiple dimension tables and multiple sub-queries are involved. If the number or size, or both, of the partitions avoided are not great enough, the extra cost in computational resources to plan and execute the sub-queries exceeds the reduction in cost by avoiding the few partitions.
In another approach, the costs of generating and performing the sub-queries and the benefits of avoiding joins with any partitions are estimated. The database management system does not generate the sub-queries when the cost of doing so is estimated to exceed the benefits. A problem with this approach is that it is difficult to estimate either the costs or benefits of using the sub-queries before the execution of the query. Computational resources are consumed to make the estimates; yet, the estimates are sometimes not accurate and cannot be relied upon in all circumstances.
Based on the problems and disadvantages described above, there is a clear need for techniques that reduce the portions of a data object that are included in joins for operations that involve several data objects.
In particular, there is a need for reducing the number of rows or partitions of a fact table that are joined with dimension tables during the planning and execution of a star query in a database management system.
The past approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.