The present invention relates to optimizing database queries, and in particular to using one or more join indexes to eliminate one or more table join operations.
In a database management system (DBMS), data is stored in one or more data containers. Each container contains records. The data within each record is organized into one or more fields. In relational database systems, the data containers are referred to as tables, the records are referred to as rows, and the fields are referred to as columns.
In typical database systems, users store, update and retrieve information by submitting commands to a database server. To be correctly processed, the commands must comply with the database language that is supported by the database server. One popular database language is known as Structured Query Language (SQL). A command that retrieves data from the database is called a query.
A join is a query that combines rows from two or more tables. A join is performed whenever multiple tables appear in an SQL query""s FROM clause. The columns of data retrieved from the tables are listed in the SQL query""s SELECT clause. The query""s SELECT list can include any columns from any of the base tables listed in the FROM clause. In SQL, a condition predicate (xe2x80x9ca predicatexe2x80x9d), which is a condition that must be satisfied by the rows from which the data are retrieved, may be specified in a WHERE clause. Most join queries contain WHERE clause conditions that compare two columns, each from a different table. Such a condition is called a join condition. To execute a join, the DBMS combines pairs of rows for which the join condition evaluates to TRUE, where each pair contains one row from each table.
To execute a join of three or more tables, the DBMS often joins two of the tables based on the join conditions comparing their columns and then joins the result to another table based on join conditions containing columns of the joined tables and the new table. The DBMS continues this process until all tables are joined into the result. Joins involving many tables, or a table with many rows, can quickly explode into a very large number of row combinations. Such joins can cause a database server to consume considerable computational resources.
To improve performance of queries involving joins, a join index may be formed. A database index is conceptually similar to a normal index found at the end of a book, in that both kinds of indexes comprise a list of information accompanied with the location of the information. Values in one or more columns of a table are stored in an index along with data indicating one or more rows in the table having those values. The index is often maintained separately from the actual database table. A join index is an index on a table T for a column of a different table D related to table T through a join. The column from table D is the index key column. The index key column is typically different from the column or columns used in the join condition. The join index stores data uniquely indicating rows of T that are associated with each value of the index key column, subject to the join conditions that are used in the index definition.
Join indexes can be used to greatly reduce the volume of data that must be joined to satisfy a query by reducing the number of rows of T that are joined. For example, if the table T has 100,000 rows equally distributed across 10 values for the index key column, then a join that constrains the index key column to have a particular value, will list only 10,000 rows in the index associated with the particular value. Then only those 10,000 rows of table T are joined to the table D.
Multiple join indexes can reduce the volume of data that must be joined even further. For example, rows of table T may be associated with values in an index key column of table D1 in a first join index and with values in a index key column of table D2 in a second join index. If table T has its 100,000 rows equally distributed across the 10 values for the index key column in both the first join index and the second join index, then a join which constrains the index key columns of both join indexes to have respective particular values will only list 10,000 rows from each join index with the respective particular values. The list of 10,000 rows of table T from the first join index and the list of 10,000 rows of table T from the second join index are compared, and only rows that appear in both lists satisfy both constraints. If the lists are uncorrelated, only 1000 rows will appear in both lists, and only 1000 rows of table T will be joined back to tables D1 and D2 to perform the query.
Although computational savings are achieved using multiple join indexes, the computational resources still consumed can be extensive.
Based on the foregoing, there is a clear need to eliminate join operations whenever possible.
Techniques are provided for processing, in a database management system, a query that specifies a join between multiple tables of a database. According to one aspect of the invention, the query specifies a join between a first table and a second table, and the techniques include determining whether a join should actually be performed between the first table and the second table. The determination is based on (1) a particular set of one or more columns of the second table referenced in the query and (2) a join index on the first table for an index key column from the second table. If it is determined that the join should not be performed, the query is processed without performing the join between the first table and the second table.
According to an embodiment of this aspect, the step of determining whether the join should be performed includes determining whether a query execution operation preserves an association between a row in the first table and a value of the index key column.
According to another embodiment, the step of determining whether the join should be performed includes determining whether the query execution operation selects every row of the first table that satisfies all conditions in the query on the particular set of columns and selects no other row.
These techniques allow substantial savings in the consumption of computational resources to be achieved by avoiding one or more joins when possible.