Computer systems typically comprise a combination of hardware, such as semiconductors and circuit boards, and computer programs. Fundamentally, computer systems are used for the storage, manipulation, and analysis of data, and the overall value or worth of a computer system depends largely upon how well the computer system stores, manipulates, and analyzes data. One mechanism for managing data is called a database management system (DBMS), which may also be called a database system or simply a database.
Many different types of databases are known, but the most common is usually called a relational database (RDB), which organizes data in tables that have rows, which represent individual entries, tuples, or records in the database, and columns, fields, or attributes, which define what is stored in each entry, tuple, or record. Each table has a unique name within the database and each column has a unique name within the particular table. The database also has an index, which is a data structure that informs the database management system of the location of a certain row in a table given an indexed column value, analogous to a book index informing the reader on which page a given word appears.
To be useful, the data stored in databases must be capable of being retrieved in an efficient manner. The most common way to retrieve data from a database is through statements called database queries, which may originate from user interfaces, application programs, or remote computer systems, such as clients or peers. A query is an expression evaluated by the database management system, in order to retrieve data from the database that satisfies or meets the criteria or conditions specified in the query. Although the query requires the return of a particular data set in response, the method of query execution is typically not specified by the query. Thus, after the database management system receives a query, the database management system interprets the query and determines what internal steps are necessary to satisfy the query. These internal steps may comprise an identification of the table or tables specified in the query, the row or rows selected in the query, and other information such as whether to use an existing index, whether to build a temporary index, whether to use a temporary file to execute a sort, and/or the order in which the tables are to be joined together to satisfy the query.
When taken together, these internal steps are referred to as an execution plan, a query plan, or just a plan. The execution plan is typically depicted as a tree graph and is typically created by a component of the database management system that is often called a query optimizer. The query optimizer may be part of the database management system or separate from, but in communication with, the database management system. When a query optimizer creates an execution plan for a given query, the execution plan is often saved by the database management system in the program object, e.g., the application program, that requested the query. The execution plan may also be saved in an SQL (Structured Query Language) package or an execution plan cache. Then, when the user or program object repeats the query, which is a common occurrence, the database management system can find and reutilize the associated saved execution plan instead of undergoing the expensive and time-consuming process of recreating the execution plan. Thus, reusing execution plans increases the performance of queries when performed by the database management system.
Many different execution plans may be created for any one query, each of which would return the same data set that satisfies the query, yet the different execution plans may provide widely different performance. Thus, the execution plan selected by the database management system needs to provide the required data at a reasonable cost in terms of time and hardware resources. Hence, the query optimizer often creates multiple prospective execution plans and then chooses the best, fastest, or least expensive one, to execute.
One factor that contributes to the cost of a particular execution plan is the number of rows or tuples that the execution plan, when executed, returns from the database tables. One important aspect that influences the number of tuples returned is the join order of the tables. In response to a query that requests data from multiple tables, tuples from these multiple tables are joined (the tuples are often concatenated horizontally in a result set), in order to find and retrieve the data from all the tables. Thus, a join operation is a relationship between two tables accessed by a query (a join query), and a join operation is performed to connect (or join) data from two or more tables, wherein tuples with matching attributes are joined together to form a new tuple. The join order is typically specified by the execution plan and is the order in which the database management system performs join operations when the database management system executes the query via the execution plan to retrieve and join rows of data from the database tables into the result set.
Join operations are typically implemented using a nested loop algorithm, where the resultant new tuples from the first two tables in the join order are joined to the resultant tuples from the third table, and those results are joined to the fourth table, etc. Eventually all of the needed join operations are complete, and the resultant new tuples are the result set that satisfies the query.
Because a single join is limited to accessing two tables, multi-table joins are performed in sequence according to a particular order. Many different join queries may be implemented by joining the tables in any of several possible join orders. For example, a query that involves joining tables A, B, and C can often be performed as a join of table A and B followed by a join of table A and C. Alternatively, in many instances, the same query can be performed as a join of table A and C followed by the join of table A and B. The query optimizer attempts to select a join order that will eliminate the greatest number of records from the potential result set early in the join processing, which saves the costs associated with repeatedly accessing tables later in the join operation.
Conventional query optimizers evaluate certain characteristics about the tables A, B, and C, in an attempt to determine the best join order for the query. In particular, during runtime, one join operation may have a high fan-out rate in which each record of table A matches multiple records in table B. If this join is performed first, then each of these matching records will need to be joined to table C, thereby requiring a significant number of intermediate operations. Conversely, the other join operation may have a high fan-in rate in which each record of table A matches very few or zero records in table C. If this join operation is performed first, then only a few records need to be joined with table B, thereby saving a number of intermediate operations.