Computer systems typically comprise a combination of computer programs and hardware, such as semiconductors, transistors, chips, circuit boards, storage devices, and processors. The computer programs are stored in the storage devices and are executed by the processors. Fundamentally, computer systems are used for the storage, manipulation, and analysis of data.
One mechanism for managing data is called a database management system (DBMS) or simply a database. Many different types of databases are known, but the most common is usually called a relational database (RDB), which organizes data in tables that have rows, which represent individual entries, tuples, or records in the database, and columns, fields, or attributes, which define what is stored in each entry, tuple, or record. Each table has a unique name within the database and each column has a unique name within the particular table. The database also has one or more indexes, which are data structures that inform the DBMS of the location of a certain row in a table given an indexed column value, analogous to a book index informing the reader of the page on which a given word appears.
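The book-index analogy above can be illustrated with a minimal sketch. It assumes a table held in memory as a list of rows; the `employees` table and the `build_index` and `lookup` helpers are hypothetical names introduced only for this example, and an actual DBMS would typically use a more elaborate on-disk structure.

```python
from collections import defaultdict

# Hypothetical table: a list of rows, each row a dict keyed by column name.
employees = [
    {"id": 1, "name": "Ada", "dept": "ENG"},
    {"id": 2, "name": "Bob", "dept": "OPS"},
    {"id": 3, "name": "Cy", "dept": "ENG"},
]


def build_index(table, column):
    """Map each value of `column` to the positions of the rows that hold it."""
    index = defaultdict(list)
    for position, row in enumerate(table):
        index[row[column]].append(position)
    return index


def lookup(table, index, value):
    """Fetch matching rows through the index instead of scanning every row."""
    return [table[position] for position in index.get(value, [])]


dept_index = build_index(employees, "dept")
eng_rows = lookup(employees, dept_index, "ENG")
```

Here `dept_index` plays the role of the book index: given an indexed column value, it yields the row locations directly.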
The most common way to retrieve data from a database is through statements called database queries, which may originate from user interfaces, application programs, or remote computer systems, such as clients or peers. A query is an expression evaluated by the DBMS, in order to retrieve data from the database that satisfies or meets the criteria or conditions specified in the query. Although the query requires the return of a particular data set in response, the method of query execution is typically not specified by the query. Thus, after the DBMS receives a query, the DBMS interprets the query and determines what internal steps are necessary to satisfy the query. These internal steps may comprise an identification of the table or tables specified in the query, the row or rows selected in the query, and other information such as whether to use an existing index, whether to build a temporary index, whether to use a temporary file to execute a sort, and/or the order in which the tables are to be joined together to satisfy the query. When taken together, these internal steps are referred to as an execution plan. The DBMS often saves the execution plan and reuses it when the user or requesting program repeats the query, which is a common occurrence, instead of undergoing the time-consuming process of recreating the execution plan.
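The reuse of a saved execution plan can be sketched as a simple cache keyed by the query text. This is an illustration under stated assumptions, not how any particular DBMS stores plans: `plan_cache`, `build_plan`, and `get_plan` are hypothetical names, and the plan contents are placeholders.

```python
# Hypothetical plan cache: query text -> previously built execution plan.
plan_cache = {}
plans_built = 0  # counts how often the expensive planning step runs


def build_plan(query_text):
    """Stand-in for the time-consuming planning step (placeholder contents)."""
    global plans_built
    plans_built += 1
    return {"query": query_text, "steps": ["identify tables", "choose join order"]}


def get_plan(query_text):
    """Return the saved plan for a repeated query; build one only on a miss."""
    plan = plan_cache.get(query_text)
    if plan is None:
        plan = build_plan(query_text)
        plan_cache[query_text] = plan
    return plan


first = get_plan("SELECT * FROM A JOIN B ON A.k = B.k")
second = get_plan("SELECT * FROM A JOIN B ON A.k = B.k")
```

The second call finds the plan already in the cache, so the planning step runs only once for the repeated query.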
Many different execution plans may be created for any one query, each of which would return the same data set that satisfies the query, yet the different execution plans may provide widely different performance. Thus, the execution plan selected by the DBMS needs to provide the required data at a reasonable cost in terms of time and hardware resources. Hence, the DBMS often creates multiple prospective execution plans and then chooses the best, fastest, or least expensive one to execute. One factor that contributes to the cost of a particular execution plan is the number of rows that the execution plan, when executed, processes from the database tables. One important aspect that influences the number of rows processed is the join order of the tables. In response to a query that requests data from multiple tables, the DBMS joins rows from these multiple tables (the rows are often concatenated horizontally into a result set), in order to find and retrieve the data from all the tables. Thus, a join operation is a relationship between two tables accessed by a query (a join query), and a join operation is performed to connect (or join) data from two or more tables, wherein the DBMS joins rows with particular attributes together to form a new row that the DBMS saves to the result set. The join order is typically specified by the execution plan and is the order in which the DBMS performs join operations when the DBMS executes the query via the execution plan, to retrieve and join rows of data from the database tables into the result set.
Join operations are typically implemented using a nested loop algorithm, where the resultant new rows from the first two tables in the join order are joined to the rows from the third table, and those results are joined to the fourth table, etc. Eventually all of the needed join operations are complete, and the resultant new rows are stored to the result set that satisfies the query. Because a single join is limited to accessing two tables, multi-table joins are performed in sequence according to a particular order. Many join queries may be implemented by joining the tables in any of several possible join orders. For example, a query that involves joining tables A, B, and C may be performed as a join of tables A and B followed by a join of that intermediate result to table C. Alternatively, the same query may be performed as a join of tables A and C followed by a join of that intermediate result to table B. The DBMS attempts to select a join order that eliminates the greatest number of rows from the potential result set early in the join processing, which saves the costs associated with repeatedly accessing tables later in the join operation.
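The two join orders for tables A, B, and C can be sketched with a simple nested loop join. The table contents, column names, and helper functions below are assumptions made up for illustration; the sketch only shows that both orders yield the same result set while producing intermediate results of different sizes.

```python
def nested_loop_join(left_rows, right_rows, predicate):
    """Join two row lists: for each left row, scan every right row."""
    result = []
    for left in left_rows:
        for right in right_rows:
            if predicate(left, right):
                # Concatenate the matching rows horizontally into a new row.
                result.append({**left, **right})
    return result


# Hypothetical tables; each row is a dict keyed by column name.
A = [{"a_id": 1, "k": 10}, {"a_id": 2, "k": 20}]
B = [{"b_id": 7, "k_b": 10}, {"b_id": 8, "k_b": 10}, {"b_id": 9, "k_b": 20}]
C = [{"c_id": 5, "k_c": 20}]


def matches_b(left, right):
    return left["k"] == right["k_b"]


def matches_c(left, right):
    return left["k"] == right["k_c"]


# Join order 1: A joined to B, then that intermediate result joined to C.
ab = nested_loop_join(A, B, matches_b)
abc = nested_loop_join(ab, C, matches_c)

# Join order 2: A joined to C, then that intermediate result joined to B.
ac = nested_loop_join(A, C, matches_c)
acb = nested_loop_join(ac, B, matches_b)
```

With this sample data, the intermediate result `ab` holds three rows while `ac` holds one, yet `abc` and `acb` contain the same single joined row.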
The DBMS often evaluates certain characteristics about the tables A, B, and C, in an attempt to determine the best join order for the query. In particular, during runtime, one join operation may have a high fan-out rate in which each row of table A matches multiple rows in table B. If this join is performed first, then each of these matching rows will need to be joined to table C, thereby requiring a significant number of intermediate operations. Conversely, the other join operation may have a high fan-in rate in which each row of table A matches very few or zero rows in table C. If this join operation is performed first, then only a few rows need to be joined with table B, thereby saving a number of intermediate operations.
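The effect of fan-out and fan-in on the choice of join order can be sketched with simple arithmetic. All figures below are assumed for illustration (a 1000-row table A, five matching rows in B per row of A, and one matching row in C per hundred rows of A), and `intermediate_rows` is a hypothetical helper, not a formula taken from any particular DBMS.

```python
from fractions import Fraction


def intermediate_rows(outer_rows, matches_per_row):
    """Estimate rows produced by a join from the average matches per outer row."""
    return outer_rows * matches_per_row


rows_in_a = 1000                    # assumed size of table A
fan_out_a_to_b = 5                  # assumed: each A row matches 5 rows in B
fan_in_a_to_c = Fraction(1, 100)    # assumed: 1 in 100 A rows matches a C row

# Rows that would still need to be joined to the remaining table.
candidates = {
    "A-B first": intermediate_rows(rows_in_a, fan_out_a_to_b),
    "A-C first": intermediate_rows(rows_in_a, fan_in_a_to_c),
}

# Prefer the order that leaves the fewest intermediate rows.
best_order = min(candidates, key=candidates.get)
```

Under these assumed rates, joining A to C first leaves far fewer rows to carry into the second join than joining A to B first.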