A database system is a set of computer programs used by a user to store, retrieve, and manipulate data. Such data is often limited to text (letters, numerals, symbols, and other characters), but may include any data that may be stored by a computer. Most database systems store data in tables. A table is a series of rows, also called records. Each row contains data about a particular thing, such as a merchandise order. Data is typically retrieved from a table one row at a time. The rows are divided into columns. The intersection of a row and column is referred to as a field. Each column specifies a particular type of data that is contained in each field of the column. Each field contains the data of the particular type for the intersecting row and column.
FIG. 1 is a table diagram showing the contents of a sample Employees table containing data about each of several employees. The sample Employees table 100 has nine rows, each containing information about a different employee. The Employees table has columns as follows: an Employee ID column 101 uniquely identifying each employee, a Last Name column 102 containing each employee's last name, a First Name column 103 containing each employee's first name, and a Region column 104 which contains the region to which each employee is assigned. The Employee table also contains a bookmark column 110, which contains a value uniquely identifying each row in the table that can be used to quickly locate and retrieve the row. Such a value is known as a bookmark.
One way in which many databases can manipulate stored data is by joining two different tables. Joining is an operation in which a database generates a result table containing rows that are each a combination of a row from each of two existing tables. The existing tables that are joined are called source tables. A row of one source table is typically combined with a row the other source table if the rows have matching values in a join column that is part of each source table.
The combined rows of the source tattles form the result table. The result table formed by the join operation can be either an actual table containing the data from each of the joined rows of the source tables or a virtual table whose rows are generated by reading from the source tables on demand to satisfy query requests directed to the result table.
As an example, the Employees table described above may be joined to an Orders table, in which each row represents one sales order taken by one of the employees listed in the Employees table. FIG. 2 is a table diagram showing the contents of the Orders table. The Orders table 200 contains the following columns: an Order ID column 201 uniquely identifying each order, an Employee ID column 202 identifying the employee who took each order, and an Order Amount column 203 containing the dollar amount of each order. A bookmark column 210 is also shown that contains a bookmark for each row. Because both the Employees table and the Orders table contain an Employee ID column, the Employee ID column of each table can be used to join the two tables. The join produces a result table, each row of which corresponds to a row of the Orders table whose Employee ID field contents matches those of the Employee ID field of some row of the Employees table. This result table can be used to determine the name of the employee who took each order.
FIG. 3 is a table diagram showing the contents of the result table formed by joining the Employees table with the Orders table. The EmployeesJoinOrders table 300 contains the following columns: an Order ID column 301 and an Order Amount column 302 correspond to the Order ID column 201 and the Order Amount column 203, respectively, of the Orders table; an Employee ID column 303 corresponds to both the Employee ID column 202 of the Orders table and the Employee ID column 101 of the Employees table; and a Last Name column 304, a First Name column 305, and a Region column 306 correspond to the Last Name column 102, a First Name column 103, and a Region column 104, respectively, of the Employees table. A bookmark column 310 is also shown that contains a bookmark for each row. It can be readily seen that row I of the EmployeesJoinOrders table has been joined from row 1 of the Orders table (as the Order ID field of each contains "10000") and row 6 of the Employees table (as the Employee ID field of each contains "6").
Existing techniques for joining two or more tables have significant shortcomings. These include consuming inordinate levels of file retrieval resources and the inability to accommodate restrictions on the rows that are joined for each source table, as described below.
Two different joining techniques are commonly used to join two tables as described above. A first joining technique involves, for each row of the first source table, reading the row, extracting the join field from the read row, and searching the entire second source table for rows whose join field matches the join field extracted from the read row of the first source table. The first technique has the disadvantage that it proceeds very slowly, because it requires intensive data retrieval from the source tables. Because database systems are commonly compared on the basis of the speed with which they perform typical operations, the slowness of the first technique is a significant disadvantage.
Though the first joining technique is very slow, it is flexible: because it reads every row of both source tables, it can easily be adapted to apply restrictions during the join operation. A restriction is a condition based on the columns of at least one of the source tables that must be true in order for a row of the source table to be included in the join. As an example, a join between the Employees table and the Orders table could be restricted by requiring that rows from the Employees table have "NW" in the Region field and that rows from the Orders table have an amount in the Order Amount field greater than "$1,000.00". In this way, the result table is restricted to orders greater than "$1,000.00" taken by employees in the "NW" region.
FIG. 4 is a table diagram showing the contents of the result table formed by joining the Employees table with the Orders table with restrictions on the Region field and the Order Amount field. The result table 400 has the same columns as the result table 300 shown in FIG. 3. The result table 400 has fewer rows than the result table 300, since every row of the result table 300 in which the Region field does not contain "NW" or the Order Amount is not greater than "$1,000.00" has been excluded by the restriction.
A second joining technique involves maintaining indices on the columns of each table that may at some point be used as join columns. An index on a column is a mapping from the values appearing in the column to the bookmarks for the rows in which that value appears in the column. In the example discussed above, this technique would maintain indices on the Employee ID column of both the Employees table and the Orders Table. The second joining technique joins two tables by merging their indices. That is, each row in the result table corresponds to one possible combination of the bookmarks in rows of the index on the first table having a particular Employee ID field value with bookmarks in rows of the index on the second table having the same Employee ID value. This is accomplished by reading each row of the index on the first table, and seeking into the rows of the index on the second table having the same Employee ID value. A new row is formed in the result table for each combination of the bookmark in the current row of the first index and the bookmarks in the sought-into rows of the second index. Data may then be retrieved from the result table by dereferencing each of the bookmarks stored for that row of the result table.
Indices on the Employee ID column of both the Employees table and the Orders Table are shown in FIGS. 5 and 6 respectively. FIG. 5 is an index diagram of an index on the Employee ID column of the Employees table. The Employees index 500 maps from each unique value of the Employee ID column of the Employees table (shown as "Employees:EmployeeID") to the bookmark for each row of the Employees table that contains that value in its Employee ID field (shown as "Employees:Bookmark"). For example, the appearance of the bookmark "1" in the same row of the Employees index as the Employee ID "503" indicates that row 1 of the Employees table has a "503" in the Employee ID field.
FIG. 6 is an index diagram of an index on the Employee ID column of the Orders table. The Orders index 600 likewise maps from each unique value of the Employee ID column (shown as "Orders:EmployeeID") to the bookmark for each row of the Orders table that contains that value in its Employee ID field (shown as "Orders:Bookmark"). For example, the absence of a row of the Orders index 600 having a "503" in the Employee ID field indicates that no rows of the Orders table has a "503" in the Employee ID field. The appearance of the bookmarks "3", "5", "9", "14", and "17" in rows of the Orders index having a "525" in the Employee ID fields indicates that rows 3, 5, 9, 14, and 17 of the Orders table have a "525" in the Employee ID field. The second joining technique merges the Employees index with the Orders index by looping through the rows of the Employees index, and for each row, locating the row of the Orders Index that has the same Employee ID value as the current row of the Employees Index. The technique then forms a row of the result table for each combination of a bookmark listed in the current row of the Employees index and a bookmark listed in the current row of the Orders index.
FIG. 7 is a table diagram showing the result table from the above-described join operation. Each row of the result table 700 corresponds to a row of each of the source tables that has been joined. Each row therefore contains the bookmark for a row of the Orders table (shown as "Orders:Bookmark") and the bookmark for the row of the Employees table to which that row of the Orders table has been joined (shown as "Employees:Bookmark). For example, the appearance of the Orders:Bookmark "1" and the Employees:Bookmark "6" in the same row of the result table indicates that the join operation joined row 1 of the Orders table with row 6 of the Employees table. The bookmarks contained in any row of the "bookmark-only" result table shown can be used to retrieve the contents of the rows of the source tables referenced by the bookmarks.
Because the second joining technique reads only indices, which are much shorter than data tables, and because indices are optimized for searching, the second joining technique is much faster than the first joining technique, which requires reading and searching data tables that are much longer and that are not search-optimized. However, because the second joining technique does not read data tables, it is unable to apply restrictions on the contents of any column besides the join column. This failure to support restrictions constitutes a serious disadvantage.
A combination of the first joining technique and the second joining technique has been used to join multiple tables. At example of joining multiple tables is joining the Employees table to the Orders table using the Employee ID column of both tables, then the Orders table to an Order Details table using the Order ID column of both tables, then the Order De ails table to a Products table using a Product ID column of both tables. FIG. 8 is a join diagram demonstrating the combination joining technique or multiple tables. It shows the Employees table 810 having an Employee ID column, the Orders table 820 having an Employee ID column and an Order ID column, the Order Details table 830 having an Order ID column and an Product ID column, and the Products table 840 having a Product ID column. The diagram further shows three indices: an index 850 on the Employee ID column of the Employees table, an index 860 on the Order ID column of the Order Details table, and an index 870 on the Product ID column of the Products table. Applying the combined technique to complete this multiple join involves selecting to begin the join with the Orders table. According to the technique, the rows of the Orders table are traversed in order. For each row of the Orders table, indices on the join rows of the immediately adjacent tables, in this case the Employee ID column of the Employees table and the Order ID column of the Order Details table, are used to join the current row of the Orders table 821 to matching rows of the adjacent tables. First, the row of the index on the Employees table that contains the same Employee ID as the current row of the Orders table is located. This row of the index, 851, contains a bookmark for a row of the Employees table that has the same Employee ID as the current row of the Orders table, and to which the current row of the Orders table should be joined. Similarly, the row of the index on the Order Details table that contains the same Order ID as the current row of the Orders table is located. This row of the index, 861, contains a bookmark for a row of the Order Details table that has the same Order ID as the current row of the Orders table, and to which the current row of the Orders table should be joined.
At this point, the current row of the Orders table has been joined to appropriate rows of each of two of the other source tables (the Employees table and the Order Details table), but to a row of the fourth source table, i.e., the Products table. In order to join the Products table, each bookmark in row 861 of the Order Details index must be used to retrieve the corresponding row of the Order Details table. This requires the database system to read rows of the Order Details table in an arbitrary order, a very expensive operation when performed on the order of once for each row of the longest source table. The Product ID field is then extracted from every retrieved row of the Order Details table, e.g., row 831 of the Order Details table, and the row of the index on the Products table that contains the same Product ID as that row of the Order Details table is located. This row of the index, 871, contains a bookmark for a row of the Products table that has the same Product ID as the current row of the Order Details table, and to which the current row of the Order Details table should be joined. In the above-described example, the technique completes this iteration of the join by joining the current row of the Ordered Details table to the row of the Products table referenced by the bookmark contained by this row of the index, and then traversing to the next row of the Orders table to identify another set of rows from the source tables to be joined.
Because this technique requires the expensive operation of reading the data table in arbitrary order for any source tables after the first three that it joins (e.g., the Order Details table), it is inefficient for joining more than three source tables. It, like the second technique for joining two tables discussed above, is further hampered by the fact that, where the technique reads an index on a source table instead of the table itself, it is unable to apply any restrictions to the rows of the table joined.