1. Field of the Invention
The present invention relates generally to computer database systems, and more particularly to systems and methods for executing multi-table joins in response to a query to a computer database for information.
2. Description of the Related Art
When a user requests data from a database by submitting a query to the database, it is often necessary that the database management system (dbms) that is associated with the database combine, i.e., join, data from two or mere tables that contain data. For example, it might be desired to know the names and salaries of employees in a company, by department. Using the database query language known as "SQL", two tables could be created as follows:
"create table DEPT (DNO, NAME, BUDGET)", where DNO is an integer representing the department number, NAME is a variable character alpha-numeric name of the department, and BUDGET is a floating point variable representing the department's budget; and PA1 "create table EMP (DNO integer, NAME varchar, SALARY float)", where DNO is an individual employee's department number, NAME is the employee's name, and SALARY is the employee's salary. PA1 select DEPT.NAME//department name PA1 from DEPT, EMP//the two tables discussed above
Thus, in the above example, the relevant data is arranged in a "department" table and in an "employee" table. In the "department" table, each row has the above-listed three department entries, and in the "employee" table, each row has entries corresponding to a particular employee's department number, name, and salary. To return the employee names and individual salaries by department number requires joining at least one column from the "department" table with at least one column from the "employee" table.
In the particular example discussed above, the department number entry from the "department" table is commonly referred to as a key (it defines a department), whereas the individual employees' department numbers are referred to as foreign keys (they define the respective departments in which the employees work). The columns DEPT.DNO and EMP.DNO establish a join key, and a row in the "department" table is joined to a row in the "employee" table only if the rows match on the join key, that is, only if DEPT.DNO=EMP.DNO for that row. The above-described join can be undertaken to associate employees by department using the following join query syntax:
EMP.NAME//employee name PA2 EMP.SALARY//employee salary
where DEPT.DNO=EMP.DNO//such that the employee department number equals a department number in the "department" table
The result of the above join is a third table in which employee names and salaries have been associated with department names by joining a row in the department table with a corresponding row in the employee table.
A method known to those skilled in the art for effecting the above-described join is referred to as "hybrid hash". Using a hybrid hash scheme, the smaller of two tables to be joined is designated a build table, and the other table is designated a probe table. Either table can be a base table, i.e., a table stored in the database, or a derived table, i.e., a table that is the result of a previous, intermediate join, or other query operations. In any case, if the entire build table can be stored in the main memory of the dbms computer, it is written into main memory, and the probe table is then checked against it for matches on the join key using a hash table on the join key.
Often, however, the entire build table is too large to be written into the main memory of the dbms computer. Under these circumstances, only a single partition of the build table is written into memory. To effect this, the build table is scanned and each row is hashed on its join key to establish one large partition that is written into memory and several smaller partitions that are written to disk. Stated differently, a join key mathematically operates on each of the rows of the build table in accordance with a hash function to yield respective hash values, with rows having a common range of hash values with each other being grouped together in a "partition".
Conventionally, a non-uniform hash function is used to generate "N" partitions, denoted B.sub.1, B.sub.2, . . . ,B.sub.N. The reason the conventional hash function is non-uniform is that the first partition B.sub.1, is special, in that it is sized to be just less than the size of the memory of the dbms computer for reasons that will be explained shortly. The remaining partitions are equally sized and are stored in disk, since the first partition B. essentially consumes all non-disk memory (except for a small portion that is reserved for subsequent staging to disk of the partitions B.sub.2, . . . B.sub.N).
After the build phase of a table join, the probe phase is undertaken. During the probe phase, the same hash function that was used during the build phase is used to divide the probe table into partitions P.sub.1, P.sub.2, . . . ,P.sub.N, and then rows of each probe partition P.sub.i are joined to rows of the corresponding build partition B.sub.i that match on the join key. More particularly, when a probe table row is hashed on its join key and the resulting value maps to the first probe table partition P.sub.1, the main-memory hash table partition B.sub.1 is searched to determine whether a match exists, and if so, the probe and build rows are joined and output. On the other hand, if the hash value of a probe table row maps to any probe partition P.sub.i other than the first partition P.sub.1, the probe table row is spooled to a small staging area in main memory before being written to disk. After all rows in the first probe partition P.sub.1 have been checked against the first build partition B.sub.1, the next build partition B.sub.i is written into memory from disk, a main-memory hash table is constructed based on it, and then the corresponding probe partition P.sub.i is scanned to check its rows for matches with the rows of the build table B.sub.i. The process is repeated until all build partitions have been serially written into main memory and checked.
As recognized by the present invention, the performance of the above-described conventional table join depends on how well the size of the first build partition B.sub.1 matches the size of the main memory, because data that hashes to the remaining partitions must be written to and read from disk, and it is desirable for efficiency considerations to minimize such input/output (I/O) operations. Thus, to minimize such I/O operations, the size of the first build partition ideally matches the size of the main memory (less space for overhead and staging). As recognized herein, however, a "good" match is not always effected as well as might be hoped, because the values used in the hash function are based on database statistics that are only approximate, and consequently the actual size of the first build partition can be greater or smaller than expected. When the size of the first partition exceeds the size of the main memory, a memory overflow results; when it is smaller than memory, I/O for probe tuples is not optimized. Either way, performance is degraded.
In a paper entitled "The Effect of Bucket Size Tuning in the Dynamic Hybrid of GRACE Hash Join Method" by Kitsuregawa et al., published in the Proceedings of the Fifteenth Int'l Conf. on Very Large Data Bases (1989), to address the first of the above-mentioned problems, namely, memory overflow, it is suggested to overpartition the build table. Essentially, Kitsuregawa et al. propose dividing the build table into many small partitions and, thus, avoiding any one partition overflowing memory. Unfortunately, Kitsuregawa et al. do not consider the second of the above-noted problems, namely, efficiently packing memory so that I/O is minimized during processing of the probe table. Fortunately, the present invention recognizes that the above-noted problems can be addressed.