Structured Query Language (SQL) is a standardized language for accessing and updating relational databases. The American National Standards Institute (ANSI) and the International Standards Organization (ISO) published the first standard in 1986. The current standard is SQL-99.
SQL provides a mechanism for creating tables, each of which contains rows and columns of information that are collectively assembled into a database. Ideally, the tables are xe2x80x9cnormalizedxe2x80x9d in that the structure of the tables avoids data redundancy and allows the resulting data model to be mapped to many different physical database designs. In order to avoid redundancy, yet still be able to display data from multiple tables, SQL provides a mechanism called a join. There are many types of joins that provide various results. Some types of joining, however, have some important performance and cost implications.
Once multiple tables are involved, performance can be affected significantly by the execution plan (i.e., in the specific way the tables are joined). Join performance is much more acute when the tables are large, such as in data warehouse applications. The quest for performance has encouraged optimization of joining techniques and the development of the sort merge join, the nested loops join, and the hash join.
Hashing translates an index, or a join-column value, to an offset and then to a database address. If a row needs to be selected on the basis of a supplied index value, it can be done by converting the index through a hash algorithm, to an offset that can then be added to the xe2x80x9crow identifierxe2x80x9d of the first row to provide the address of the block where the information is stored. Thus, a row can be identified through the key value without applying an index and without having to perform a full table scan. This mechanism can be used in a hash cluster, which contains rows with the same hash value. In certain circumstances, hash clusters can provide considerable performance advantages over indexing.
In situations where the SQL query contains an inclusion or exclusion condition (i.e., an xe2x80x9cINxe2x80x9d or xe2x80x9cNOT INxe2x80x9d condition, respectively), a merge-join is normally performed. Typically, the merge-join requires two full table scans and a sort operation. Such sort operations typically require significant resources and add significantly to the cost of the operation. Employing a hash join would reduce the resources necessary to perform such an operation. There is, therefor, a need in the art for a hash join method that allows for inclusion/exclusion conditions in the SQL statement.
The invention overcomes the above-identified problems as well as other shortcomings and deficiencies of existing technologies by providing a method of including inclusion and exclusion conditionals in a hash join that consumes fewer resources than traditional merge-joins.
Accordingly, an exemplary embodiment of the invention is directed to a method for performing inclusion and exclusion hash joins. The method enables the joining of an inner table and an outer table in a database in response to a SQL statement having an inclusion or exclusion operator.
In general, the method of the present invention relates to joining an inner table and an outer table in a database in response to a query statement having an inclusion operator. The method first transposes (swaps) the inner and the outer tables to form left and right tables, respectively. The swapping of tables enables the utilization of existing (optimized) join modules. Thereafter, a hash table is created from the left table. Then, the rows of the right table are scanned and compared (probed) to the values in the hash table. If a right table hash value matches a value in the hash table, then the actual values of the respective rows are evaluated for the join condition and, if satisfied, that outer table row is included in the result. Because the hash match is a necessaryxe2x80x94but not sufficientxe2x80x94condition for the outer row to be qualified, the additional step of evaluating the join, namely a check to determine if the join column value from the outer and inner rows also match, must be performed before the row is included in the result. Typically, the comparison process proceeds one row at a time. However, the method of the present invention is amenable to parallelization, with processing occurring one row at a time for each of the parallel processors.
Yet another alternate embodiment of the method of the present invention relates to joining an inner table and an outer table in a database in response to a query statement having an exclusion operator. As before, the method of the present invention first transposes (swaps) the inner and the outer tables to form left and right tables, respectively. A hash table is created from the left table. Then, the rows of the right table are scanned and compared probed) to the values in the hash table. In the case of the exclusion hash join of the present invention, if the hash value from the outer table is found not to be equivalent to any value in the hash table, then the row from the outer table (which is already on hand because it was used to calculate the outer hash table value) can be included immediately in the join result and further processing for that right table row can stop, again conserving system resources. Quick termination as provided by the present invention results in considerable savings in both time, memory requirements, and computing capacity. If one or more values in the hash table match the hash value from the right table, then the corresponding rows of the outer table must be evaluated for the join condition. Only if the join conditions are satisfied is the outer row excluded from the result. Otherwise, processing continues for the other rows of the right table. Thus, for the exclusion join situation, the outer row will have to be paired (compared) to all of the inner rows with the same hash and, if no match is found, only then would that outer row be included in the result. Typically the comparison process proceeds one row at a time although this process is amenable to parallel processing, with each processor proceeding one row at a time.
The method of the present invention can be implemented on a database system having a database with, typically, two tables for responding to SQL statements that designate an outer table and an inner table within the database. Alternatively, one physical table can be used and referenced twice, first as the outer table, and then again as the inner table. The database system itself consists of system memory that is capable of storing a hash table and perhaps other database-related parameters. In addition, the database system has at least one processor that is operative with the system memory. The processor is used to generate a hash table from the one of the tables. The processor is also used to scan the other table (which may be empty, e.g., have zero records) in order to generate an hash value that can then be compared the hash values in the hash table to determine whether or not an inclusion or an exclusion condition is satisfied. Those outer table rows satisfying the inclusion/exclusion condition are appended to a result.
Features and advantages of the invention will be apparent from the following description of the embodiments, given for the purpose of disclosure and taken in conjunction with the accompanying drawings.