1. Field
Embodiments of the invention relate to dynamic semi-join processing with runtime optimization.
2. Description of the Related Art
Relational DataBase Management System (RDBMS) software uses relational techniques for storing and retrieving data in a relational database. Relational databases are computerized information storage and retrieval systems. Relational databases are organized into tables that consist of rows and columns of data. The rows may be called tuples or records or rows. A database typically has many tables, and each table typically has multiple records and multiple columns. A RDBMS may use a Structured Query Language (SQL) interface.
A table in a database may be accessed using an index. An index is an ordered set of references (e.g., pointers) to the records in the table. The index is used to access each record in the table using a key (i.e., one of the fields or attributes of the record, which corresponds to a column). The term “key” may also be referred to as “index key”. Without an index, finding a record requires a scan (e.g., linearly) of an entire table. Indexes provide an alternate technique to accessing data in a table. Users may create indexes on a table after the table is built. An index is based on one or more columns of the table.
A query may be described as a request for information from a database based on specific conditions. A query typically includes one or more predicates. A predicate may be described as an element of a search condition that expresses or implies a comparison operation (e.g., A=3).
When a query is received by the RDBMS, the query specifies the data that the user wants, but not how to get to it. When the query is received, during a prepare phase, the RDBMS converts the query into an executable form. During a bind phase, the RDBMS determines access paths to the data to be retrieved by a query that describe how the data should be retrieved. Then, the RDBMS executes the query.
ANDing refers to an AND operation that combines record identifiers (RIDS). An AND operation between two operands may be described as a Boolean operation whose result has the Boolean value 1 if and only if each operand has the Boolean value 1. Index ANDing refers to a technique of processing a query by breaking the retrieval of a result set of record identifiers (RIDS) stored in an index into several operations, which may be referred to as semi-joins, and combining the result sets with AND operations. Similarly, ORing refers to an OR operation that combines RIDS. An OR operation between two operands may be described as a Boolean operation whose result has the Boolean value 1 if either operand has the Boolean value 1. Index ORing refers to processing a query by breaking the retrieval of a result set of record identifiers (RIDS) stored in an index into several operations, which may be referred to as semi-joins, and combining the result sets with OR operations. A join may be described as an SQL relational operation that allows retrieval of data from two or more tables based on matching column values. A semi-join may be described as a partial join in which access occurs from an outer table or index to an index of the inner table. Access to the inner table itself is deferred. A star join is a type of join used to process tables which form a star schema data model. The star schema data model contains a central fact table surrounded by multiple dimension tables, each joined to the central table representing a star-like pattern. In one example, the outer table of a semi-join is a dimension table and the inner table is the fact table. In this example, there may be multiple dimensions, each semi-joined to separate indexes of the inner fact table. Each semi-join accesses an index to retrieve RIDs that match one or more join predicates from the dimensions. The result of a semi-join is a list of all fact table RIDs that match the one or more join predicates. The resulting RIDs from all semi-joins are then ANDed (i.e., intersected) to produce a final RID list. The final RID list is then used to retrieve rows from the fact table.
With conventional index ANDing implementations, the order of ANDing the semi-join results is determined during the prepare or bind phases and stays fixed at execution time. It is possible that the order is not optimal or that one or more semi-joins provides poor selectivity.
For example, the following query (1) is a star join query with three dimension tables D1, D2, D3 and a fact table F. Query (1) selects a value from column C1 from a row in the fact table F when the value of column C1 in dimension table D1 equals the value of column C1 in the fact table, the value of column C2 in dimension table D2 equals the value of column C2 in the fact table, and the value of column C3 in dimension table D3 equals the value of column C3 in the fact table. Thus, for query (1), each dimension table is joined with the fact table as specified in the WHERE clause.SELECT F.C1 FROM D1, D2, D3, F WHERE D1.C1=F.C1 AND D2.C2=F.C2 AND D3.C3=F.C3;  Query (1)
A query optimizer of the RDBMS may determine the following semi-join ANDing operation order during the prepare or bind phase, and this operation order is executed sequentially at execution time. FIG. 1 illustrates structures 100 showing the semi-join ANDing order (1) of this example.
Order (1)
    1) R1=D1->F (result R1 is the result of semi-joining dimension table D1 with fact table F)    2) R2=D2->F (result R2 is the result of semi-joining dimension table D2 with fact table F)    3) R3=R1 AND R2 (result R3 is the result of ANDing result R1 and result R2)    4) R4=D3->F (result R4 is the result of semi-joining dimension table D3 with fact table F)    5) R5=R3 AND R4 (result R5 is the result of ANDing result R3 and result R4)
It is possible that at the execution time, result R1 yields a very large result set due to its poor selectivity. So processing the semi-join of dimension table D3 with the fact table F first and processing the semi-join of dimension table D2 with the fact table F second would have provided better performance (since ANDing result R2 and result R4 results in a smaller intermediate RID list than the RID list of ANDing result R1 and result R2 in this example). Moreover, it is possible that the computation of result R1 may run out memory for storing the RIDs before computation of result R2 and result R4 get a chance to complete.
Thus, there is a need in the art for improved processing of queries with semi-joins.