Database systems store, retrieve, and process information. A user retrieves information from the database by providing a query that is written in a query language such as SQL. The query specifies the information to be retrieved and the manner in which it is to be manipulated or evaluated. To process the query, the database system typically converts the query into a relational expression that describes algebraically the result specified by the query. The relational expression is then used to produce an execution plan, which describes particular steps to be taken by a computer in order to produce the sought result.
Formerly, the data in a database was treated as being static with different queries being applied to retrieve portions of the data. More recently, a stream processing or continuous querying model treats the queries as being generally static with the data of the database being dynamically added or updated. The stream processing or continuous querying model accommodates complex queries that are typically formed from common query components or sub-expressions. The common sub-expressions of these complex queries may be used in many different queries.
In the stream processing or continuous querying model, queries are executed instantaneously as data in the database changes or is added. In this context, the importance of utilizing common sub-expressions for query evaluation is paramount, since the processing work done to evaluate a single sub-expression can be shared by multiple queries. Optimizing sets of queries that share common sub-expressions is referred to as multi-query optimization. Query optimization takes a parsed representation of a SQL query as input and is responsible for generating an efficient execution plan for the given SQL query.
A query in the Structured Query Language (SQL) format includes a selection of rows from one or more columns in a specified table according to a specified condition or predicate. For example, a database table “R” may include a column A, with alphabetic values x, y, x, n, etc. in successive rows, and a column B with numeric values 55, 30, 19, 60, etc., in successive rows, respectively, so that values x and 55 are in a common row, values y and 30 are in a common row, and so forth. Table R could include many other columns as well, which could be designated C, D, E, etc.
An SQL query (e.g., referred to as Q(alpha)) may specify, for example, a selection of rows from column A and column B, in Table R, according to the predicate that the numeric values in column B are greater than 50. Another SQL query (e.g., referred to as Q(beta)) may specify, for example, a selection of rows from column A, column B, and column C in Table R, according to the predicate that the numeric values in column B are greater than 75. Queries Q(alpha) and Q(beta) could be just two of many (e.g., hundreds or thousands) of query sub-expressions that are applicable to the database.
In conventional multi-query optimization, query sub-expressions with common predicates, such as queries Q(alpha) and Q(beta), are grouped together to form a group filter or predicate index. Queries Q(alpha) and Q(beta) have common predicates in that both relates to the same variable (i.e., values in column B) and both employ the same operand (i.e., the “greater than” comparison), but different constants (i.e., 50 and 75). The grouping of such queries forms a tree structure for the “B>” predicates in which the nodes in the tree are the different constants of the different predicates.
In this example, the constant value 50 of Q(alpha) forms a first left node of the tree structure, and the constant value 75 of Q(beta) forms a first right node of the tree structure. Successive queries with common predicates are added to the tree structure such that constants with a value greater than a node branch to the right from the node and constants with a value less than a node branch to the left from the node. Once the predicate index is formed, new data can be readily compared against all the common predicate queries according to where the new data fits in the tree structure.
A limitation of such conventional multi-query optimization; however, is that changed data must be compared against all the predicate indices for the table, even if the data changes affect only one column. As a consequence, data changes result in large amounts of query evaluation that require extensive computational resources even when some of the query evaluations are unrelated to the changed data.
Accordingly, the present invention provides multi-query optimization that remembers the outcomes of previous predicate query evaluations to efficiently handle the case where data is changed. In this way, only changes to affected data columns need to be reconciled when continuous query evaluation is performed. In one implementation, the invention includes a database query optimization method for database queries that include predicates, the queries being directed to at least one database table having data in rows and columns. The method includes associating a lineage bit set with each row of the database table, the lineage bit set providing a binary predicate state indication whether each predicate is met or are not met by data in the row. A dependency bit set is established to provide a binary indication of the predicates that are included in each query. Queries are applied against the database table by a logical operation between the lineage bit set and the dependency bit set.
Another aspect of the invention includes a database query optimization data structure, which includes a lineage bit set associated with each row in the database table. The lineage bit set provides a binary predicate state indication whether each predicate is met or are not met by data in the row. The data structure further includes a dependency bit set that provides a binary indication of the predicates included in each query.
A benefit of the lineage bit set is that it preserves the state of previous predicate evaluations. The history of what query predicates a data row currently satisfies is maintained by the bits of the lineage bit set. This is particularly advantageous for handling data updates because query evaluation can be limited to only a subset of involved predicates. When a data row changes, only those predicates associated with modified data columns need to be evaluated. The shared predicate registry can be used to map modified data columns to such involved predicates, so only the identified predicates need to be evaluated for the changed data row. The state of the other, non-involved predicates identified by a query's dependency bit set are preserved in the row's lineage bit set and can be efficiently combined with the involved predicates.
Additional objects and advantages of the present invention will be apparent from the detailed description of the preferred embodiment thereof, which proceeds with reference to the accompanying drawings.