Databases are used to store information for an innumerable number of applications, including various commercial, industrial, technical, scientific and educational applications. Many databases are relational databases, which organize information into formally-defined tables consisting of rows (i.e., records) and columns (i.e., fields), and which are typically accessed using a standardized language such as Structured Query Language (SQL). Database management systems (DBMS's) are the computer programs that are used to access the information stored in the databases, and to process searches, or queries, against the databases.
In general, a database query is an expression that references at least one database table in a database and includes one or more predicates connected with logical operators. Predicates that are connected with a logical operator may be referred to as the operands of the logical operator. The predicates may include an expression that references a field of the database table, multiple key values in that field, and an operator (e.g., in, etc.). The query may also include a clause that indicates that a subset of an answer set (e.g., a result table) should be returned instead of the entire answer set. To execute the query, many DBMS's perform query optimization, in which multiple execution plans or access plans for satisfying the database query are examined by a query optimizer to determine the most efficient way to execute the query.
The query optimizer determines the best access plan for executing the query by rearranging and regrouping the predicates that constitute the query into several different logical expressions. Each logical expression is then evaluated until the most cost effective plan is obtained. Memory is required to store each logical expression and time is required to evaluate each logical expression. As the number of predicates increases, the logical expressions formed by rearranging and regrouping the predicates become more complicated. Thus, more storage is required to store the internal representation of each expression, and more processing time is required to evaluate each expression.
A data structure consisting of a truth table and a list of predicates may be used to represent a logical expression. The truth table may be generated from the list of predicates of the logical expression. The truth table typically contains N columns, one column for each predicate, and 2N rows, where N is the number of predicates in the logical expression. Each row of the truth table corresponds to a different combination of logical predicate values. N result values are associated with the truth table, each result value corresponding to the value of the logical expression corresponding to the combination of predicates in a row of the truth table. The truth table may be summarized as a list of 2N binary characters, which includes a binary character for each result value of the truth table.
As the number of predicates increases, the storage required to store the internal representation of each logical expression may grow exponentially because 2N binary characters may be required to summarize the truth table of each logical expression. For example, when the logical expression contains 20 predicates, 220 bits, or 128 KB, of memory may be required to represent the truth table of the logical expression. When the logical expression contains 40 predicates, 240 bits, or 128 GB may be required to represent the truth table of the logical expression. But, to represent a moderately complex expression of 1,000 predicates, 21000 bits, or 1.247×10291 GB, is required to represent the truth table of the logical expression. Thus, the truth table model may be inadequate for efficiently representing complex logical expressions containing many predicates.
The selectivity of an expression may be estimated to aid query optimization. Selectivity may be estimated by breaking an expression into intersections, constructing a subset expression from each intersection, calculating the selectivity for each subset expression, and using the calculated selectivity of each subset expression to estimate the selectivity of the entire expression. As the expression becomes more complex, the number of intersections for the expression may be too numerous to process within a reasonable amount of time. In such cases, a subset of intersections may be evaluated to estimate the selectivity of the entire expression, allowing optimization to remain fast without sacrificing the accuracy of the estimate. A truth table may also be used to represent the logical expression for selectivity estimation. But, because of the same scaling issues associated with storing the truth table in memory, estimating the selectivity of an expression based on a truth table representation of the expression may be inadequate for complex expressions.
Therefore, a need continues to exist in the art for an improved manner for representing a database query expression.