This invention relates generally to digital computers, and more particularly to systems and methods for optimizing queries used with databases systems.
A database system is a computerized record keeping system which allows a user of a database to manipulate information. One type of database which has gained popularity because of its ease-of-use and simple structure is a relational database. Consider the ubiquitous spread-sheet.
In a relational database, the relationships of the data are usually maintained in a table having rows and columns. The rows organize instances of related data, and the columns group data having, common characteristics or attributes.
The operation of locating data in the database is called a query. Typically, queries are expressed in a query language, for example a standard query language (SQL). The statements of the query language set out the search criteria in logical or xe2x80x9cBooleanxe2x80x9d expressions. A simple query for a census database could be paraphrased as: locate all persons having an income in the range of $20,000 to $30,000. In this statement, the attribute of the subject persons to be examined is xe2x80x9cincome,xe2x80x9d the logical condition to be satisfied is xe2x80x9cbetweenxe2x80x9d, and the values which must satisfy the logical condition are in the range of 20000 to 30000.
In a query, the expressions which determine the search criteria are called predicates. Most predicates in queries are phrased in terms of a range of values, for example greater than, less than, between, etc. These type of predicate relationships are called RANGE predicates. Other type of query specifications, for example mathematical relationships, are called non-RANGE predicates.
In a query, range, and non-RANGE predicates can be related by logical operators such as AND, OR, and NOT. During processing of the query, if a particular value falls within the specified range, a logical true condition is returned for the predicate, otherwise a false condition is returned. The true and false conditions of all query predicates can be combined to determine which data are selected from the database.
Processing queries is a frequent, and often time consuming task since the database system can include a large amount of data, and a given query may need to retrieve data from different portions of the database. A recurring problem in the design of database system is to make the processing of queries more efficient.
One way to locate data is to sequentially scan the rows of the table, e.g., the records of the database, for the attribute values which satisfy the search criteria. This is known as a table scan, because the relationship table is scanned from beginning to end. Although the table may imply that the rows storing related data are in sequential order, in fact, the data may be spread-out randomly over multiple storage devices. Therefore, table scans are generally inefficient and time consuming.
In most database systems, a separately maintained index structure is used to facilitate the locating of records of data. The index includes keys having values and addresses. The keys, corresponding to records, are usually stored in a selected order of their values, and the addresses correspond to physical locations of the disk blocks and memory pages where the data associated with the key are stored. If the keys of the index are constructed from several database attributes or columns, the key is called a compound key, and the individual portions of the key corresponding to the several attributes are called key xe2x80x9csegments.xe2x80x9d
One method of organizing an index is a B-tree with a xe2x80x9croot,xe2x80x9d intermediate xe2x80x9cbranchesxe2x80x9d and interior nodes depending from the root, and leaves depending from the end of branches. The term B-tree is short for xe2x80x9cbalanced tree,xe2x80x9d and refers to the roughly equal number of branches which depend from the root and intermediate branches, and the uniform or balanced distribution of the leaves among the branches. The leaves correspond to the actual index entries storing key and address data.
A particular index entry can be found by following the branches from the root to the desired leaf. For example, to scan the rows of a relational table in a selected index order, the leaves are scanned sequentially, and the index entries at the leaves are used to locate the rows of the database.
When an index scan is used to find keys, the search for data which satisfy the query conditions can be performed more efficiently than for a table scan. For example, prior to scanning the B-tree, it could be determined that certain branches do not include any leaves which satisfy the search criteria. The searching of the leaves of such branches can be skipped in their entirety to speed up the querying process.
Most prior art database systems have included a component called a query optimizer. The query optimizers transforms a given query into an equivalent optimized query which may be processed more efficiently than the given query. The design of a query optimizer is complex, and is made more so by the fact that certain factors which influence the optimizing transformation are not known until the query is actually processed.
Range optimizers extract true and false ranges of key values from the query expression. Only those branches which include keys within the true range are scanned. Branches with key values in the false ranges are skipped as if they were xe2x80x9cprunedxe2x80x9d from the B-tree. However, if the query includes multiple conditions, the true and false ranges may overlap so that the scan of the index is less than optimal. Also, the values of the predicates of the queries are frequently expressed in terms of variables which are only resolvable when the query is processed.
In another type of query optimizer, the query is rearranged according to the frequency of true evaluations of the underlying query predicates in order to decrease the average length of time required for Boolean evaluation and consequently for scanning the index. However, range optimization is usually substantially more efficient than scanning according to frequencies because it reduces the number of physical accesses to the data, and because most queries are phrased in terms of range conditions.
Understandably, there is a need for a query optimizer which can process a query having multiple ranges more efficiently. Furthermore, the optimizer should be able to optimize queries which include values which are expressed as variables. Furthermore, the optimizer should further reduce processing time by eliminating unnecessary physical accesses of the data while executing of the query. The invention satisfies these objectives.
This invention provides a method for scanning and evaluating records of a database according to a query composed as an arbitrary Boolean restriction or selection criteria. The invention determines the near-largest interval for which the selection criteria is always false and avoids scanning the corresponding portion of the database. Also, within any interval of values for which the selection criteria is always true, evaluation of the records is not necessary, since the records satisfy the selection criteria.
The technique uses a computerized query optimizer. The database is organized as an ordered set of records. The optimizer employs a scanner and an evaluator. A query is composed as ranges of record values related by logical operators. Prior to scanning and evaluating the records of the database, the query is converted to a Boolean tree in canonical form. The tree is optimized to express the ranges of values as a set of disjoint semi-open ranges. Overlapping ranges related by the logical AND operator are collapsed into a single intersecting range. Similarly, ranges related by the OR operator can be expressed as a union of ranges.
The scanner reads a next record from the database. The evaluator, using the selection criteria, delivers a logical true or false condition for the record. In addition, the evaluator also delivers an interval of values having the same logical condition as the logical condition of the record. The interval includes the value of the record, and a high limit value of the interval. If the logical condition of the record is false, the scanner skips over records having values of the interval. That is, records having values in this interval do not need to be read or evaluated. Otherwise, if the logical condition is true, records having values of the interval are selected without requiring individual evaluation of the record values with respect the selection criteria.
In an alternative embodiment, the invention is applied to records having values which are a concatenation of several record attribute values or segments. In this embodiment, a partial logical evaluation of the attribute values is performed in the same order that the segments or attributes are concatenated. As the partial logical conditions are delivered, the technique determines if the partial logical condition of the segment is definite for the entire record value. If this is true, the logical condition of the entire records value is immediately known and evaluation of the remaining segments is not required. Otherwise, additional attribute values are evaluated until a definite logical condition can be determined. For each segment, a near-largest interval is maintained to accelerate the selection of records.