Databases are used to store information for an innumerable number of applications, including various commercial, industrial, technical, scientific and educational applications. As the reliance on information increases, both the volume of information stored in most databases, as well as the number of users wishing to access that information, likewise increases. As the volume of information in a database, and the number of users wishing to access the database, increases, the amount of computing resources required to manage such a database increases as well.
Database management systems (DBMS's), which are the computer programs that are used to access the information stored in databases, therefore often require tremendous resources to handle the heavy workloads placed on such systems. As such, significant resources have been devoted to increasing the performance of database management systems with respect to processing searches, or queries, to databases.
Improvements to both computer hardware and software have improved the capacities of conventional database management systems. For example, in the hardware realm, increases in microprocessor performance, coupled with improved memory management systems, have improved the number of queries that a particular microprocessor can perform in a given unit of time. Furthermore, the use of multiple microprocessors and/or multiple networked computers has further increased the capacities of many database management systems.
From a software standpoint, the use of relational databases, which organize information into formally-defined tables, and which are typically accessed using a standardized language such as Structured Query Language (SQL), has substantially improved processing efficiency, as well as substantially simplified the creation, organization, and extension of information within a database. Furthermore, significant development efforts have been directed toward query “optimization”, whereby the execution of particular searches, or queries, is optimized in an automated manner to minimize the amount of resources required to execute each query.
Through the incorporation of various hardware and software improvements, many high performance database management systems are able to handle hundreds or even thousands of queries each second, even on databases containing millions or billions of records. However, further increases in information volume and workload are inevitable, so continued advancements in database management systems are still required.
Many conventional database management systems, for example, are inherently interpretive systems, where queries are written in an interpretive language such as SQL, and dynamically interpreted by a query engine during query execution. Runtime interpretation in a computer environment, however, almost always results in reduced performance as compared to direct execution of executable code. Other conventional database management systems have attempted to reduce the amount of interpretation required to execute a query, typically by generating queries that comprise assembled blocks of code, such that a query engine needs only to select and execute code blocks that correspond to particular query instructions.
In addition, many conventional database management systems incorporate query optimizers, which operate to optimize the performance of a query to be executed by a query engine. Such query optimizers often operate by selecting from among multiple “plans”, or possible implementations of a query, so as to execute the query with the greatest efficiency.
As an example, in a relational database, data is stored in tables, where the rows, or entries, in the tables generally correspond to data records, and the columns generally correspond to the fields in each data record. Thus, for example, in a table that stores information about a company's employees, the table may include columns, or fields, representing first name, last name, location, salary, etc., with each row representing each record in the table.
To perform a search of a table to locate records that match a particular criterion, a table can often be analyzed using either scan probes or index probes. A scan probe operates more or less by sequentially stepping through each record in a table to find matching records, while an index probe is keyed off of an index that is generated for the table. A scan probe is typically more efficient when a large number of records match the criterion, while an index probe (which has additional overhead associated with generating the index) is typically more efficient when only a small number of records match the criterion.
Thus, using the above example, assuming a query was directed to finding all employees that had a salary below $50,000, assuming that most employees had a salary below that range, a scan probe would typically be more efficient than an index probe. On the other hand, assuming a query was directed to finding all employees having a first name of “David”, an index probe would typically be more efficient, as the overhead associated with indexing the records based upon first name would be offset by the ability to directly lookup the relatively few number of records that matched the criterion.
Query optimizers typically rely on statistics, developed over time, to select among multiple plans so that the most efficient plan for a particular type of query would be selected. Therefore, a query optimizer in the aforementioned example might recognize that a query directed to the salary field typically generates a large number of matching records, and as such, a scan probe-based plan would be the most appropriate for queries directed to the salary field. Likewise, such a query optimizer might recognize that a query directed to the first name field typically generates a small number of matching records, and as such, an index probe-based plan would be the most appropriate for queries directed to the first name field.
In many circumstances, however, a query optimizer may not be able to determine, during optimization, which type of plan is most efficient for a particular situation. In particular, often the statistics required to make a reasoned decision are only available at runtime, and thus after query optimization has already occurred.
As an example, consider a given query that includes a selection based upon a pattern string that can change during query execution. Based upon what pattern string is used during query execution, precisely which plan is more efficient may change. For instance, in the above example, if a pattern string such as “Dav %” on a first name field is used, an index scan may be the most efficient. However, if a pattern string such as “% id”, where the string is based upon non-leading characters, is used on the first name field, an index scan would not be appropriate. Instead, a table scan would need to be used. Therefore, if the pattern string that is supplied to a query can change during runtime, the query optimizer would not be able to select one of the plans, and would need to default to one plan that is most likely suboptimal in some situations.
In addition, other factors, known only at runtime, may affect which plan is most efficient to implement a given query. For example, different plans may utilize computer resources to various degrees, e.g., one plan is processor intensive, while another is less processor intensive, but requires a comparatively greater amount of working storage. Given that the workload of a database management system may vary from time to time, different plans could work better than others based upon system workload. However, factors such as system workload will generally not be known until query runtime, so a query optimizer could not use performance data such as workload or free storage in selecting an optimal plan to implement a query.
So long as the types of statistics available for use by a query optimizer are limited to those capable of being derived at optimization time, a query optimizer is effectively hampered in its ability to select the most appropriate plan to implement a given query. Therefore, a significant need exists in the art for a manner of optimizing the execution of a query that is capable of making use of a wider variety of statistics in selecting optimal plans for implementing a query.