Databases are used to store information for an innumerable number of applications, including various commercial, industrial, technical, scientific and educational applications. As the reliance on information increases, both the volume of information stored in most databases, as well as the number of users wishing to access that information, likewise increases. Moreover, as the volume of information in a database, and the number of users wishing to access the database, increases, the amount of computing resources required to manage such a database increases as well.
Database management systems (DBMS's), which are the computer programs that are used to access the information stored in databases, therefore often require tremendous resources to handle the heavy workloads placed on such systems. As such, significant resources have been devoted to increasing the performance of database management systems with respect to processing searches, or queries, to databases.
Improvements to both computer hardware and software have improved the capacities of conventional database management systems. For example, in the hardware realm, increases in microprocessor performance, coupled with improved memory management systems, have improved the number of queries that a particular microprocessor can perform in a given unit of time. Furthermore, the use of multiple microprocessors and/or multiple networked computers has further increased the capacities of many database management systems.
From a software standpoint, the use of relational databases, which organize information into formally-defined tables consisting of rows and columns, and which are typically accessed using a standardized language such as Structured Query Language (SQL), has substantially improved processing efficiency, as well as substantially simplified the creation, organization, and extension of information within a database. Furthermore, significant development efforts have been directed toward query “optimization,” whereby the execution of particular searches, or queries, is optimized in an automated manner to minimize the amount of resources required to execute each query.
Through the incorporation of various hardware and software improvements, many high performance database management systems are able to handle hundreds or even thousands of queries each second, even on databases containing millions or billions of records. However, further increases in information volume and workload are inevitable, so continued advancements in database management systems are still required.
One area that has been a fertile area for academic and corporate research is that of improving the designs of the “query optimizers” utilized in many conventional database management systems. The primary task of a query optimizer is to choose the most efficient way to execute each database query, or request, passed to the database management system by a user. The output of an optimization process is typically referred to as an “execution plan,” “access plan,” or just “plan.” Such a plan typically incorporates (often in a proprietary form unique to each optimizer/DBMS) low-level information telling the database engine that ultimately handles a query precisely what steps to take (and in what order) to execute the query. Also typically associated with each generated plan is an optimizer's estimate of how long it will take to run the query using that plan.
An optimizer's job is often necessary and difficult because of the enormous number (i.e., “countably infinite” number) of possible query forms that can be generated in a database management system, e.g., due to factors such as the use of SQL queries with any number of relational tables made up of countless data columns of various types, the theoretically infinite number of methods of accessing the actual data records from each table referenced (e.g., using an index, a hash table, etc.), the possible combinations of those methods of access among all the tables referenced, etc. An optimizer is often permitted to rewrite a query (or portion of it) into any equivalent form, and since for any given query there are typically many equivalent forms, an optimizer has a countably infinite universe of extremely diverse possible solutions (plans) to consider. On the other hand, an optimizer is often required to use minimal system resources given the desirability for high throughput. As such, an optimizer often has only a limited amount of time to pare the search space of possible execution plans down to an optimal plan for a particular query.
A typical optimization process often involves numerous mathematical calculations in order to estimate the amount of processing resources required to execute a query using a particular access plan, often referred to as the “cost” of the access plan. A cost is usually expressed in some unit of time, and an optimizer typically works by calculating the cost for each plan, then comparing each plan to the other available options and choosing the best (or most efficient) plan, a process often referred to as “costing.”
Nearly all computations involved in a costing process are based on statistical information about the values in particular columns of data in the files (or tables) referenced in a query, and as such a costing process is necessarily probabilistic in nature since the only way to know such information with 100% certainty is to run the query. Doing so would be foolish, since a) there are countless possible plans to try, and b) the whole purpose of the optimizer is to generate the best way to run the query a priori.
The statistical information about the values in columns of tables is referred to as metadata, which can be used to help an optimizer find answers to an innumerable number of questions, including but not limited to questions such as:                How many rows will be selected by a given WHERE clause (without regard to which access path or plan is chosen)?        How many random I/O and/or sequential I/Os will occur if a given access path is used to evaluate a given row selection?        How many groups will be produced for a given grouping expression?        How many distinct values will be found in the result set defined with a given select list expression, group by expression, and selection expression?        
Conventional optimizers typically contain all aspects of such metadata, including their source (statistics collection, index statistics, etc.), type of data collection (e.g., random sample-based histogram), and the complexity of manipulating this raw information into metadata that can be directly used by the optimizer in its costing work.
One problem associated with incorporating metadata handling functionality in an optimizer is due to the lack of flexibility in optimally processing metadata during costing operations. In particular, collecting and refining metadata can be performed in a number of different manners, and new algorithms for performing such operations are continuously being developed. However, typically an optimizer is hard coded with specific algorithms for collecting and refining metadata. The introduction of a new algorithm into an optimizer would typically require extensive changes to the optimizer, so it is unlikely that a particular optimizer design will use the most efficient and up-to-date algorithms to handle metadata.
As an example, some optimizers use a type of metadata referred to as a “filter factor” to do estimates of how many records part of query will return, a process that is often critical in deciding, for example, in what order to join multiple files. There are many ways for such filter factors to be determined based on raw statistics and other data; for the purposes of optimization, however, those methods should be irrelevant. The optimizer simply wants to know what to use for a filter factor; it should not have to be concerned with how to construct one. Nonetheless, conventional optimizer designs are often limited to one particular method of calculating a filter factor.
In addition, the optimization of a single query (referred to herein as an optimizer instance) may be a long-running process, during which the contents and other characteristics of the involved database files may change. To ensure consistent costing results, each optimizer instance desirably would use consistent values for its metadata, no matter how many times the same columns and combinations are used in cost estimation. To address this concern, some optimizers attempt to keep track of these values themselves, which adds unnecessary complexity to their logic. Other optimizers ignore the problem altogether, often resulting in inexplicable and unreproducible cost estimate differences.
Furthermore, conventional optimizer implementations typically retrieve raw statistical data and other information about the values in a table directly from the table itself. This means that for similar queries involving the same columns in different processes and/or in different but similar combinations (a very common occurrence in most environments), the same calculations and analysis must be repeated in each optimization process that takes place on the system in order to do cost estimation for columns from those tables. Often, this is a waste of time and system resources, since if the values in the file do not change for a given period of time, neither will any of the metadata or the values derived from them during that period.
Therefore, a significant need exists in the art for a manner of facilitating the management of metadata in association with generating cost estimates during optimization of database queries in a database management system.