This invention relates to data processing, and in particular, to obtaining a plan for executing a query in a relational database.
Databases are well known systems for storing, searching, and retrieving information stored in a computer. The most prevalent type of database used today is the relational database, which stores data using a set of tables that may be reorganized and accessed in a number of different ways. Users access information in relational databases using a relational database management system (RDBMS).
In a typical RDBMS, data is externally structured into tables. Each table has a set of one or more columns and rows. The rows, or entries, in the tables generally correspond to data records, and the columns generally correspond to the fields in each data record. Each column typically specifies a name and a data type (e.g., integer, float, string, etc.) and may be used to store a common element of data.
Current relational databases require that queries be composed in query languages. Today, the most widely used query language is Structured Query Language (SQL). However, other query languages are also used.
An SQL interface allows users to formulate relational operations on the tables. Operators are provided in SQL, which allow the user to manipulate the data. Each operator operates on one or more tables and produces a new table as a result. SQL enables information from multiple tables to be linked together to perform complex sets of procedures with a single statement, such as a SELECT statement.
In modern RDBMS, the overhead associated with processing client requests can be troublesome. Cache and buffer overflow, I/O bottlenecks, wasted CPU cycle time, shared memory latch contention, network throughput, and other performance side effects often result from poor planning and untested design.
To avoid these and other by-products of a poorly designed system, a client/server DBMS architecture could benefit greatly from a streamlined database statement processing system. In a typical DBMS architecture, a client issues a database query (hereinafter illustratively referred to as a “SQL query”) to a process running on the database server. The server expends a great deal of its run-time resources in parsing the request, creating an execution tree, semantically analyzing the statement, and determining an optimal execution plan. These steps are needed to store and create a “cursor” in cache memory before the server can effectively carry out a client request or return a result set. A cursor is a handle to a query execution area, e.g., an area in memory in which a parsed SQL statement and other information for processing the SQL statement is stored.
Database systems frequently invoke an optimizer to handle the task of creating an execution plan that is optimized to a particular SQL statement. A query optimizer is a component of a database management system that attempts to determine the most efficient way to execute a query. Such query optimizers often operate by selecting from among multiple “plans”, or possible implementations of a query, so as to execute the query with the greatest efficiency. The output of an optimizer is typically referred to as a query plan or access plan, which is a form of executable code that can be processed by a database engine to execute the query. Many optimizers operate by selecting or generating multiple potential query plans for a given query, and selecting from the potential query plans an optimal query plan.
Optimization consists of multiple steps including generating query plans, collecting statistics on the data related to the query plans, using the statistics to estimate resource costs of the plan, and selecting a plan with the optimum resource costs. One of the resource intensive tasks of the optimization is collecting the statistics. Statistical information for the data in the underlying database may relate to tables and their indices, objects in a schema, objects in a database, column statistics, etc. Some typical column statistics include the column cardinality (i.e., the number of distinct values in the column), and the Frequent Value List (FVL), which identifies the most populous values in the column.
To perform a search of a table to locate records that match a particular criterion, a table can often be analyzed using either table scans or index probes. A table scan operates more or less by sequentially stepping through each record in a table to find matching records, while an index probe is keyed off of an index that is generated for the table. A table scan is typically more efficient when a large number of records match the criterion, while an index probe (which has additional overhead associated with generating the index) is typically more efficient when only a small number of records match the criterion.
Efficient SQL query plan caching is a problem. Generally, any schemas (with a set of tables and indexes) are generated with the same layout (templatized), in which only the underlying data is different but often has the same statistical properties. As an example, a car manufacturer my have relational databases for each dealer, while a title company may create one schema per order using different templates (pre-defined layouts) to process different types of orders. While this templatized solution works somewhat well for these businesses, a plan cache problem surfaces due to the fact that the same plan may get generated n times for the n different libraries and tables in those libraries. A typical reported problem is that the plan cache is too small because of the number of libraries and tables, and thus it needs to be set very large (e.g., 20 GB). Though the plan cache may be set to be very large, most plans are not reused but are swapped out instead. This is because no match is found with the current way plans are stored and matched in the plan cache.
As a typical example, there may be 1000 schemas with the same layout: tables, customers, orders, suppliers and same set of indexes. In this example, only the names of the schemas are different, e.g., S1, S2, . . . , etc. In this typical scenario, a sample query would look like this:
Select*From S1.customers, S1.orders
where customers.customerid=orders.customerid
where orders.timestamp between ‘12-106’ and ‘12-3106’
In this example, the query generates a plan instance to be cached or matched, and any other plans referencing data spaces other than S1.customer, S1.orders result in a caching mismatch. Thus, for the 1000 query variation with just a schema name difference, 1000 plans are to be generated or matched when looking for existing plans in order to possibly reuse them.
Generally, caching plans in the plan cache provides benefits by reducing full optimization and reusing matching plans for future database queries. Currently, plans are associated with the actual data spaces. So, in cases in which a system contains a large amount of schemas and query requests that are widely spread among those schemas, the benefits of caching plans are greatly reduced. As there is always a reasonable memory limit for the plan cache size, only certain number of plans can be stored. The new query requests keep hitting mismatches in the cache causing long searching overhead for matching plans, a great chance for current query request to do full-optimization, a higher memory footprint, and higher general overhead, for example for purging old(er) plans out of the plan cache to make room for new ones, when the plan cache limits are reached.
Another problem for relational databases it that is not always obvious to detect that two tables have the same general definition, i.e., the same “signature”, e.g., the same columns and the same set of indexes, and just a different name, or are located in just a different schema. In many cases, the underlying table data is the only difference. Currently, the table definition is a direct a part of the table. In businesses, such as a car manufacturer and a tile company, where many schemas are generated following the same templates, each schema contains the same tables, indexes, etc. When executing query requests, the query engine is not be able to see the structural similarity among those tables without a large amount of work. But, the structural similarity is very clear during the generation time.
Currently, the Query engine doesn't consider a plan to be reusable even when it is okay to be applied to similar tables (having the same “signature”). Full optimizations that are kicked off in this case could be avoided.