1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular, to an automated system for eliminating unnecessary statistics collections for query optimization.
2. Description of Related Art
Relational DataBase Management Systems (RDBMS) using a Structured Query Language (SQL) interface are well known in the art for use in data warehouses. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Standards Organization (ISO).
In an RDBMS, all data is externally structured into tables. A table in a relational database is two dimensional, consisting of rows and columns. Each column has a name, typically describing the type of data held in that column. As new data is added, more rows are inserted into the table. A user query selects some rows of the table by specifying clauses that qualify the rows to be retrieved based on the values in one or more of the columns.
The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host languages such as C, COBOL, etc. Operators are provided in SQL that allow the user to manipulate the data, wherein each operator performs functions on one or more tables and produces a new table as a result. The power of SQL lies on its ability to link information from multiple tables or views together to perform complex sets of procedures with a single statement.
One of the most common SQL queries executed by the RDBMS is the SELECT statement, which is used to perform a query. In the SQL standard, the SELECT statement generally comprises the format: “SELECT<clause>FROM<clause>WHERE<clause>GROUP BY<clause>HAVING<clause>ORDER BY<clause>.” The clauses generally must follow this sequence, but only the SELECT and FROM clauses are required.
Generally, the result of a SELECT statement is a subset of data retrieved by the RDBMS from one or more existing tables stored in the relational database, wherein the FROM clause identifies the name of the table or tables from which data is being selected. The subset of data is treated as a new table, termed the result table.
Optimization is typically performed before the execution of SQL statements in order to improve query performance. Query performance is often dependent on the size and physical design of the database. Thus, a query optimizer in an RDBMS relies on statistics collections to accurately choose an efficient execution plan.
Typically, users are responsible for identifying the specific columns and indexes on which to collect statistics. For example, users may collect optimizer statistics on a specified column or index of a table using the SQL command syntax shown below:COLLECT STATISTICS ON<table_name><column_or_index>Other similar commands may have different but similar SQL command syntax.
The process of collecting statistics usually requires scanning and sorting all of the indexed or column data and is thus resource intensive, especially for large tables. In certain instances, this may mean that a significant percentage of overall system resources is devoted solely to collecting statistics. Hence, it is critical that users correctly identify only those columns and indexes that truly benefit from statistics collection.
Unfortunately, this is difficult for most users because it requires detailed knowledge of a query optimizer's plan selection process and runtime execution strategies. Furthermore, many users continually tune queries in their workloads and, over the course of time, it is not uncommon for them to accumulate a list of collections that includes hundreds or even thousands of individual columns and indexes. At that point, it is likely that a subset of these statistics collections is no longer necessary to achieve the same workload performance. Of course, it is nearly impossible to manually identify this subset.
Database vendors often provide tools or “wizards” that examine a user's workload and make recommendations regarding query optimizer statistics. However, the current focus of such tools is on the collection of additional statistics to improve query optimizer plan selection, which in turn improves workload performance.
What is lacking with such tools is the capability of recommending the removal of unnecessary statistics collections. More specifically, what is needed is an automated method of determining the minimal set of existing statistics collections necessary to retain the current level of workload performance. The present invention satisfies this need.