1. Field of the Invention
The present invention relates generally to data processing environments and, more particularly, to the automated determination of optimal indexes to use for a given workload environment.
2. Description of the Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as “records” having “fields” of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about the underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of the underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of database management systems is well known in the art. See e.g., Date, C., “An Introduction to Database Systems, Seventh Edition”, Addison Wesley, 2000.
For enhancing the storage, retrieval, and processing of data records, database systems typically maintain one or more indexes on database tables. A variety of indexes may exist or be created in a database system. For example, an index may be created on a table to support a primary key. Additionally, secondary indexes may be employed for various reasons, such as to improve performance of the database on a particular type of query. For instance, an index created on columns or groups of columns in a table may enable the page containing rows that match a certain condition imposed on the index columns to be located without requiring the database engine to scan all pages in a table to find matching rows.
One of the primary purposes of database systems is to provide information in response to requests submitted by users in the form of database queries. Database systems are typically used to answer both known queries and ad hoc queries. A known query is a query that is frequently run on the system (e.g., on a daily, weekly, or monthly basis). An ad hoc query is a one-time or infrequently run query which usually seeks information not provided by available known queries or reports. During query execution, whether for a known query or a totally ad hoc query, a database system's optimizer must decide at some point which index or indexes (if any) are available for expediting processing of the query.
The benefit of having indexes available is that the optimizer may use them to improve performance for certain types of queries. Over the years, various techniques have been developed to improve an optimizer's ability to select one or more indexes for query processing. There are two main ways that an optimizer may use an existing index to improve processing of a query. The optimizer may use an index to satisfy some ordering or grouping property of the query, or it may use the index to satisfy some predicate(s) that are on the query. With a modern database system, one can safely assume that if an index exists, then the system's optimizer is at least smart enough to check whether it can use that index to its advantage.
Indexes however are not without disadvantages. For example, indexes take up disk space (which can be rather substantial for large databases). Also, indexes incur a maintenance penalty, due to the fact that the indexes must correctly reflect any changes to the underlying table data that they are indexing. Therefore, for a table having a lot of new rows inserted or rows deleted, the database system must incur cycles to maintain the corresponding indexes as rows are inserted and/or deleted.
The ability to “fine tune” a database system has existed for many years. Database systems have provided statistics that allow database administrators to tune or adjust system configuration, in order to achieve better performance. More recently, database vendors have added some degree of automation to the process. For example, Microsoft provides a utility that can recommend indexes. However, that utility runs external to the database engine and typically must perform a substantial number of passes/iterations before reaching a useful recommendation.
IBM has an approach that uses the notion of a virtual index. As the core task, a tool must decide whether to create an index with a given signature. In IBM's system, a virtual index is created inside the database engine, which appears to the system's optimizer as an index available for use. However, the structure is virtual, not real, meaning that it only contains enough in-memory information for the optimizer; it is not a physical structure that must be stored on disk.
The IBM approach has disadvantages, however. The way the IBM system generates virtual indexes relies on a limited series of fixed permutations of the elements of a given query. This can cause very good indexes to be missed and may potentially require a huge number of candidate indexes to be considered. As a result, the performance of the IBM approach is suboptimal.
The IBM approach has an additional problem. In the task of selecting indexes, there is a notion of “separability.” This means that some indexes may be useful only if other indexes exist. Therefore, a system cannot achieve optimal results by simply recommending a list of indexes and then indicating that top performing ones should be picked. Here, the removal of some of the indexes (i.e., removal from consideration) may cause some of the better performing ones to also be unavailable for use. This can happen because the optimizer may require certain combinations of indexes to be present in order to use them.
IBM attempted to solve this problem with their implementation as follows. The IBM system creates a set of indexes in the database engine, and then the system attempts to pare down the list of indexes to meet the size constraint specified. Once a recommendation satisfying the given constraints has been made, the system tries to improve upon it by replacing small numbers of indexes in the proposed solution, accepting the modified solution if it appears better. As combinations are tried, the IBM system re-evaluates the benefit (e.g., cost benefit) provided by each index. It will continue considering perturbations of the solution until the user instructs it to stop. However, that approach does not adequately capture the notion of separability, since the cost model employed in selecting the base solution is not updated to take into account indexes that have been removed from the set until this final stage of trying permutations.
What is needed is an approach that recommends a set of indexes providing the best performance value, both in terms of storage (disk space) costs and maintenance costs (as data is updated). Further, the approach desired is one that supplies recommendations faster than currently-available solutions and gives better recommendations. The present invention fulfills these and other needs.