Computer database systems manage the storage and retrieval of data in a database. A database comprises a set of tables of data along with information about relations between the tables. Tables represent relations over the data. Each table comprises a set of records of data stored in one or more data fields. The records of a table are also referred to as rows, and the data fields of records in a table are also referred to as columns.
A database server processes data manipulation statements or queries, for example, to retrieve, insert, delete, and update data in a database. Queries are defined by a query language supported by the database system. Due to the importance of the physical design several commercial database systems include tools for automating the choice of physical design structures such as indexes and materialized views to be included as part of the physical design of the database. These tools include the Tuning Wizard in Microsoft SQL Server 7.0 and 2000. One important aspect of these tools is that they take as input a workload consisting of SQL queries and updates and tune the physical design such that the performance of the workload is optimized. Such workloads can be obtained by using event logging capabilities in the database system, e.g. SQL Server Profiler in Microsoft SQL Server 2000. Often, the workloads provided as input to the tool can be large since the workload contains queries that execute over a relatively large window of time (e.g., a day or week). Faced with a large workload (i.e., large number of queries and updates), the running time of these tools can be become significant. In Tuning Wizard for Microsoft SQL Server 2000, there are three modes in which tuning can be performed: Fast, Medium, and Thorough. Each of these modes performs progressively more detailed analysis requiring more time, but produces better quality recommendations. However, physical design tuning is often done during a batch window, e.g., overnight, when the database server is not loaded or is offline. Note that none of the tuning modes can guarantee that tuning is completed within a specified period of time, and if the tuning process must be terminated prior to its completion, a significant amount of the work done by the tuning thus far may be lost.
In the realm of automatic database tuning, the task of index selection, i.e. automatically selecting an appropriate set of indexes for a given database and workload is challenging for several reasons. First, because database schema of real applications can be large and indexes can be defined on a sequence of one or more columns, the space of indexes that are relevant for the workload (i.e., can possibly be used to answer queries in the workload) can be very large. Second, today's query optimizers can exploit available indexes in sophisticated ways such as intersection of two or more indexes, use of covering indexes, etc. . . Thus it becomes important to take into account interactions among indexes. For example, the presence of a clustered index on each of the join attributes can result in a very efficient plan for the join (using the clustered index on each of the join attributes can result in a very efficient plan for the join (using the Merge Join operator)), whereas the presence of only one index but not the other may result in a plan with significantly higher cost. Third, the choice of indexes must be done while honoring certain constraints, e.g. limiting the amount of storage space allotted to indexes, and ensuring that at most one clustered index per table can be chosen. Thus index selection can be viewed as an optimization problem where the goal is to find the most beneficial subset of the relevant indexes for the workload that obeys the given constraints.
Recently, several major database vendors have begun shipping index selection tools to address the issue of interactions among indexes. Microsoft's SQL Server product uses a hill climbing algorithm, referred to as Greedy(m,k), as the search strategy. Greedy(m,k) starts by enumerating all sets of size m from the candidate set. It chooses the optimal set of size m and then greedily (based on the additional benefit accrued given the current configuration) adds the indexes until the size of the set reaches k. The value of k is determined by storage constraints. One drawback to the Greedy(m,k) approach is that it cannot provide a guarantee about the quality of the solution produced. This approach also lacks formal analysis and can end up doing significant work since it enumerates overall sets of size m. Moreover, even in the greedy phase, it evaluates every candidate index against the entire workload to find its incremental benefit.
For automatic index selection, IBM's DB2 product implements a Knapsack based algorithm underlying the index selection tool. This approach converts the Index Selection problem to a 0–1 Knapsack problem and then uses the greedy algorithm. The crucial step in the Knapsack approach is the assignment of benefits to indexes. In the DB2 approach, the benefit on an index belong to the best set of indexes for a query, is defined as the entire benefit of the winning set of indexes for that query. Also, the initial solution obtained from the greedy Knapsack is refined by calling a routine called “TRY_VARIATION” that creates a variant of the solution by randomly swapping a small set of indexes in the solution for a small set of indexes not in the solution. Like Greedy(m,k) IBM's Knapsack is unable to provide any guarantees about the quality of the solution produced. In addition, indexes that do not make a big difference to a query may be assigned a high benefit just because they are part of the winning configuration for that query meaning that this approach may overestimate the benefits of unimportant indexes.