Databases have grown very large in size. When a user desires information, they issue a query, which asks for information, such as give me the names of people from Bozeman, Mont. with income greater than a certain amount. There may be a table containing a row for each person in the United States. Each row may have several columns, including income, social security number, street address, city, state, telephone number, names of dependents, occupation, etc. By searching this large table, a computer can find all the people in Bozeman, Mont., and then determine from that list of people, who has income greater than the specified amount. To do this search, the computer has to deal with the entire large table, retrieving it from storage into memory, and then searching row by row for the desired information.
One way to make this type of search more efficient, is to build indexes. An index makes it more efficient to retrieve a subset of data from a table required to answer a query. It helps identify subsets of rows from which further data may be required to answer the query. An index is a subset of a table, which typically contains fewer columns than the table itself. Indexes are sometimes created prior to a user query being made. Some indexes are arranged in a tree like structure, such as a B.sup.+ tree, which makes finding information even faster. If an index exists that contains the data that the user desires, it is much easier to simply search the index to provide the information.
In the case of the above query, an index could be generated based on State, and then people in Montana could be quickly identified. This index would be much smaller than the table, and the answer could be provided much more quickly because the entire table does not need to be searched. An index with income can also be generated to quickly identify people with income greater than the specified amount. An index on city could also be created to quickly identify people in the city of Bozeman. In each of these cases, the table would have to be consulted absent further relevant columns in the index.
A covering index for a query may also be used. A covering index for a query is an index that is a subset of the large table, yet has at least all the columns from a table needed to answer the query. For the example query, an index having city, state, people and income columns would be a covering index because it has all the columns needed to answer the query without resort to the table. Use of this covering index would be the fastest way to answer the query. However, it is likely larger than the other indexes.
A workload is a set of queries and updates which are run against a given database. A configuration is a set of indexes used to improve performance for the workload. Given a workload of multiple queries, the decision as to which indexes to include in the configuration is very complex and time consuming. Since there is overhead associated with generating, maintaining and storing the indexes, this must be offset against the benefit obtained by using them.
A query optimizer is used to obtain information about indexes given a set of queries. Optimizers are database tools that can take a representative workload, and return information about a plan of execution of the queries in the workload, as well as information about the cost of executing the plan. The optimizer provides detailed information in the form of a tree which has nodes corresponding to a plan of execution of the query. The nodes provide information on the indexes used, and the cost of using the index. From this cost information, a user can try to determine which indexes should be built or used to enable the workload to be processed in an efficient manner.
One way of determining which set of indexes provides the most benefit given a workload is to actually propose a set of indexes, build them, and then run the workload. The total cost of the workload is then calculated. Several different sets of indexes are measured in this manner to find the best set. This is a very expensive proposition, especially when it would need to be done several times for different sets of indexes. In other prior database systems, optimizers have been run for each proposed configuration or set of indexes to try and find the best set for a given workload and database system.
This is also a costly way to try to determine which set of indexes provides the best performance of the database system. As the workload changes, the entire process must be repeated to arrive at a new set of indexes.
A further approach takes semantic information such as uniqueness, reference constraints and rudimentary statistics ("small" vs. "big" tables) and produces a database design. Such designs may perform poorly because they ignore valuable workload information. A further approach uses an expert system like approach, where the knowledge of "good" designs are encoded as rules and are used to come up with a design. This approach can take workload information into account, but is disconnected from the query optimizer. This has adverse ramifications for two reasons. First, a selection of indexes is only as good as the optimizer that uses it. In other words, if the optimizer does not consider a particular index for a query, then its presence in the database does not benefit that query. Second, the expert system operates on its own model of the query optimizer's index usage. While making an accurate model is itself hard, ensuring consistency between the tool and an evolving optimizer is a software-engineering nightmare.
In yet a further approach, an index selection tool uses the optimizer's cost estimates to compare goodness of alternative hypothetical designs. This approach avoids asynchrony between the index selection tool and the query optimizer. It further reduces the amount of indexes to consider by eliminating a large number of indexes from consideration that provide little or no benefit for any query in the workload. It also searches subsets of candidate indexes and picks a configuration with a low total cost. Multi column indexes are also generated for consideration in this approach.
There is a need for a system that can more quickly determine a set of beneficial indexes given a workload. There is a further need for a system that does not need to repeatedly call an optimizer to determine a set of beneficial indexes.