Computer database systems manage the storage and retrieval of data in a database. Databases include a set of tables of data along with information about relations between the tables. Tables represent relations over the data. Each table includes a set of data records stored in one or more data fields. The data records of a table are often referred to as rows, and the data fields across the data records in a table are referred to as columns.
A database server processes data manipulation statements or queries, for example, to retrieve, insert, delete, and update data in a database. Queries are defined by a query language supported by the database system. Users can write queries and run them against a database, for example, to retrieve desired information from the database. In general, to retrieve requested information, a database server can retrieve an appropriate table from storage, load it into memory, and search row by row for the desired information. However, as databases have grown very large in size, searching for data in this manner can be very time consuming and inefficient.
In order to make this type of search more efficient, indexes can be built on the database. An index is a subset of a table that typically contains fewer columns than the table itself. Indexes are sometimes created prior to a user query being made. Some indexes are arranged in a tree like structure which makes finding information even faster. If an index exists that contains the data that the user desires, it is much easier for the database server to just search the index to provide the information than it is to search through the entire table.
Another way to improve the performance of a query is to use a materialized view, also referred to as an indexed view. A materialized view represents a particular view of the data (e.g., the results of a query) that has been materialized, and it may be used to answer different queries. A materialized view has some characteristics that a traditional index does not have. Materialized views can be defined over multiple tables and can have selections and groupings made over multiple columns. As with traditional indexes, materialized views incur an overhead each time the tables referenced in the materialized view are updated. The use of indexes and materialized views can provide a dramatic performance improvement when retrieving, inserting, deleting, and updating data in a database.
The configuration or physical design of a database includes a set of indexes, materialized views, and indexes on materialized views used in the execution of a workload against the database. A workload is a set of queries and updates that are run against a given database. Given a workload of multiple queries, determining which indexes and materialized views to include in a physical design is very complicated and time consuming. Since there is overhead associated with generating, maintaining and storing the indexes and materialized views, this must be offset against the benefit obtained by using them.
Current commercial database systems typically offer two separate tools to assist database administrators with the task of choosing the physical design of a database. There are tools sometimes referred to as tuning components that, when given a query workload (i.e. a set of database queries and the individual queries' frequencies), suggest a physical design that optimizes the performance of the workload. In cases where the workload executing against the system is not known to the administrator, there exist additional database monitoring tools that observe the queries executed against the database system and log the corresponding query statements. Despite the existence of these tools, automating the tuning loop for database physical design has yet to be achieved.
A significant problem in automating the tuning loop for database physical design involves the collection of the query workload that is currently executing on the system. Conventional database servers monitor query workload and other database activity through two types of tools; an event polling tool, and an event logging tool. A polling tool pushes out a snapshot of all activities occurring in the database at a given moment, and then it aborts. Drawbacks with this tool are that polling the server too infrequently will cause the monitoring application to miss valuable information, while polling more frequently can impose significant CPU overhead on the server. Event logging tools allow an administrator to specify events for which the database can publish notifications and additional related information. Although logging tools capture all the specified events, significant overhead can be incurred by the database server. Commercial servers often execute a thousand or more queries per minute, so potentially large numbers of events may need to be written out of the server to a file or table, or sent over a network.
Thus, capturing query statement text and events associated with a query (e.g., query statement begin and end, lock acquire and release, user login and logout, etc.) can introduce significant performance degradation, since logging this information to stable storage requires significant overhead. In addition, existing tuning tools are designed for relatively small workloads and they do not scale up to workloads containing millions of queries collected by monitoring busy database servers over longer periods, even when techniques such as workload compression are employed.
Both of these scalability-issues can be addressed by using only a subset of the workload (which may be selected, for example, by sampling). However, the physical design resulting from tuning a subset of the workload may lead to significant degradation of the queries that have been omitted from the collected subset.
Accordingly, a need exists for a way to collect a subset of a query workload and tune a database physical design using the subset while not degrading the performance of those queries not included in the workload subset.