1. Field of the Invention
This invention relates to database management systems and more particularly relates to quantifying a data page repetition pattern for a database index in a database management system.
2. Description of the Related Art
There are two main types of database files, flat files and relational database files. Flat databases store data in a delimited text file. A typical flat database is delimited using a tab, coma, semicolon, or other special character. The delimiter separates database entries. A typical database entry contains multiple pieces of information called fields that are typically related in some way. For example, a tab delimited flat database file might include directory information, where a tab separates each entry. In this example, the entries probably contain a name, address, and phone number for each person in the directory. Unfortunately, searching flat databases generally wastes time and system resources, because each and every piece of information for every entry is typically scanned until the correct entry can be located.
An improved database organization method, commonly referred to as a relational database, emerged in the 1970s. The relational database arranges data in tables. The tables typically include columns and rows. In the example above, the directory entries may be entered in rows, such that the name is in the first column, the address in the second column, and the phone number in the third column for each entry in the relational database. A database index is a data structure that improves the speed of operations in a table. Indexes can be created using one or more columns. The disk space required to store the index is typically less than the storage of the table. Particularly, the index data structure may be created by copying a column of data from the table into a temporary storage device or memory. The index may include pointers, pointing to the entries associated with index key values. The index enables a Database Management System (DBMS) to more effectively search the database. For example, if a user of the directory database described above wants to find the phone number and address associated with a particular name in the directory, the DBMS may search a name index derived from the name column of associated tables to locate an entry associated with the particular name. Once the name is found in the index, the complete entry may be retrieved from the database by following the pointer associated with the index value. This search will be faster and more efficient because only the index corresponding to the name column is searched.
In a typical database, a predetermined amount of data is stored together on a storage volume in a group referred to as a data page. Data pages may be stored on a hard disk, array of hard disks, or other storage subsystem. Alternatively, data pages are stored on tape storage devices, or other storage media. The data pages are typically retrieved by the DBMS and cached in a temporary storage device such as Random Access Memory (RAM) which is used when the DBMS is searching for data.
Despite these improvements in database searching, system resource usage continues to be a common problem. For example, databases commonly used in corporations may include thousands or millions of entries. Additionally, the DBMS may have several users attempting to access these large databases simultaneously. Therefore effective system resource management becomes a problem. In such an example, a user typically submits a query to the DBMS which identifies the specific entry to search for, and the index to use for the search. If the database is larger than the capacity of the DBMS cache, the DBMS typically caches and searches a portion of the database at a time. If several users submit queries simultaneously, the number of data pages that can be cached for each search is reduced. These limitations typically impact system performance, search time, and the like.
In order to improve database performance, some systems include query optimization functions. A typical optimization function determines which database index to use and/or which data pages to cache for a specific query based on a database size statistic, a database query selectivity statistic, and a data clustering statistic. In a typical system, these three database statistics are collected during an offline collection process.
To illustrate these statistics, FIG. 1A illustrates a common relational database table. Each entry of the database table of FIG. 1A includes two fields, 1) the account number, and 2) the gender of the person associated with that account number. The account numbers are arranged in a column and the gender is arranged in a column. Each entry represents a row of the database table. The database table is separated into data pages, each page containing 10 rows of data. The complete database table includes 100 pages of data, which contain a total of 1000 entries. These values represent example size statistics.
In a common query, the DBMS finds entries associated with query criteria using the index. In this example, either the account number or the gender column may be used as the index for searching the database. Indeed, some combined indexes may include data from both columns. The DBMS determines whether an index keyed on account number or one keyed on gender and account number will be most effective (less costly) for satisfying the query. FIG. 1B illustrates an indexed keyed by account number. Each key value is associated with the page number that includes the associated row for the table illustrated in FIG. 1A. The query returns the page number where each record is found corresponding to the account number in the index. Since, the database of FIG. 1A is ordered in sequence by the account number, the index and the results of this query are order in sequence from page one to page one hundred. Additionally, since the data is sequential, the DBMS only needs to scan the index of FIG. 1B associated with the database of FIG. 1A once to identify every entry associated with the query.
Every page in the database of FIG. 1A includes exactly ten entries corresponding to an account number, and the account numbers are sequential. Therefore, the database has a cluster factor or cluster ratio of 100% with relation to the account number. The cluster factor is a composite statistic based on the density and sequential nature of the data in the database corresponding to a specific index. Cluster factors are commonly used in cost-based optimization processes.
FIG. 1C illustrates an index defined as gender by account number. The DBMS searches the entire index searching first by gender and then by account number. So for example, the first entry which includes female as the gender is found on the first row of the first page. However, the second entry is found on the tenth row of the first page. Since the account number is sequential, the results are still sequential, but the data is not as dense as the data returned from the query illustrated in FIG. 1B. Therefore, the cluster factor of this index is less than 100%.
While the cluster factor does provide cost information which is useful information regarding the sequential nature, the cluster factor does not provide a complete picture. For example, as illustrated in FIG. 1C, depending on the query, the DBMS may be required to load the entire database twice to identify all of the entries listed by order in the index. The first scan may identify all of the female entries by account number, and the second scan may identify all of the male entries by account number. The cluster factor does not provide any information regarding the number of times a data page must be referenced to complete the query, however, this information would be useful in a cost-based optimization process.