A database is a collection of stored data that is logically related and that is accessible by one or more users. A popular type of database is the relational database management system (RDBMS), which includes relational tables made up of rows and columns (also referred to as tuples and attributes). Each row represents an occurrence of an entity defined by a table, with an entity being a person, place, thing, or other object about which the table contains information.
To extract data from, or to update, a relational table in an RDBMS, queries according to a standard database-query language (e.g., Structured Query Language or SQL) are used. Examples of SQL include INSERT, SELECT, UPDATE, and DELETE.
As applications become increasingly sophisticated, and data storage needs become greater, higher performance database systems are used. One example of such a database system is the TERADATA® database management system from Teradata Corporation. The TERADATA® database system can be a parallel processing system capable of handling relatively large amounts of data. In some arrangements, a database system includes multiple nodes that manage access to multiple portions of data to enhance concurrent processing of data access and updates. Concurrent data processing is further enhanced by the use of virtual processors, sometimes referred to as access module processors (AMPs), to further divide database tasks. Each AMP is responsible for a logical disk space. In response to a query, one or more of the AMPs are invoked to perform database access, updates, and other manipulations.
A physical storage structure that is provided by some database management systems is an index. An index is a structure that provides relatively rapid access to the rows of a table based on the values of one or more columns. An index stores data values and pointers to the rows where those data values occur. An index can be arranged in ascending or descending order, so that the database management system can quickly search the index to find a particular value. The database management system can then follow the pointer to locate the row containing the value.
The advantage of having an index is that it speeds the execution of SQL statements with search conditions that refer to an indexed column or columns. Generally, it is desired to create an index for columns that are used frequently in search conditions (such as in the Where clause of a SELECT statement).
In a database management system, primary and secondary indexes can be defined for each table. In a database system having multiple access modules, such as AMPs in a TERADATA® database management system, the primary index is used for assigning a data row to a particular one of plural AMPs. In effect, the primary index determines the distribution of rows of a table across multiple AMPs of the database system. Secondary indexes are used by a database system to more quickly identify portions of tables that are to be accessed in response to a database query.
Proper selection of indexes (such as the primary and secondary indexes) is important for optimal database performance. This is also referred to as the index selection problem, which can be a difficult problem when applied to a sophisticated parallel database system. Conventionally, many database designers rely mostly on their application experience and intuition to manually make index design decisions. With the increasing complexity of some database applications (e.g., data warehousing applications, which contain thousands of tables, indexes, and complex queries), the ability of a database designer to effectively perform tuning of indexes becomes increasingly difficult.
The problem of selecting indexes is made even more complex in the context of a multi-system database management system that has multiple database systems. For example, due to the size and nature of multi-system database machines, the search space of candidate indexes becomes very large such that the computations associated with traditional search algorithms used by conventional index selection tools are prohibitively expensive.