The present invention relates generally to the field of computerized data processing, and more particularly to optimizing data organization of tables and indexes in databases, based on data distribution and growth rate of index keys.
A relational database is a digital database whose organization is based on the relational model of data. This model organizes data into one or more tables, or relations, of rows and columns, with a unique key for each row. Generally, each entity type described in a database has its own table, the rows representing instances of that type of entity and the columns representing values attributed to that instance. Because each row in a table has its own unique key, rows in a table can be linked to rows in other tables by storing the unique key of the row to which it should be linked. Data relationships of arbitrary complexity can be represented using this set of concepts. The various software systems used to maintain relational databases are known as Relational Database Management Systems (RDBMS). Virtually all relational database systems use SQL (Structured Query Language) as the language for querying and maintaining the database.
A table in a relational database may have zero or more indexes. An index is a data structure (most commonly a B-tree) that exists physically on disk. It generally contains one or more ordered columns of the table and logical pointers to data rows in the table. An index is organized in a way that enables direct and fast access to the data in the table. An index key is the set of columns identified in the description of the index. In the absence of an appropriate index, a query may incur a costly full table scan.
In a typical SQL database, the fundamental unit of data storage is the page, or block. The disk space allocated to a data file in a database is logically divided into pages numbered consecutively. Disk I/O operations are performed at the page level. That is, the database server reads or writes whole pages. Data pages include data rows of a table, typically put on the page serially, starting immediately after a header with control information. Index pages include index entries.
If an index is defined for a key with column values that are not continuously increasing (i.e., if key column values for rows newly inserted into the associated table are likely to be less than the key's current maximum column value), then throughput for high-volume insert operations may be constrained by index page split activity. Unlike rows in a table, entries in an index are physically ordered by key column value, which leads to index page splits when new values are inserted into the middle of an index, as will be the case, for example, for an index defined on a non-continuously increasing key column. If the page into which a new index entry must go is full, the page will be split, and a portion of the page's entries will be moved to a previously empty page in the index. Such index splits are generally considered a costly process.
A table space is a storage location where the actual data underlying database objects can be kept. It provides a layer of abstraction between physical and logical data, and serves to allocate storage for database objects, such as tables and indexes, that occupy physical space.
A clustering index (sometimes called a clustered index or index-organized table) determines how rows are physically ordered, or clustered, in a table space. That is, for a table with a clustering index, the order of the rows in the data pages corresponds to the order of the rows in the index. Clustering indexes may provide significant performance advantages for some database operations, particularly those that involve a range of rows. When a table has a clustering index, an insert statement typically causes rows to be inserted as nearly as possible in the order of their index values. As a result, rows in a column are likely to be close together, such that the database server can generally access all the rows in a single read. However, using a clustering index does not guarantee that all rows for a given column value are stored on the same page. The actual storage of rows depends on the size of the rows, the number of rows, and the amount of available free space. Moreover, some pages may contain rows for more than one column value. In what follows, the term index will generally refer to a clustering index.
Reserving free space in table spaces may enable data to remain clustered longer between reorganizations and can reduce the number of overflow rows and indirect references, which can harm performance. Unlike index pages, table space pages are not split. Free space is needed for inserts in clustering order and updates where the size of the row has increased and there is not enough room on the page. For inserts, the row will go to another page. For updates, the row is moved to another page, but the indexes still point to the original page, which in turn points to the new location. This is called an indirect reference.
Reorganizing tables and indexes restores physical continuity of table data and rebuilds indexes into unfragmented, physically contiguous pages. However, reorganizations are time consuming and a database table may be unavailable during reorganization. As a result, the frequency of table/index reorganization may only be, for example, weekly or monthly.
When insufficient free space is available for insert or update operations, new rows may be appended at the end of a table, out of clustering sequence. When updates to existing rows cause them to no longer fit on the original page, indirect references to one or more overflow rows on different data pages may be created. When many of these rows are physically located out of sequence, performance suffers. On the other hand, specifying too much free space may also have disadvantages, including using more disk space for the same amount of data, transferring less data in a single I/O operation, and the data occupying more pages that must be scanned.