The present invention relates to database table indexing, and more specifically, to intelligent, adaptive index density in a database management system.
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. An index can be created using one or more columns of a database table, providing the basis for both rapid random look ups and efficient access of ordered records.
The density of an index measures the uniqueness of values within a data set. Density is determined by dividing the number of rows that correspond to a given key by the number of rows in the table. Indices may be classified as ordered indices which fall into two types: dense and sparse. In a dense index, an index record appears for every search key value in file. This index record contains the search key value and a pointer to the actual record. In a sparse index, index records are created only for select records. To locate a record, an index record having the largest search key value less than or equal to the search key value that is being looked for is searched. The record pointed to by the index record is the beginning location, and the pointers in the file are examined (sequentially) until the desired record is found.
Dense indices are faster in general, but sparse indices require less space and impose less maintenance for insertions and deletions. Further, having many dense indices will consume a lot of main memory and is costly in terms of bringing a block into main memory (i.e., more dense indices may not fit into limited space in main memory resulting in swapping data between memory and disk, whereas sparse indices tend to be smaller and so may be able to fit all its data into memory).
Access patterns for indices may be classified as clustered (i.e., sequential access patterns) and unclustered (i.e., random access patterns). An unclustered index is structured in a way that does not correspond to the order of the actual data records. It resembles the words index at the back of a book. For this reason, it typically does not perform as well as clustered indices on ranged searches where the result set is large, since each result might cost an additional input/output (I/O)-operation to get the actual data record.
Clustering alters the data block into a certain order to match the index; hence, it is also an operation on the data storage blocks as well as on the index. An address book ordered by first name resembles a clustered index in its structure and purpose. The exact operation of database systems varies, but because storing data is very redundant, the row data can only be stored in one order. Therefore, only one clustered index can be created on a given database table. Clustered indices can greatly increase overall speed of retrieval, but usually only where the data is accessed sequentially in the same or reverse order of the clustered index, or when a range of items are selected.
Since the physical records are in this sort order on disk, the next row item in the sequence is immediately before or after the previous one, and so fewer data block reads are needed. The primary feature of a clustered index is therefore the ordering of the physical data rows in accordance with the index blocks that point to them. Some databases separate the data and index blocks into separate files, others put two completely different data blocks within the same physical file(s).
Current storage disk technologies provide two types of hybrid disks: solid state disks (SDDs) and hard disk drives (HDDs). Solid state disks and other devices based on NAND flash memory allow many more random I/Os per second than conventional magnetic disks. Thus, in principle, they can support workloads involving random I/Os much more effectively. However, flash memory does not support general in-place updates. Instead, a whole data page is written to a new area of the device, and the old page must be invalidated. Groups of contiguous pages form erase units, and an invalidated page becomes writeable again only after the whole erase unit has been cleared. Erase times are relatively high. Flash-based memory does, however, allow in-place changes of 1-bits to 0-bits without an erase cycle. Thus, it is possible to reserve a region of flash memory initialized to all 1s, and incrementally use it in a write-once fashion.
A solid-state drive (SSD) is a data storage device that uses solid-state memory to store persistent data. SSDs are external storage devices designed on the basis of semiconductor memory. One advantage of SSD is that it does not contain any mechanical moving parts, in contrast to HDD which has moving parts. Other advantages of SSD over HDD include the access time of the SSD is much faster, the read speed of the SSD is higher, the SSD is lighter in weight, and the power consumption of SSD is typically lower.
As indicated above, dense indices are faster in general, but sparse indices require less space and impose less maintenance for insertions and deletions. Further, having many dense indices will consume a lot of main memory and is associated with high costs in terms of bringing a block into main memory. Recent memory management systems have improved the number of queries that a particular microprocessor can perform in a given unit of time. Furthermore, the use of multiple microprocessors and/or multiple networked computers has further increased the capacities of many database management systems.
From a software standpoint, the use of relational databases, which organize information into formally defined tables consisting of rows and columns, and which are typically accessed using a standardized language such as Structured Query Language (SQL), has substantially improved processing efficiency, as well as simplified the creation, organization, and extension of information within a database. Furthermore, significant development efforts have been directed toward query “optimization,” whereby the execution of particular searches, or queries, is optimized in an automated manner to minimize the amount of resources required to execute each query.
Through the incorporation of various hardware and software improvements, many high performance database management systems are able to handle hundreds or even thousands of queries each second, even on databases containing millions or billions of records. However, further increases in information volume and workload are inevitable, so continued advancements in database management systems are still desirable.