Present invention embodiments relate to database access techniques that minimize costs associated with maintaining database indices with respect to costs associated with maintaining range (or zone) maps.
Searching for information using a query may result in a search of a large database table when an evaluation of the query indicates that the large database table should be scanned. In many data warehouse appliances, the hardware and software resources are balanced so that all components (e.g., storage, networking, CPU and accelerator resources) are busy when performing a large database table scan. The performance of such queries is improved not by optimizing any one component of the data path, but by identifying subsets of the rows with column values which do not need to be scanned. The “zone map” mechanism enables one form of resource optimization, while indices or indexes represent another form of data mapping.
For example, a zone map comprises metadata about each storage region in the form of range values or range maps that define minimum (min) and maximum (max) values for ranges in a given storage region. In other words, a zone map may comprise a plurality of range maps and as such, both zone maps and range maps comprise min and max values for a storage region (set of data) and in that sense the terms may be interchangeable. With respect to indices, an index comprises metadata that may point to a particular record or the start of a range of records within a storage region. Thus, in one sense, a range map may be considered as a range within which a given value may be found, should that value exist (i.e., the zone map range is inclusive, but there may not be an actual record within that range for a given index value). The zone map may be used to filter storage regions for the possible inclusion of desired content before actually reading and searching the stored data. In other words, a zone map may be used to eliminate rows in the large database table from consideration early in the scanning sequence before any unnecessarily large processing overhead has been incurred. In one example, if a storage region is known to contain records with column values between 100 and 200 (e.g., as stored in the range map metadata), then when a query with range values outside of that known range (e.g., a query with a value of 500) is evaluated, the evaluation can eliminate that storage region from being scanned.
The size of a given storage region for which metadata is maintained defines a resource tradeoff between processing resources and storage resources. For example, if the storage region is relatively large (e.g., three megabytes (MBs)), then a query that would otherwise return a relatively small data subset (e.g., 128 kilobytes (KBs) or 0.125 MB) requires loading the entire 3 MB storage region into memory for scanning, thereby increasing the use of memory resources, interconnect bandwidth from storage into memory, and storage device read bandwidth. If the scan region covered by the metadata is relatively small (e.g., 128 KB), then the volume of metadata becomes larger relative to the size of the metadata describing a 3 MB data storage region, thereby increasing the use of processing resources used to analyze a large amount of metadata. In this regard, a zone map itself requires storage (e.g., for example 24 bytes per storage region). If the scan region for each range map is too small, then the overhead incurred for creating and processing zone maps may become relatively large. Accordingly, zone maps and indices may become inefficient in certain respective situations.
Regarding indices, these data constructs can be inefficient to keep up-to-date. For example, upon each insertion of a row into a table, every index for that table requires an update and it is desirable to reduce the number of indices. On the other hand, indices can be much more efficient than zone maps for some queries. By way of example, consider a table with 1 billion rows, with a unique key, but stored in random order with respect to that unique key. An index will identify a single storage region to be read for any query on a specific key value. In contrast, a zone map will likely require reading almost all storage regions for the same questry, because each storage region contains a wide range of keys due to the rows being stored in random order. Thus, it is desirable to have a mechanism for automatically identifying such situations when indices are more effective, and maintaining an index in only these identified situations.