1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular to a method and system for compressing varying-length columns during index high key generation.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Organization (ANSI) and the International Standards Organization (ISO).
A typical database management system includes both database files and index files. The database files store data in the rows and columns of tables stored on data pages. In such a table, the rows may correspond to individual records while the columns of the table represent attributes of the records. For example, in a customer information table of a database management system, each row might represent a different customer while each column represents different attributes of the customers, such as the name of each customer, the amount owed by each customer and the cash receipts received from each customer.
Instead of providing for direct sorting and searching of the records in the tables, the database management system relies on the index files which contain information or pointers about the location of the records in the tables stored in the database files. The index file can be searched and sorted (scanned) much more rapidly than can the database files. An index file is scanned through transactions in which criteria are stipulated for selecting records from a table. These criteria include keys which are the attributes by which the database finds the desired record or records using the index. The actions of a transaction that cause changes to recoverable data objects are recorded in a log.
In database management systems all data are stored in tables on a set of data pages that are separate from the index file. A table can have one or more indexes defined on it, each of which is an ordering of keys of the row of the tables and is used to access certain rows when the keys are known. An index is often implemented with a tree structure consisting of leaf pages and non-leaf pages. A page is a physical unit of transfer between main storage and secondary storage. A non-leaf page contains a list of page numbers of other index pages, along with the high key values for those pages. A leaf page is the lowest level of the index tree. It consists of keys and their associated row addresses which are pointers to the rows in the table that have the given key value.
Generally, all of the indexes for a table are stored in an index file which contains records of only the key values and record identifiers (RIDs) of database table records containing these key values. An RID consists of a data page ID concatenated with a sequence number unique within that range. One common type of index files is a B-tree having N levels of nodes or pages. The starting node at the top of the tree is called the root node and defines the interval of key values that the B-tree index covers. In the successive lower levels of nodes, before the lowest level of nodes, this key value interval is broken up into key value sub-intervals. The leaf nodes or pages in the lowest level of the tree contain the individual key values within the interval, together with the associated record (row) identifications (RIDs) that enable the records having those key values as attributes to be located in the tables of the database files. The leaf pages of an index contain entries (keys) each of which is conceptually a {key-value, RID} pair where the RID is treated as if it were an extra key field. A non-unique index is one that may contain more than one key with the same key value. In contrast, a unique index cannot contain more than one key with the same key value. Keys are maintained in an ascending collating order on all key fields. Leaf pages alone contain next-page and previous-page pointers so that ascending and descending range scans can be supported. Non-leaf pages contain child page pointers. FIG. 1 illustrates a non-unique B-tree index.
In current database technology, when an index leaf page splits because it is too full, a new high key value is inserted into the non-leaf page. If the new high key is copied from the first key on the new page, the entire key is usually copied although some of the data maybe unnecessary. For non-leaf searches, just the significant portion of the high key needs to be stored to distinguish key values in consecutive children of the non-leaf pages. By comparing the last key value of the splitting leaf page with the first key value of the new leaf page, the differing byte between the two key values can be determined. Only data of the first key value up to and including the differing byte needs to be stored as the new high key value in the non-leaf page. The rest of the data is unnecessary for the non-leaf searches and can be truncated.
However, when the non-leaf key has varying-length columns that contain padding data, all the blanks up to the differing byte are stored in the key. Some of this padding data is unnecessary for the non-leaf key comparisons and can be removed. Even if the actual data processing time is very short, some operations may require a considerable amount of time as data must be stored and retrieved from larger storage space and must be input by the user requesting the transaction. Accordingly, it is important that the database management system permits the data processor to truncate the high key suffix data and store only the index key data necessary for the non-leaf comparison during searches.
Therefore, there is a need for a simple, optimized and generic method and system for compressing varying-length columns in non-leaf page index keys during high key generation. This compression method would require less key space in the non-leaf pages so more non-leaf entries can be stored in index trees. As a result, the non-leaf fanout will be increased by reducing the number of index levels, thus allowing faster index tree traversals in database management systems.