As the amount of information stored in database systems grows, data or complete records are often stored at more than one database storage site. One important aspect of database programs is the ability to provide fast and efficient access to the records in each individual database. To properly handle the distribution and retrieval of the data, data processing systems often include database management programs. These programs provide easy access to databases, each of which may consist of multiple records stored at many nodes or sites. Relational database management programs provide this capability.
One common configuration of a database is one that is made up of various tables, with each table containing rows and columns of information. The information stored across one row of the table makes up one record, and the fields of the record are the columns of the table. In other words, the table contains rows of individual records and columns of record fields. Other database configurations are found in the art. Database management programs support multiple users, thereby enabling each user to access the same table concurrently.
An index file is commonly used by database management programs to provide quick and efficient associative access to a table's records. These index files are commonly configured in a B-Tree structure, which consists of a root node with many levels of nodes branching from the root node. The information contained in these nodes may include pointers to the nodes at the next level of the tree, or it may include pointers to one or more records stored in the database. These pointers include additional key record information which may reference the records stored in the database. The record keys are stored in an ordered form throughout the nodes at the various branches of the tree. For example, an index tree may exist for an alphabetic listing of employee names. The root node would include reference key data that relates to individual record information that may be indirectly or directly referenced by the next level of nodes in the tree. The reference keys contain information about the index field, e.g., the alphabetic spelling of the employee's name. Therefore, the ordered keys in the root node would point to the next successive level of nodes. In other words, the first node on the next level may indirectly or directly reference all employees whose names begin with A, B, or C. A second node on the same level may reference the records of employees whose last names begin with the letters D-M. The last node on this level would reference the records of employees whose last names begin with N-Z. As one searches through the index tree, a bottom node is eventually reached. The contents of the bottom node may include record key information that further points to individual records in storage or may point back to one of the branch nodes in the tree.
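The search through such an index tree can be illustrated with a minimal sketch. It is not taken from any particular database product: the `Node` class, the `search` function, the separator keys "D" and "N", and the sample names and record pointers are all hypothetical, chosen only to mirror the A-C, D-M, and N-Z example above. A real B-Tree also handles node splitting, balancing, and disk pages, which are omitted here.

```python
from bisect import bisect_left, bisect_right


class Node:
    """A node of a simplified, read-only index tree (illustrative only)."""

    def __init__(self, keys, children=None, pointers=None):
        self.keys = keys            # keys kept in ascending order
        self.children = children    # child nodes for an interior node, else None
        self.pointers = pointers    # record pointers for a bottom (leaf) node


def search(node, key):
    """Descend from the root to a bottom node and return the record pointer."""
    while node.children is not None:
        # keys[i] is the smallest key reachable through children[i + 1]
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.pointers[i]
    return None  # key is not present in the index


# Three bottom nodes covering names A-C, D-M, and N-Z (sample data).
leaf_ac = Node(["Adams", "Baker", "Clark"], pointers=[101, 102, 103])
leaf_dm = Node(["Davis", "Jones", "Miller"], pointers=[104, 105, 106])
leaf_nz = Node(["Nolan", "Smith", "Young"], pointers=[107, 108, 109])
root = Node(["D", "N"], children=[leaf_ac, leaf_dm, leaf_nz])

print(search(root, "Jones"))
```

Because the keys are ordered at every level, each step discards all but one branch, so the number of nodes visited grows with the height of the tree rather than with the number of records.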
For parallel databases or distributed database systems, the problem of accessing a table partitioned across multiple storage sites becomes more complicated. One or more partitions of a table may be stored in a single site. Each partition of the table typically is associated with a group of physical storage devices. Typically a partition is a horizontal portion of a table's records. The motivations for horizontally partitioning a database object may be to partition a very large table of information, such as all the employees' information for a large corporation, among multiple storage sites so as to facilitate parallel processing of a user's query or to allow each node to retain efficient access to its own locally stored records. Another motivation may be to partition a large database table across multiple storage sites so as to facilitate better administration of the physical storage volumes.
A database object may be partitioned either horizontally or vertically according to the content of its records and fields. A horizontal partition would mean that certain rows of the table would be stored at one storage site while other rows in the table are stored at other storage sites. A vertically partitioned table would have certain columns or fields stored at one storage site while other fields would be stored at other sites. Separate index trees might be built for each of the partitions. One tree may contain the names and addresses of employees A-J while another tree contains names and addresses of employees K-L and so on. In such a manner, very large volumes of record information can be stored across multiple storage sites with the table partitioning method depending on the type of information stored and the application.
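The difference between the two partitioning directions can be shown with a small sketch. The table contents, the column names, and the helper names `horizontal_partition` and `vertical_partition` are hypothetical and serve only as an illustration; the sketch assumes that each vertical fragment keeps the record's identifying field so that fragments can be matched up again.

```python
# A tiny employee table: each dict is one row (record), each key one column (field).
employees = [
    {"serial": 1, "name": "Adams", "address": "1 Elm St"},
    {"serial": 2, "name": "Kelly", "address": "2 Oak Ave"},
    {"serial": 3, "name": "Young", "address": "3 Pine Rd"},
]


def horizontal_partition(rows, predicate):
    """Keep whole rows that satisfy the predicate (a subset of the records)."""
    return [row for row in rows if predicate(row)]


def vertical_partition(rows, key, columns):
    """Keep only some columns of every row, plus the identifying key column."""
    return [{key: row[key], **{c: row[c] for c in columns}} for row in rows]


# Horizontal: names A-J at one site, the remaining names at another.
site_a_to_j = horizontal_partition(employees, lambda r: r["name"][0] <= "J")
site_k_to_z = horizontal_partition(employees, lambda r: r["name"][0] > "J")

# Vertical: names at one site, addresses at another, both keyed by serial number.
site_names = vertical_partition(employees, "serial", ["name"])
site_addresses = vertical_partition(employees, "serial", ["address"])
```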
A Relational Database Management System (RDBMS) may be used to manage the table information that has been distributed across multiple partitions or nodes. In the case where a database table is partitioned according to the content of its records, one or more fields of a particular table record can be designated as the Partition Key of that individual record. One case might be to designate the employee serial number as the partition key of that employee's record and store in each partition a set of records containing serial numbers within a certain range of values. A different partitioning criterion may group the records directly by their Partition Key Values, which might be some other piece of information contained in the record, such as the employee's work location, and may further determine a partition by hashing on the value of the work location field. On the other hand, a database table may also be partitioned using a non-content-based criterion, such as some inter-table relationship that is not related to the information contained in the employee's record, but rather to an insertion storage site or node.
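The two content-based criteria described above, a range of serial numbers and a hash of the work location, might be sketched as follows. The boundary values, the partition count, and the function names are assumptions made for illustration, and CRC-32 stands in for whatever hash function a given system actually uses.

```python
import zlib
from bisect import bisect_right

# Hypothetical range boundaries: partition 0 holds serial numbers below 1000,
# partition 1 holds 1000-1999, and partition 2 holds 2000-2999.
RANGE_UPPER_BOUNDS = [1000, 2000, 3000]


def range_partition(serial_number):
    """Map a serial number to a partition by the range its value falls in."""
    pid = bisect_right(RANGE_UPPER_BOUNDS, serial_number)
    if pid >= len(RANGE_UPPER_BOUNDS):
        raise ValueError("serial number is outside every defined range")
    return pid


def hash_partition(work_location, partition_count):
    """Map a work location to a partition by hashing its value."""
    # zlib.crc32 is deterministic across runs, unlike Python's built-in hash()
    # for strings, so the same location always maps to the same partition.
    return zlib.crc32(work_location.encode("utf-8")) % partition_count


print(range_partition(1500))
```

Range partitioning keeps neighboring key values together, which suits range queries, whereas hashing tends to spread records more evenly but scatters neighboring values across partitions.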
One problem in the art has been to support associative searches efficiently. Indexes are often maintained on the search field or fields of the stored data in order to provide associative search efficiency.
An index typically consists of a separate table or list of entries having the form (INDEX KEY, RECORD POINTER). This index table is typically ordered by the value of the INDEX KEY which might be some particular piece of record information, and is typically configured in a B-Tree structure as described above. The value of the INDEX KEY may be the employee's serial number or some other record information. An ordering of the index table by the value of the INDEX KEY entry facilitates the search by narrowing the list of candidate records and thereby reducing the access time to the record or records for the user requesting it or them. The RECORD POINTER is the other index table entity which can be a piece of information of fixed-length such as a system-assigned token called a Record Identifier (RID). In some database configurations, the RECORD POINTER may be user-provided. In any case, the RECORD POINTER uniquely identifies a data record.
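As an illustration of the (INDEX KEY, RECORD POINTER) form, the following sketch keeps the entries in a flat list ordered by INDEX KEY and locates matching entries by binary search. The flat list is a simplifying assumption in place of a B-Tree, and the function names and sample values are hypothetical.

```python
from bisect import bisect_left, insort


def index_insert(index, index_key, rid):
    """Add an (INDEX KEY, RECORD POINTER) entry, keeping the list ordered."""
    insort(index, (index_key, rid))


def index_lookup(index, index_key):
    """Return the record identifiers of all entries with the given INDEX KEY."""
    # (index_key,) sorts before every (index_key, rid) entry, so bisect_left
    # lands on the first entry for this key, if one exists.
    i = bisect_left(index, (index_key,))
    rids = []
    while i < len(index) and index[i][0] == index_key:
        rids.append(index[i][1])
        i += 1
    return rids


index = []
index_insert(index, "Smith", 7)
index_insert(index, "Adams", 3)
index_insert(index, "Smith", 2)
print(index_lookup(index, "Smith"))
```

Because the list is ordered, the search examines only a logarithmic number of entries before reaching the candidates, instead of scanning every record.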
For partitioned data, an index called a Local Index may be maintained separately for each individual partition of the table. If no single index that references data in multiple partitions is maintained, this is known as the Local Index Only solution to the associative access efficiency problem. The Local Index Only solution is a simple way to provide indexing capability for partitioned data. In this solution, the Local Index may be a table or list similar to the index table previously discussed.
The simplicity of the Local Index Only solution comes with a severe performance penalty which is disabling in very large databases. Since only local indexes exist at each partition site, most access requests are broadcast to all the partitions for processing. Each node has to check its table to see if the desired record information exists at that node. The Local Index Only solution also requires that all partitions of a table be available in order to properly evaluate most access requests. Over a system with multiple nodes, precious processing resources may be consumed by useless activity.
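The broadcast behavior can be made concrete with a small sketch. The `Partition` class, the `broadcast_lookup` function, and the sample records are hypothetical; the sketch assumes a request on a field that is not the partitioning field, so that no partition can be ruled out in advance.

```python
class Partition:
    """One partition holding its own records and its own Local Index."""

    def __init__(self, pid, records):
        self.pid = pid
        self.records = records      # maps record identifier -> record
        self.local_index = {}       # maps name -> record identifiers, local only
        for rid, record in records.items():
            self.local_index.setdefault(record["name"], []).append(rid)

    def probe(self, name):
        """Answer the request from this partition's Local Index."""
        return [self.records[rid] for rid in self.local_index.get(name, [])]


def broadcast_lookup(partitions, name):
    """Send the request to every partition and count how many were probed."""
    results, probes = [], 0
    for partition in partitions:
        probes += 1
        results.extend(partition.probe(name))
    return results, probes


partitions = [
    Partition(0, {1: {"name": "Adams"}, 2: {"name": "Baker"}}),
    Partition(1, {1: {"name": "Clark"}}),
    Partition(2, {1: {"name": "Davis"}}),
    Partition(3, {1: {"name": "Evans"}}),
]
matches, probes = broadcast_lookup(partitions, "Clark")
```

In this sample, one record is found but all four partitions are probed, and every one of them must be available for the answer to be known complete.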
Processing resources are not consumed only by user requests. The access requests sent to each site are not always only those explicitly specified by the users. There may be low-level requests generated by the database management system in processing and evaluating the higher-level user requests. There may also be system requests to enforce certain database constraints which maintain referential integrity across the multiple storage sites. In addition, an access request may need to obtain certain information that pertains to the entire object, such as checking the existence of a particular key value to enforce key uniqueness. Moreover, the query response time is lengthened because of the time spent waiting for all the local nodes to complete their respective operations before undertaking the next set of instructions. A longer query, in which a large amount of information across multiple partitions is accessed, may cause significant performance degradation. The performance impact may increase quickly with the number of partitions that make up a table. As a result, the database workload is significantly increased and the system throughput is ultimately reduced, making such a system appear sluggish to the user.
Because of the useless activity, the Local Index Only approach is not a scalable solution to associative searching. In other words, the Local Index Only solution does not continue to perform well as the number of partitions of the table begins to increase dramatically.
To provide more efficient indexing support for partitioned data, a Full Global Index, which is an index covering all the partitions of the indexed table, can be utilized. A Full Global Index contains at least one entry for each object of interest in the table thereby having a one-to-one relationship with every object of interest in the entire table. One approach to global indexing is called the Primary Key Approach, wherein a Global Index is maintained as a list of entries having the form (INDEX KEY, PRIMARY KEY), wherein the primary key is the partition key. In this case, each data record is uniquely identified across all partitions by a user-provided PRIMARY KEY value. The Partition Identifier (PID) of the targeted partition can be determined using the PRIMARY KEY value in conjunction with the partitioning criterion. Because the records must be stored in a way that allows them to be retrieved using the PRIMARY KEY only, this leads to a database design in which a clustering Local Index is maintained on the PRIMARY KEY, with the PRIMARY KEY being the value of the Clustering Key. One approach is to store the record itself in the tree of the index and possibly avoid an extra Input/Output operation for accessing the records.
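A minimal sketch of the Primary Key Approach follows. It assumes, purely for illustration, that the PRIMARY KEY is the employee serial number, that the partitioning criterion is a range of serial numbers, and that the INDEX KEY is the work location; the names `pid_for`, `insert`, and `lookup` are hypothetical. Each partition is modeled as a mapping keyed by PRIMARY KEY, standing in for a clustering Local Index.

```python
from bisect import bisect_right

# Hypothetical partitioning criterion: serial numbers below 1000 go to
# partition 0, 1000-1999 to partition 1, and 2000 or above to partition 2.
BOUNDS = [1000, 2000]


def pid_for(primary_key):
    """Derive the Partition Identifier (PID) from the PRIMARY KEY value."""
    return bisect_right(BOUNDS, primary_key)


partitions = [{}, {}, {}]   # per partition: PRIMARY KEY -> record
global_index = {}           # INDEX KEY -> list of PRIMARY KEY values


def insert(record):
    """Store the record in its partition and add a Global Index entry."""
    primary_key = record["serial"]
    partitions[pid_for(primary_key)][primary_key] = record
    global_index.setdefault(record["location"], []).append(primary_key)


def lookup(location):
    """Fetch matching records, visiting only the partitions that hold them."""
    results, visited = [], set()
    for primary_key in global_index.get(location, []):
        pid = pid_for(primary_key)
        visited.add(pid)
        results.append(partitions[pid][primary_key])
    return results, visited


insert({"serial": 17, "location": "Austin"})
insert({"serial": 1200, "location": "Boston"})
insert({"serial": 2500, "location": "Austin"})
records, visited = lookup("Austin")
```

Here the request for one work location touches partitions 0 and 2 only; partition 1 is never contacted, in contrast to a broadcast to every partition.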
An alternative to the Primary Key Approach is the Partition Key Approach, wherein a Global Index entry has the form (INDEX KEY, PARTITION KEY). Here the PARTITION KEY is not the Primary Key; otherwise, this approach would be the same as the Primary Key Approach discussed above. The PARTITION KEY must be unique; otherwise, a selection predicate applied to the INDEX KEY must be re-applied to the retrieved records in order to assure correct data retrieval. This often leads to a database design in which a clustering Local Index is maintained on the PARTITION KEY, with the PARTITION KEY value being the Clustering Key value. A Global Index improves the efficiency of the evaluation of the user's query by allowing an access request to be redirected only to the relevant partitions, and by providing globally available INDEX KEY information quickly. It has the drawback of increasing the index management cost, which is the cost to the database management system of assuring consistency between index tables and the data records, especially in the configuration wherein each partition is stored at a separate database storage site.
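The need to re-apply the selection predicate when the PARTITION KEY is not unique can be seen in a small sketch. It assumes, for illustration only, that the INDEX KEY is the employee name and the PARTITION KEY is the work location, which several employees share; all names and sample values are hypothetical.

```python
# Records clustered by a non-unique PARTITION KEY (the work location).
records_by_partition_key = {
    "Austin": [
        {"name": "Adams", "location": "Austin"},
        {"name": "Baker", "location": "Austin"},
    ],
    "Boston": [
        {"name": "Adams", "location": "Boston"},
        {"name": "Clark", "location": "Boston"},
    ],
}

# Global Index entries of the form (INDEX KEY, PARTITION KEY).
global_index = [
    ("Adams", "Austin"),
    ("Adams", "Boston"),
    ("Baker", "Austin"),
    ("Clark", "Boston"),
]


def lookup_by_name(name):
    """Return the matching records and the number of records fetched."""
    partition_keys = {pk for key, pk in global_index if key == name}
    fetched = [
        record
        for pk in sorted(partition_keys)
        for record in records_by_partition_key[pk]
    ]
    # The PARTITION KEY identifies a group of records rather than one record,
    # so the predicate on the INDEX KEY is re-applied to what was fetched.
    matching = [record for record in fetched if record["name"] == name]
    return matching, len(fetched)


matching, fetched_count = lookup_by_name("Baker")
```

In this sample, looking up one name fetches both records stored under the shared work location, and only the re-applied predicate removes the record that does not match.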
Other papers that the applicant believes are pertinent to an understanding of the background of this invention include the following: Levine et al., "Method For Concurrent Record Access, Insertion, Deletion and Alteration Using An Index Tree", U.S. Pat. No. 4,914,569, (Apr. 3, 1990) wherein a method for fetching key record data in a group of record keys according to at least a portion of a key record through an index tree, which provides concurrent accesses of record keys by different transactions, is disclosed;
Mohan, "ARIES/KVL: A Key-Value Locking Method for Concurrency Control of Multiaction Transactions Operating on B-Tree Indexes", Proceedings of VLDB, August 1990, wherein index key value locking and the lock state replication via next key locking is discussed;
Mohan and Levine, "ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging", Proceedings SIGMOD Conference, June 1992, wherein index entry locking and left side propagation of the uncommitted first instance is discussed;
Mohan, "COMMIT_LSN: A Novel and Simple Method for Reducing Locking and Latching in Transaction Processing Systems", Proceedings of VLDB, August 1990, wherein the COMMIT_LSN idea is discussed; and Mohan, Haderle, Wang, and Cheng, "Single Table Access Using Multiple Indexes: Optimization, Execution and Concurrency Control Techniques", Proceedings 2nd International Conference on Extending Database Technology, Italy, March 1990, wherein index ANDing/ORing and re-evaluation of predicates are discussed.