Record identifiers are used extensively within a database system to represent various kinds of links or references between records. For example, a record describing a banking transaction may contain a reference to the record of the customer who initiated the transaction, the record of the branch that handled the transaction, or other records. The references are typically implemented by storing the referenced record""s identifier in the referencing record.
In practice, many different types of identifiers are used to implement references. While some identifiers offer flexibility, others offer efficiency. For example, xe2x80x9clogical identifiersxe2x80x9d do not provide information about the actual physical location of a record in memory. Thus, logical identifiers require an additional and costly lookup to locate a record. Although inefficient, logical identifiers offer significant flexibility, since relocating records may be provided simply by updating the structure (e.g., B-tree or hash table) that maps identifiers to actual physical locations.
In contrast, identifiers that provide physical address information, such as actual disk addresses, are extremely efficient but lack flexibility. For example, physical address identifiers do not facilitate the movement of records and therefore do not accommodate file growth.
A partitioned or distributed environment such as a database or file adds a level of complexity that presents additional efficiency and flexibility challenges in assigning and looking up record identifiers. For example, providing an efficient lookup in a partitioned file is particularly challenging since an extra level of identifier translation may be required to determine the partition containing a record. Furthermore, there are greater flexibility requirements since there are more ways to reorganize a partitioned file in response to adding, dropping, or merging partitions.
Conventional methods, such as hashing and range methods, perform record addressing and data access in partitioned files. Generally, with hashing, an index into a hash table is created by applying a formula to each data record to produce a numeric identifier, generally referred to as a hash key or index key. A hash key is used to index into a table to find all records associated with the key. However, an additional method must be used to locate the desired record among the records that possess the same hash key. For example, a customer number could be used to identify the desired records among a set of records with a specified hash value.
Although highly flexible in accessing data, hashing is inefficient if there is a change in the number of file partitions. More specifically, to re-index the file it may be necessary to perform the costly and inefficient process of re-computing the hash value of each data record. Furthermore, re-computing each value is likely to move data records between partitions unnecessarily. When the number of partitions changes, it is costly to remap records to partitions while maintaining balanced partitioning where the records are distributed evenly between the partitions. Accordingly, hashing provides an inefficient and costly method for indexing and accessing data if the number of partitions of the file may change.
If a file is partitioned and the number of partitions changes, the range method of indexing and accessing data is similarly inefficient. Generally, with ranges, marks are set that assign contiguous values among partitions of a file. For example, if there are two partitions, all values having a first character within the alphanumeric range of xe2x80x9ca-mxe2x80x9d may be assigned to partition 1, and all values having a first character within the alphanumeric range of xe2x80x9cn-zxe2x80x9d may be assigned to partition 2. If a third partition is added, and there is a requirement to assign equal ranges among the partitions, each partition""s range will change. For example, partition 1 will acquire values of xe2x80x9ca-ixe2x80x9d, partition 2 will acquire values of xe2x80x9cj-rxe2x80x9d, and partition 3 will acquire values of xe2x80x9cs-yxe2x80x9d. Since the range changes affect each partition, it is likely that at least one half of the data records will move between partitions. Therefore, like hashing, the range method for indexing and accessing data is not efficient if files are partitioned and the number of partitions changes.
Thus, there is a need for a method of assigning identifiers to records and of subsequently locating a record given its identifier that provides a high level of flexibility and efficiency where the identifiers are applied to collections of uniformly structured records partitioned across multiple disks.
The described embodiment of the present invention introduces the creation and use of two record-addressing functions. One function xe2x80x9ccomputeRIDxe2x80x9d assigns identifiers to records, while the other function xe2x80x9clookupRIDxe2x80x9d performs the related task of locating a record given its identifier. Although the functions are highly flexible and efficient and may be applied to all types of database structures, the functions are preferably applied to environments having file organizations in which the files are collections of uniformly structured records that are partitioned across multiple disks. Example file organizations include a one-dimensional array of records such as Relative files by Tandem Computers, a division of Compaq, of Cupertino, Calif., a key sequence such as a B-tree, and a hash file table. The functions may also be applied to various types of partitioning methods. For example, the functions may be applied when records are assigned to a partition using either a round-robin method or hashing.
The two record addressing functions of the present invention possess the properties of efficient record lookup, easy map maintenance, and flexible record addressing. The functions provide efficient lookup with a straightforward approach that does not include physical identifiers. The complexity of the functions is on the order of n operations, where n is the number of partitions. Therefore the functions iterate once for each partition. Furthermore, the mapping is independent of the number of records actually in the file.
The addressing functions of the present invention promote easy map maintenance since each function may be applied to any type of file. There is no file-specific state to maintain such as a search structure that maps ranges of identifiers to partitions. Furthermore, a file may evolve over time, where the number of partitions grows and shrinks. The mapping changes automatically as the number of partitions grows and shrinks, thereby relieving a user of the burden of specifying how the mapping changes (e.g., range).
The addressing functions promote flexible record addressing, particularly when increasing or deleting the number of partitions in a file. For example, when a table with n partitions is grown to n+1 partitions, only every nth record from the original n partitions is remapped to the new partition. No other records need to be remapped from one of the original n partitions to another.
The properties described above tend to produce balanced partitions, or files in which records are spread uniformly over partitions. Furthermore, the addition or deletion of a partition is efficient, since only 1/n of the records must be moved to a new partition. Also, the user is not burdened with the task of determining which identifier range to map to a new partition to achieve balanced partitions since the function performs this task automatically.
In accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method for assigning identifiers to records, by a data processing system having a memory, comprising the steps of: receiving a partition number, an offset and a total number of partitions; repeatedly looping a number of times equal to the total number of partitionsxe2x88x921; the loop including: if the partition number is equal to a current loop value, determining a new partition number and performing a first offset compression to yield a new offset, and if the partition number is greater or less than the current loop value, performing a second offset compression to yield a new offset; and returning the offset as an identifier.
Furthermore in accordance with the purpose of the invention, as embodied and broadly described herein, the invention relates to a method for looking up a location of a record, by a data processing system having a memory, comprising the steps of: receiving an identifier and a total number of partitions; setting a partition number and an offset to values for a one partition file and setting a loop value to two; repeatedly, while loop value is less than or equal to the total number of partitions, selectively setting a new partition and performing a first offset compression whenever the offset is evenly divisible by the loop value and selectively performing a second offset compression whenever the offset divided by the loop value results in a remainder; and returning the partition number and the offset.