The present invention is generally directed to data processing and in particular to a data processing method and apparatus for generating a correct memory address from a character or digit string such as a record key.
Computers and computer networks have become so prevalent in modern society that it is easy to take their inner workings for granted. Computer users generally assume that any data entered into a computer or a network of computers will be stored and retrieved in an efficient way. Over the years, researchers have made sophisticated improvements on simple data storage and retrieval schemes that were designed by various data processing pioneers. While these sophisticated improvements have solved many problems, distributed or parallel processing architectures such as computer networks, multiprocessing systems, and the like still provide new challenges. As discussed in greater detail later herein, the present invention provides solutions to these new challenges.
One of the simple data storage and retrieval schemes designed by data processing pioneers is known as a basic hashing scheme. The basic hashing scheme has been used in many areas of data processing such as database addressing and data encryption. The basic hashing scheme will now be introduced with reference to the particular example of database addressing.
Data are commonly stored in an organized manner known as a database under control of a data processor. By means of this data processor, a computer user is able to browse or search through the data, or to insert, amend or delete data as required. Although the computer user has no need to be aware of the exact manner in which data are stored, it is important for speed of operation that the data processor is able to locate and retrieve a particular data record as accurately and quickly as possible. In practice, this requires that the data processor is able to locate a desired data record directly, without having to search through the entire store of data.
In general, each data record comprises two sections: a short record key identifying the record, and a remainder of information associated with that key. For example, a particular database may use a person's name as the key to identify a larger portion of data containing further details about that person. For small data records, the entire record may be used as the key. From the information in the key alone, the data processor must be able to find the physical location in the stored data in order to access the data record corresponding to the key. For example, if the data is stored on a magnetic or optical disk, the track and sector address is required before a particular data record can be accessed. In the case of hardware storage in Random Access Memory, RAM, the memory address is required.
A direct key to memory address conversion scheme can be used to deduce a memory address for a particular location for a particular data record. The record's key is converted directly into a numerical key value, which in turn is scaled or otherwise manipulated to produce a data record address lying within the possible range of addresses appropriate to the storage means in use. When the data record is originally stored, it is stored at the address calculated as above from the record's key. Again, if subsequent access to the record is required, the data processor searches for the record at an address calculated from the record's key at the time at which the access is required.
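The direct conversion just described can be illustrated with a minimal sketch. The function name `direct_address`, the fixed key width, and the choice of reading the key bytes as a base-256 number are assumptions made for illustration only; the scheme itself does not prescribe any particular conversion or scaling.

```python
def direct_address(key: str, key_width: int, num_addresses: int) -> int:
    """Convert a record key directly into an address in 0 .. num_addresses - 1.

    Hypothetical conversion: the key is truncated or space-padded to
    key_width bytes, read as one base-256 integer, and scaled linearly
    into the available address range.
    """
    raw = key.encode("utf-8")[:key_width].ljust(key_width, b" ")
    key_value = int.from_bytes(raw, "big")   # numerical key value
    max_value = 256 ** key_width             # one past the largest key value
    return key_value * num_addresses // max_value


if __name__ == "__main__":
    for name in ("Jones", "Johnson", "Smith"):
        print(name, direct_address(name, key_width=8, num_addresses=1000))
```

Under these assumptions the same function is used both when a record is first stored and when it is later accessed, so the two calculations always agree. Keys that share their leading characters, such as "Jones" and "Johnson", receive the same address in this sketch.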
The direct key to memory address conversion scheme just described is fast in operation. However, it suffers from the disadvantage that the numerical key values are not usually uniformly distributed across the range of possible numerical key values. For example, if each key is a respective subject's personal name, it is clear that certain character strings of a common last name such as "Jones" will appear many times in the keys, while other character strings such as "ABCDEFG" are extremely unlikely to appear. The direct key to memory address conversion scheme leads to a bunching of data records around groups of adjacent memory addresses, while other memory addresses remain unused. This can lead to very inefficient use of the available storage space. To overcome this inefficiency, the basic hashing scheme is used to calculate a memory address from the record's key.
In accordance with the basic hashing scheme, the data record's key is "hashed", meaning that the record's key is converted into a near-random number by a mathematical function known as a hashing function. The near-random number is then scaled or otherwise manipulated as discussed previously to produce a data memory address lying within the possible range of addresses appropriate to the storage means in use. In accordance with the basic hashing technique, a previously known procedure for storing an individual data record is as follows:
a) The record's key is hashed to produce a near-random number. The near-random number identifies a data storage location, also known as a data storage "bucket".
b) The near-random number is used in conjunction with a look-up table to obtain a record address of the data storage location.
c) If the data storage location is not already completely filled with data records, the record is stored in the data storage location.
d) If however the data storage location is full, then what is known as a "collision" occurs. Consequently, the record is stored in a first available space in an overflow data storage location. A pointer is associated with the full data storage location to identify an address of the overflow data storage location in which the record is stored.
Similarly, when access to the stored record is subsequently required, the following steps are performed:
a) The record's key is again hashed to produce the near-random number.
b) The near-random number is again used in conjunction with a look-up table to obtain the memory address of the data storage location.
c) The data storage location is searched for the record.
d) If the record is not found in the data storage location, then the address of the overflow data storage location is retrieved and the overflow data storage location is searched for the record.
An unfortunate limitation of the basic hashing scheme is that this additional searching of the overflow data storage location may considerably slow down retrieval of the data record.
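The storing and retrieval procedures of the basic hashing scheme, and the extra search caused by a collision, can be sketched as follows. This is an illustrative sketch only: the class names, the bucket capacity of two records, the use of SHA-256 as the hashing function, and the chaining of more than one overflow location are all assumptions and are not taken from any particular prior implementation.

```python
import hashlib

BUCKET_CAPACITY = 2  # hypothetical number of records per storage location


class Bucket:
    def __init__(self):
        self.records = {}      # key -> remainder of the record
        self.overflow = None   # pointer to an overflow bucket, if any


class BasicHashFile:
    def __init__(self, num_buckets: int):
        # The look-up table maps a bucket number to a storage location.
        self.table = [Bucket() for _ in range(num_buckets)]

    def _bucket_number(self, key: str) -> int:
        # Hash the key to a near-random number, then scale it to the table.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.table)

    def store(self, key: str, data: str) -> None:
        bucket = self.table[self._bucket_number(key)]
        while True:
            if key in bucket.records or len(bucket.records) < BUCKET_CAPACITY:
                bucket.records[key] = data
                return
            # Collision: the location is full, so follow (or create)
            # the pointer to an overflow location.
            if bucket.overflow is None:
                bucket.overflow = Bucket()
            bucket = bucket.overflow

    def retrieve(self, key: str):
        """Return (data, locations_searched); data is None if not found."""
        bucket = self.table[self._bucket_number(key)]
        searched = 0
        while bucket is not None:
            searched += 1
            if key in bucket.records:
                return bucket.records[key], searched
            bucket = bucket.overflow
        return None, searched


if __name__ == "__main__":
    f = BasicHashFile(num_buckets=1)   # one bucket forces a collision
    for k in ("a", "b", "c"):
        f.store(k, k.upper())
    print(f.retrieve("a"))   # found in the primary location
    print(f.retrieve("c"))   # found only after searching the overflow
```

In this sketch the second value returned by `retrieve` counts how many storage locations had to be searched, which makes the retrieval penalty of an overflow location directly visible.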
This limitation was surpassed in 1980 by a researcher named Witold Litwin, who is also a coinventor of the present invention. Litwin avoided creation of any overflow memory locations and associated searching delays by developing a sophisticated improvement upon the basic hashing scheme. Litwin described his improved "linear hashing" scheme in a frequently referenced article published in 1980 in the Proceedings of the 6th International Conference on Very Large Databases at pages 212-223. The article is entitled "Linear Hashing: A New Tool For File And Table Addressing". Numerous benefits provided by the linear hashing scheme are discussed in detail in the article. Because the article provides helpful background information, it is hereby incorporated by reference.
In brief summary, the linear hashing scheme employs a central hash table comprising a sequential organization of a first memory location or "bucket", followed by a second memory location or "bucket", followed by a third memory location or "bucket", and so on up to and including a last memory location of the table, thereby yielding a total number, T, of memory locations. Linear hashing avoids creating and using any overflow memory locations by gradually increasing address space in response to each collision.
Address space is gradually increased in response to each collision by a process of "splitting" a respective predetermined one of memory locations or "buckets". In response to each collision, the respective predetermined memory location or "bucket" is "split" by creating a respective new memory location and moving approximately half of the records from the respective predetermined memory location to the respective new memory location. Splitting the predetermined memory locations or "buckets" takes place in a predetermined order so that the first storage location is split in response to a first collision. Next, the second storage location is split in response to a second collision, and so on, up to and including splitting the last storage location in response to a subsequent collision. A pointer keeps track of the next memory location to be split.
When all T memory locations have been "split", the table size is doubled to a new total number of memory locations, 2T, by annexing all of the new data storage buckets and incorporating them at the end of the table. The pointer is reset to zero and the splitting process begins over again in a similar manner as discussed previously herein. With each subsequent collision, the pointer is incremented to track the next storage location to be split. The memory locations are once again split in sequential order, once again doubling the table size, yielding another new total number of storage locations, 4T. In accordance with such linear expansion principles, the table continues expanding as long as required.
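The splitting order, the split pointer, and the doubling of the table can be sketched in a few lines. This is a simplified sketch and not Litwin's published algorithm verbatim: the class name, the bucket capacity, and the SHA-256 based hashing function are assumptions, and a bucket that overflows before its own turn to split simply holds the extra records in a Python list, a detail the summary above does not address.

```python
import hashlib


def h(key: str) -> int:
    """Hypothetical hashing function producing a near-random integer."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class LinearHashFile:
    def __init__(self, initial_buckets: int = 2, capacity: int = 2):
        self.t = initial_buckets   # table size T at the start of this round
        self.split_ptr = 0         # next bucket to be split
        self.capacity = capacity
        self.buckets = [[] for _ in range(initial_buckets)]

    def address(self, key: str) -> int:
        a = h(key) % self.t
        if a < self.split_ptr:     # bucket already split in this round
            a = h(key) % (2 * self.t)
        return a

    def insert(self, key: str, data: str) -> None:
        bucket = self.buckets[self.address(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:           # existing key: amend in place
                bucket[i] = (key, data)
                return
        bucket.append((key, data))
        if len(bucket) > self.capacity:   # collision triggers one split
            self._split()

    def _split(self) -> None:
        # Split the bucket at the pointer, not necessarily the full one.
        new_index = self.split_ptr + self.t
        keep, move = [], []
        for k, d in self.buckets[self.split_ptr]:
            target = move if h(k) % (2 * self.t) == new_index else keep
            target.append((k, d))
        self.buckets[self.split_ptr] = keep
        self.buckets.append(move)  # new bucket is annexed at the table end
        self.split_ptr += 1
        if self.split_ptr == self.t:      # all T buckets split: table is 2T
            self.t *= 2
            self.split_ptr = 0

    def lookup(self, key: str):
        for k, d in self.buckets[self.address(key)]:
            if k == key:
                return d
        return None


if __name__ == "__main__":
    f = LinearHashFile()
    for i in range(20):
        f.insert(f"key{i}", f"value{i}")
    print(len(f.buckets), f.t, f.split_ptr, f.lookup("key7"))
```

Note that the number of buckets always equals `t + split_ptr` in this sketch, so the table grows by one bucket per collision rather than doubling all at once.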
The linear hashing scheme provides many advantages; however, new challenges are created by multiprocessing systems. While the aggregate performance and memory size of a multiprocessing system are very high, the central hash table prevents the distributed processor architecture from achieving high levels of performance using the linear hashing scheme. A partial solution is provided by a distributed linear hashing scheme proposed in an article entitled "Distributed Linear Hashing And Parallel Projection In Main Memory Databases" by C. Severence et al., published in 1990 in the Proceedings of the Sixteenth International Conference on Very Large Data Bases, pages 674-682. Because this article provides helpful background information, it is hereby incorporated by reference. Severence et al. propose using the distributed linear hashing scheme on a tightly coupled multi-processor computer with a shared memory, known as a NUMA architecture multi-processor system.
The distributed linear hashing scheme employs many of the expansion principles of the linear hashing scheme. However, instead of using the central hash table of the linear hashing scheme, the distributed linear hashing scheme provides that each memory location or "bucket" includes a respective hash directory. Records are located by using the hash directories in conjunction with local file parameters called Local_M, which are stored in a respective cache memory of each processor. Another file parameter called Global_M accurately represents a current total number of memory locations or "buckets" in the shared memory, while the Local_M parameters may or may not accurately represent the current total number of storage locations in the shared memory. When any processor doubles the number of memory locations in accordance with the expansion principles of linear hashing discussed previously herein, the processor updates the file parameter called Global_M by doubling Global_M. As discussed in the first full paragraph of page 677 of the Severence article, the processor updating Global_M must also update all of the copies of Local_M in all of the other processors. Such global update requirements limit system efficiency, especially as the multi-processor computer is scaled up to include a very large number of processors. It should be particularly noted that simultaneous global updates to all processors require that all processors are simultaneously available, thereby precluding any continuous autonomy of the processors.
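The cost of the global update requirement can be made concrete with a small model. The sketch below is not the scheme of the Severence article; it models only the single statement above, that a processor which doubles the shared parameter must also refresh the cached copy held by every other processor. The class names and the `update_messages` counter are hypothetical.

```python
class Processor:
    def __init__(self, local_m: int):
        self.local_m = local_m   # cached copy of the bucket count


class SharedFile:
    def __init__(self, num_processors: int, initial_m: int):
        self.global_m = initial_m   # authoritative bucket count
        self.processors = [Processor(initial_m) for _ in range(num_processors)]
        self.update_messages = 0    # hypothetical cost counter

    def double(self, initiator: int) -> None:
        """The initiating processor doubles the file and updates every copy."""
        self.global_m *= 2
        for i, p in enumerate(self.processors):
            p.local_m = self.global_m
            if i != initiator:
                # Every other processor must be reached for each doubling.
                self.update_messages += 1


if __name__ == "__main__":
    f = SharedFile(num_processors=64, initial_m=4)
    f.double(initiator=0)
    f.double(initiator=5)
    print(f.global_m, f.update_messages)
```

In this model each doubling costs one update per remaining processor, so the update traffic grows in proportion to the number of processors, and every processor must be reachable at the moment of the update.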
Currently, more and more database applications are mission critical and require fast storage, subsequent searching, and retrieval of unpredictably large amounts of incoming data. Database operations such as storing records, searching for records, and retrieving records all require generating a correct memory address from a record key value. Accordingly, what is needed is a flexible and efficient data processing method and apparatus, which generates a correct memory address from a character or digit string such as a record key value, and which is adapted for use in distributed or parallel processing architectures such as computer networks, multiprocessing systems, and the like.