1. Field of the Invention
The present invention relates to enhancing data access and control in an improved data processing system. Still more particularly, the present invention provides for resource optimization in data processing systems via a stable hashing mechanism.
2. Description of Related Art
Effective data management requires efficient storage and retrieval of data. A variety of techniques for information storage and retrieval are well known, including a technique known as “hashing”. In a typical hashing implementation, an input datum is used as a key for retrieving information associated with the key. The information is stored in a data structure, which is usually some form of table. Given a specific key, a hash function computes an index into the table based on the key, and the associated information is then stored at or retrieved from the location indicated by the computed index. In other words, the hash function “hashes” the key, in effect mapping the key value to an index value. Many keys may map to the same computed index, which may cause collisions in some implementations, and these collisions may be resolved using a variety of well-known techniques. If the hashing function is easily computable and provides an even distribution of the keys across the range of mapped values, then hashing may provide an efficient storage mechanism.
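By way of illustration only, the typical hashing arrangement described above may be sketched as follows. The sketch assumes string keys, a CRC-32 checksum as the hash function, and separate chaining as the collision-resolution technique; the class name ChainedHashTable and its methods are hypothetical and do not appear in the related art being described.

```python
import zlib
from typing import Any, List, Tuple


class ChainedHashTable:
    """Illustrative fixed-size hash table; collisions are resolved by chaining."""

    def __init__(self, size: int = 8) -> None:
        self.size = size
        # One bucket (a list of key/value pairs) per table index.
        self.buckets: List[List[Tuple[str, Any]]] = [[] for _ in range(size)]

    def index_for(self, key: str) -> int:
        # The hash function maps the key to an index in the range [0, size).
        return zlib.crc32(key.encode("utf-8")) % self.size

    def put(self, key: str, value: Any) -> None:
        bucket = self.buckets[self.index_for(key)]
        for position, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[position] = (key, value)  # overwrite an existing entry
                return
        bucket.append((key, value))  # new key, possibly colliding with others

    def get(self, key: str) -> Any:
        for existing_key, value in self.buckets[self.index_for(key)]:
            if existing_key == key:
                return value
        raise KeyError(key)


if __name__ == "__main__":
    table = ChainedHashTable(size=4)
    table.put("alpha", 1)
    table.put("beta", 2)
    print(table.get("alpha"), table.get("beta"))
```

Because the computed index depends on the table size, keys that share an index are simply appended to the same bucket, and retrieval scans only that bucket.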
All information storage and retrieval techniques strike a balance between the amount of storage resources that are used to store the information and the speed with which the information is stored in and retrieved from the storage resources. In the typical storage implementation noted above, hashing provides a methodology in which an increased amount of storage can be used to increase access speed. A typical hashing implementation leaves some of its allocated storage unused in return for fast storage and retrieval of information. While a hash function should distribute the keys across the entire range of table indices, a typical hash implementation does not ensure that all hash table entries are used in any given period of time.
In general, hashing can be interpreted as providing a methodology for mapping a source identifier (ID) to a target ID in order to obtain an association between information identifiable by the source ID and other resources or information identifiable by the target ID. In other words, a hash function maps identifiers between ID spaces. The values input to a hash function can be viewed as representing entities in a source ID space, while the values output by a hash function can be viewed as representing entities in a target ID space. A properly chosen hash function can provide an efficient mechanism for mapping values from the source ID space to values in the target ID space.
Specifically, in a typical storage application, the target ID is an index into an entry in the hash table, and the hash table entry has previously been associated with a target resource. After mapping the key to the target ID, i.e. hash table index, the information in the entry of the hash table is used to determine a storage location for storing or retrieving information associated with the key. The location may be the hash table entry itself, or the hash table entry may have some type of pointer or other identifier that points to a storage location, object, or resource.
Viewed in this broad manner, a hash function allows information to, from, or about the source entity to be associated with information to, from, or about the target entity. Assuming that the target entity is some type of computational resource, the source ID becomes associated with a target ID, and the target entity identified by the target ID then performs some type of computational process on behalf of the entity represented by the source ID. The computational process is usually either a storage process or a routing process. In either case, a hash function can be viewed as assisting a type of distributional process.
Hash computations are frequently implemented for distributing computational resources. For example, it is desirable in Web-based applications to route requests from clients to servers so that, once a request is routed from a particular client to a particular server, all requests from that client will be routed to the same server. Given a unique ID for the particular client and a unique ID for the particular server, a hash computation may be employed to map incoming requests from the client to the same server for the duration of the client session. This type of process has been termed “hash routing” or “hash-based routing” and may be applied to a variety of Web-based applications, such as the caching of Web content in an array of cache servers.
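By way of illustration only, hash-based routing of the kind described above may be sketched as follows. The sketch assumes that clients and servers are identified by strings and that a truncated SHA-256 digest serves as the hash; the function names stable_digest and route, and the identifiers in the usage lines, are hypothetical.

```python
import hashlib
from typing import List


def stable_digest(identifier: str) -> int:
    """Return a 64-bit integer that is the same on every run and every platform."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route(client_id: str, servers: List[str]) -> str:
    """Map a client ID (source ID) to one server (target ID) from a fixed list."""
    return servers[stable_digest(client_id) % len(servers)]


if __name__ == "__main__":
    servers = ["server-a", "server-b", "server-c"]
    # Repeated requests from one client reach the same server
    # for as long as the server list is unchanged.
    print(route("client-42", servers))
    print(route("client-42", servers))
```

No per-client state is kept in this sketch: the affinity between client and server follows entirely from the hash computation, which holds only while the list of servers stays the same.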
Hashing functions are generally required to provide an even or fair distribution of source IDs over the target ID space in order to perform the distribution of computational resources. Otherwise, the distribution is clumpy and must be corrected or compensated for, which slows down the distribution computation and defeats a major advantage of employing a hashing function. In order to achieve an acceptable distribution of source IDs over the entire target ID space, a typical hashing implementation assumes that the set of target resources will remain unchanged, and hence, the target ID space is expected to be static.
The size of the target ID space is then used as a computational parameter in several aspects because of this assumption about the static nature of the target ID space. For instance, an initial hash table may be allocated at a predetermined size that matches the expected size of the target ID space, and the expected size of the target ID space is also used as a parameter within a hash function. By assuming that the predetermined size of the hash table matches the size of the target ID space, a typical hash function can be assured that it fairly distributes the source IDs over the target ID space if it fairly distributes the source IDs over the hash table.
At some point in time, though, it may be determined that the capacity of the hash table should be increased or decreased to accommodate a different target ID space. If the size of the hash table is changed, then the parameter within the hash function that represents the size of the target ID space must also be changed, thereby changing the behavior of the hash function in mapping the source IDs over the newly defined target ID space.
In order to maintain the integrity of the entire process, the source IDs must be remapped to different hash table indices using the newly defined hash function, eventually resulting in the previous hash table being replaced by a new hash table. Hence, resizing the hash table causes a large performance penalty to be paid when the target ID space is changed. Most implementations of hashing algorithms assume that the set of target resources will remain relatively unchanged and accept a performance penalty when the set of target resources is changed.
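By way of illustration only, the remapping cost described above may be sketched as follows. The sketch assumes a chained table of string keys indexed by a CRC-32 checksum taken modulo the table size; the function names index_for, build_table, and resize are hypothetical.

```python
import zlib
from typing import Dict, List, Tuple

Bucket = List[Tuple[str, str]]


def index_for(key: str, table_size: int) -> int:
    # The table size is a parameter of the hash function itself.
    return zlib.crc32(key.encode("utf-8")) % table_size


def build_table(items: Dict[str, str], table_size: int) -> List[Bucket]:
    table: List[Bucket] = [[] for _ in range(table_size)]
    for key, value in items.items():
        table[index_for(key, table_size)].append((key, value))
    return table


def resize(old_table: List[Bucket], new_size: int) -> Tuple[List[Bucket], int]:
    """Rebuild the table at new_size; return it with the count of moved keys."""
    new_table: List[Bucket] = [[] for _ in range(new_size)]
    moved = 0
    for old_index, bucket in enumerate(old_table):
        for key, value in bucket:
            # Every stored key must be hashed again with the new size.
            new_index = index_for(key, new_size)
            if new_index != old_index:
                moved += 1
            new_table[new_index].append((key, value))
    return new_table, moved


if __name__ == "__main__":
    items = {f"key-{i}": f"value-{i}" for i in range(1000)}
    old_table = build_table(items, 8)
    new_table, moved = resize(old_table, 12)
    print(f"{moved} of {len(items)} keys changed index")
```

Every entry is visited and rehashed during the resize, so the work grows with the number of stored keys rather than with the size of the change.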
In many data processing systems, though, the amount of computational resources varies over time. Continuing the same example of client-to-server mapping, if a server fails or the overall capacity of the system changes, e.g., due to the addition of another server, the size of the target ID space also changes. In order for the system to be able to distribute the client requests evenly over the new set of servers, the hash computation must be able to map the client IDs, i.e. source IDs, evenly over the newly redefined server ID space, i.e. a larger or smaller number of target IDs. However, one would like to avoid a scenario in which all of the client IDs are remapped to different server IDs using a new hash function. Otherwise, subsequent requests from a particular client would no longer be routed to the same server that was receiving those requests prior to the redefinition of the server ID space. For example, if the servers perform caching operations for clients, one desires to maintain an affinity between a particular client's requests and a particular server in order to efficiently cache information for the client in the server. If the client-to-server mapping is not stable, then the addition or removal of a server would be very disruptive, because most of the cached information would need to be reaccessed.
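By way of illustration only, the instability described above may be sketched by counting how many clients are assigned to a different server when one server is added. The sketch assumes the simple scheme in which a hash of the client ID is taken modulo the number of servers; the function names stable_digest, assign, and remapped_fraction, and the client identifiers, are hypothetical.

```python
import hashlib
from typing import List


def stable_digest(identifier: str) -> int:
    """Return a 64-bit integer that is the same on every run and every platform."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def assign(client_id: str, server_count: int) -> int:
    """Index of the server that receives this client's requests."""
    return stable_digest(client_id) % server_count


def remapped_fraction(client_ids: List[str], old_count: int, new_count: int) -> float:
    """Fraction of clients whose server index changes when the count changes."""
    changed = sum(
        1 for client_id in client_ids
        if assign(client_id, old_count) != assign(client_id, new_count)
    )
    return changed / len(client_ids)


if __name__ == "__main__":
    clients = [f"client-{i}" for i in range(10000)]
    fraction = remapped_fraction(clients, 4, 5)
    print(f"fraction of clients remapped when growing from 4 to 5 servers: {fraction:.2f}")
```

Under the assumption of a uniformly distributed hash, a client keeps its server when going from four to five servers only if its hash value leaves the same remainder on division by four and by five, which holds for about one value in five; roughly four fifths of the clients are therefore expected to move, even though only one server was added.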
Hence, the use of hashing techniques may be impractical when the number of mapped computational resources varies over time. In some solutions, compensation mechanisms and rules have been implemented. Other solutions have involved coordination across mapping points, such as shared mapping tables. These solutions can be complex, difficult to implement, and insufficiently scalable. For example, the performance of one type of caching algorithm, the Cache Array Routing Protocol (CARP) algorithm, degrades greatly as the number of caching servers increases.
Therefore, it would be advantageous to provide a method and apparatus in which a hashing mechanism remains stable while the availability of computational resources varies over time, e.g., the mechanism is stable with respect to the assignment of IDs to servers in a dynamically varying set of servers. It would be particularly advantageous if the hashing mechanism had wide applicability to a variety of computational problems with consistent results when executed on a variety of computer platforms.