A. Field of the Invention
This invention generally relates to memory management for computer systems and, more particularly, to a method and apparatus for managing hashed objects.
B. Description of the Related Art
(1) The Java™ programming language "hashCode" Method
The Java™ programming language is an object-oriented computer programming language. The language is described in many texts, including one that is entitled "The Java Language Specification" by James Gosling, Bill Joy, and Guy Steele, Addison-Wesley, 1996. Java and all Java-based trademarks are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. In an object-oriented system, such as one or more related programs written in this language, a "class" provides a template for the creation of "objects" (which represent items or instances manipulated by the system) having characteristics of that class. The term template denotes that the objects (i.e., data items) in each class share certain characteristics or attributes determined by the class. Objects are typically created dynamically during system operation. Methods associated with a class generally operate on the objects of the same class. The Java language was designed to run on computing systems with characteristics that are specified by the Java Virtual Machine (JVM) Specification. The JVM specification is described in detail in a text entitled "The Java Virtual Machine Specification", by Tim Lindholm and Frank Yellin, Addison-Wesley, 1996.
The Java language defines a method referred to as "hashCode()" in the topmost class Object, which defines properties shared by all objects in a Java language system. The hashCode() method causes the JVM, or some other runtime system that supports the Java language, to return an integer hash value corresponding to a selected object. The Java language hashCode() operation has the following properties: (1) It ensures that an object's hash value remains constant throughout its life-time. (2) It ensures a good distribution of hash values such that different objects to a great extent have different hash values. Using such a hashCode() operation, programmers can build data structures, known as hash tables, that support efficient lookup (search and retrieval) of data. Hash tables require the former property to ensure correctness and the latter property to achieve high efficiency.
For example, a hashCode() implementation that always returns "0", while allowing correct operation of hash tables, would be a poor choice because the lack of distribution degrades performance of hash table lookup operations. An alternative implementation of the hashCode() operation may use the address of the object as its hash code. Because different objects are located at different addresses in memory, this implementation provides excellent distribution. However, it only applies to systems where hashed objects never move, because if an object moved to a different address, its hash code would change, violating the first property.
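By way of illustration only, the constant-hash case described above may be sketched as follows. The class name ConstantHashDemo, the nested Key type, and the roundTrip method are hypothetical names introduced for this sketch and do not appear in any cited text. The sketch assumes the standard java.util.HashMap as the hash table.

```java
import java.util.HashMap;
import java.util.Map;

final class ConstantHashDemo {
    /** Hypothetical key type whose hashCode() always returns 0. */
    static final class Key {
        final int id;
        Key(int id) { this.id = id; }

        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }

        // Legal, and constant over the object's life-time, but every key collides.
        @Override public int hashCode() { return 0; }
    }

    /** Inserts n keys, looks each one up again, and returns the number of successful lookups. */
    static int roundTrip(int n) {
        Map<Key, Integer> table = new HashMap<>();
        for (int i = 0; i < n; i++) {
            table.put(new Key(i), i);
        }
        int found = 0;
        for (int i = 0; i < n; i++) {
            Integer v = table.get(new Key(i));
            if (v != null && v == i) {
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Every lookup succeeds, so the table remains correct despite the collisions.
        System.out.println(roundTrip(1000));
    }
}
```

All lookups succeed, which reflects the correctness property. Because every key falls into the same bucket, the efficiency property is lost.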
In most object-oriented languages, it is possible to override the virtual machine hashCode() method for specific classes. For example, the String class in the Java language overrides hashCode() to compute a hash value based on the characters in the string. Thus, the hashCode() operation on two "equal" strings that contain the same characters in the same order will produce the same hash value.
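The following minimal sketch illustrates the overriding behavior described above, using only the standard String.hashCode() and System.identityHashCode() calls. The class name StringHashDemo and its method names are hypothetical and introduced for illustration.

```java
final class StringHashDemo {
    /** Two distinct String objects with the same characters report the same hash value. */
    static boolean equalStringsShareHash() {
        String a = new String("hash"); // new String(...) forces a distinct object
        String b = new String("hash");
        return a != b && a.equals(b) && a.hashCode() == b.hashCode();
    }

    /**
     * System.identityHashCode returns the value the default Object.hashCode()
     * would return, even for classes that override hashCode(); it is
     * constant for a given object.
     */
    static boolean identityHashIsStable(Object o) {
        return System.identityHashCode(o) == System.identityHashCode(o);
    }

    public static void main(String[] args) {
        System.out.println(equalStringsShareHash());
        System.out.println(identityHashIsStable(new Object()));
    }
}
```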
For most programs the vast majority of objects are never hashed using the virtual machine's implementation of hashCode(). First, not all programs use hash tables, and those that do use hash tables do not include each and every object in a hash table. Second, as explained above, important classes of objects override the virtual machine's hashCode() implementation.
The desirable properties of the virtual machine hashCode() operation are thus summarized as follows:
1. It must ensure that a hash value for an object remains constant throughout the life-time of the object.
2. It must provide a good distribution of hash values.
3. It must be efficient to compute hash values.
4. It must maintain low storage requirements.
Since most objects are never hashed, to ensure low space overhead it is essential to use as little space as possible for non-hashed objects. The problem is, of course, that it is typically difficult to predict whether the hashCode() operation will be applied to a given object.
(2) Garbage Collection
One of the most important resources within a data processing system is the amount of memory directly available for utilization by tasks during execution. Accordingly, much interest has been directed to efficient utilization of memory and memory management strategies. An important concept in memory management is the manner in which memory is allocated to a task, deallocated and then reclaimed.
Memory deallocation and reclamation may be explicit and controlled by an executing program, or may be carried out by another special purpose program which locates and reclaims memory which is unused, but has not been explicitly deallocated. "Garbage collection" is the term used in technical literature and the relevant arts to refer to a class of algorithms utilized to carry out storage management, specifically automatic memory reclamation. There are many known garbage collection algorithms, including reference counting, mark-sweep, and generational garbage collection algorithms. These, and other garbage collection techniques, are described in detail in a book entitled "Garbage Collection, Algorithms For Automatic Dynamic Memory Management" by Richard Jones and Raphael Lins, John Wiley & Sons, 1996.
An object may be located by a "reference", or a small amount of information that can be used to access the object. One way to implement a reference is by means of a "pointer" or "machine address," which uses multiple bits of information; however, other implementations are possible. General-purpose programming languages and other programmed systems often use references to locate and access objects. Such objects can themselves contain references to data, such as integers or floating-point numbers, and to yet other objects. In this manner, a chain of references can be created, each reference pointing to an object which, in turn, points to another object.
Garbage collection techniques determine when an object is no longer reachable by an executing program, either directly or through a chain of pointers. When an object is no longer reachable, the memory that the object occupies can be reclaimed and reused even if it has not been explicitly deallocated by the program. To be effective, garbage collection techniques should be able to, first, identify references that are directly accessible to the executing program, and, second, given a reference to an object, identify references contained within that object, thereby allowing the garbage collector to transitively trace chains of references.
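The two capabilities described above, identifying directly accessible references and following the references contained in each object, may be sketched as a simple transitive trace. This is a simplified illustration and not the algorithm of any particular collector. The names ReachabilityDemo, Node, reachable, and demoReachableCount are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class ReachabilityDemo {
    /** Hypothetical object model: each node holds references to other nodes. */
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    /** Returns every node reachable from the roots, directly or through a chain of references. */
    static Set<Node> reachable(List<Node> roots) {
        Set<Node> marked = Collections.newSetFromMap(new IdentityHashMap<>());
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node n = pending.pop();
            if (marked.add(n)) {          // first visit: follow its outgoing references
                pending.addAll(n.refs);
            }
        }
        return marked;
    }

    /** Builds a -> b -> c -> a (a cycle) plus an unreferenced d, then traces from a. */
    static int demoReachableCount() {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.refs.add(b);
        b.refs.add(c);
        c.refs.add(a);
        d.refs.add(a);                    // d refers to a, but nothing refers to d
        return reachable(Collections.singletonList(a)).size();
    }

    public static void main(String[] args) {
        System.out.println(demoReachableCount()); // a, b, and c are reachable; d is garbage
    }
}
```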
A subclass of garbage collectors, known as "relocating" garbage collectors, relocates objects that are still reachable by an executing program. Relocation of an object is accomplished by making a copy of the object in another region of memory, then replacing all reachable references to the original object with references to the new copy. The memory occupied by the original object may then be reclaimed and reused. Relocating garbage collectors have the desirable property that they compact the memory used by the executing program and thereby reduce memory fragmentation, which is typically caused by non-compacting garbage collectors.
In systems that use a non-compacting garbage collector, objects never move. Once they are allocated at a certain address, they remain there until they become garbage. It is therefore possible to use the address of the object as its hash code. This solution is fast and has no space overhead. It also has good distribution of hash values. However, few implementations of object-oriented languages use non-compacting memory systems, since the induced fragmentation affects performance negatively.
(3) Handle-based memory systems
The original implementation of the Java programming language used indirect pointers or "handles" to refer to objects. Handles were introduced to allow easy relocation of objects during garbage collection. With handles, it is easy to move objects because there is only one direct pointer to each object: the one in its handle. In such handle-based memory systems, while the object address is non-constant over the life-time of the object and therefore cannot be used for hashing, the handle address remains constant. Thus, the hashCode() function returned the address of the handle.
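The handle indirection described above may be sketched as follows. This is a toy model under stated assumptions: handles are represented as indices into a fixed-size table, and object addresses are plain integers. The names HandleDemo and HandleTable, their methods, and the table size are hypothetical and do not reflect the layout of any actual virtual machine.

```java
final class HandleDemo {
    /** Toy handle table: a handle is an index; the slot holds the object's current address. */
    static final class HandleTable {
        private final int[] objectAddress = new int[16]; // assumed capacity for the sketch
        private int count = 0;

        /** Creates a handle for an object currently located at the given address. */
        int newHandle(int address) {
            objectAddress[count] = address;
            return count++;
        }

        /** Follows the single direct pointer held in the handle. */
        int deref(int handle) { return objectAddress[handle]; }

        /** Moving an object requires updating only its handle. */
        void relocate(int handle, int newAddress) { objectAddress[handle] = newAddress; }

        /** The handle never moves, so it can serve as the hash value. */
        int hashCodeOf(int handle) { return handle; }
    }

    public static void main(String[] args) {
        HandleTable table = new HandleTable();
        int h = table.newHandle(4096);
        int before = table.hashCodeOf(h);
        table.relocate(h, 8192);                              // the collector moves the object
        System.out.println(table.deref(h));                   // the address has changed
        System.out.println(before == table.hashCodeOf(h));    // the hash value has not
    }
}
```

The cost of this arrangement is the extra memory access in deref on every object access, which is the execution-efficiency concern that favors direct pointers.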
This implementation, like the object address implementation in non-compacting systems, is fast, has no space overhead, and gives a good distribution. However, other concerns, including execution efficiency, favor non-handle-based memory systems. Consequently, the use of an object's handle as its hash value is not desirable.
(4) Direct-pointer, no-handle memory systems
For performance reasons, one would want to implement object-oriented languages using compacting garbage collection algorithms that work with direct pointers. Consequently, hash codes must be implemented in a different way than by using addresses. Because objects move, their address is no longer an acceptable hash value, and there are no handles in systems using direct pointers.
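The reason an address fails as a hash value once objects move may be illustrated with a toy heap. The sketch below simulates addresses as integers and uses a simple sliding compaction instead of a real collector. The names RelocationDemo, ToyHeap, Obj, and SLOT are hypothetical, and the 8-unit object size is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class RelocationDemo {
    static final int SLOT = 8; // assumed size of every object in the toy heap

    static final class Obj {
        int address;                          // current simulated address
        Obj(int address) { this.address = address; }
        int addressHash() { return address; } // address used directly as the hash value
    }

    static final class ToyHeap {
        private final List<Obj> live = new ArrayList<>();
        private int next = 0;

        Obj allocate() {
            Obj o = new Obj(next);
            next += SLOT;
            live.add(o);
            return o;
        }

        /** Marks an object as garbage (removal is by identity; Obj does not override equals). */
        void discard(Obj o) { live.remove(o); }

        /** Slides surviving objects together so that no gaps remain. */
        void compact() {
            int addr = 0;
            for (Obj o : live) {
                o.address = addr;
                addr += SLOT;
            }
            next = addr;
        }
    }

    /** Returns the address-based hash of one object before and after a compaction. */
    static int[] hashBeforeAndAfter() {
        ToyHeap heap = new ToyHeap();
        Obj a = heap.allocate();
        heap.allocate();
        Obj c = heap.allocate();
        int before = c.addressHash();
        heap.discard(a);
        heap.compact();                       // c slides down into the freed space
        return new int[] { before, c.addressHash() };
    }

    public static void main(String[] args) {
        int[] h = hashBeforeAndAfter();
        System.out.println(h[0] + " -> " + h[1]);
    }
}
```

The hash value of the surviving object changes across the compaction, which violates the requirement that a hash value remain constant throughout the life-time of the object.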
The known implementation of the object-oriented language called "Self" illustrates the most common solution. The Self implementation reserves 22 bits in a header word of every object to hold its hash value. The bits are set to a 22-bit pseudo-random value. Since most objects are never hashed, the computation of hash values is done on demand: the Self implementation does not assign the 22-bit hash value at object allocation time, but defers it until the object is first hashed. While this slows down all hash retrieval operations by an extra test, such a solution is acceptable because object allocation is more frequent than hashing for most programs.
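An on-demand header hash of the kind described above may be sketched as follows. This is not the Self source code. The sketch assumes a 32-bit header whose low 22 bits hold the hash, and it assumes that the value 0 marks a hash that has not yet been assigned; both the bit position and the sentinel are assumptions made for illustration. The names HeaderHashDemo, Obj, HASH_BITS, HASH_MASK, and hashOf are hypothetical.

```java
import java.util.Random;

final class HeaderHashDemo {
    static final int HASH_BITS = 22;
    static final int HASH_MASK = (1 << HASH_BITS) - 1;  // low 22 bits of the header word
    static final Random RNG = new Random();

    static final class Obj {
        // Assumed layout: upper 10 bits hold other metadata; lower 22 bits hold
        // the hash value, with 0 meaning "not yet assigned".
        int header;
    }

    /** Returns the object's hash value, assigning a pseudo-random one on first use. */
    static int hashOf(Obj o) {
        int h = o.header & HASH_MASK;
        if (h == 0) {                        // the extra test paid on every retrieval
            do {
                h = RNG.nextInt() & HASH_MASK;
            } while (h == 0);                // 0 is reserved as the sentinel
            o.header |= h;                   // the hash bits were 0, so OR stores the value
        }
        return h;
    }

    public static void main(String[] args) {
        Obj o = new Obj();                   // allocation does no hash-related work
        int first = hashOf(o);               // hash value assigned here
        System.out.println(first == hashOf(o));
    }
}
```

Allocation stays cheap because nothing is computed until hashOf is first called, at the price of one comparison on each retrieval.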
This solution also has fast retrieval and a reasonable distribution, although the hash values in the specific case of the Self implementation are compressed into a 22-bit space, even though computers provide larger word sizes, such as 32 bits. The solution also has low storage overhead, but only if the header word can accommodate the 22 bits required to hold the hash value. However, not all systems have spare bits in a header word, let alone the 22 bits required by the aforementioned Self implementation. If hashing necessitates adding an extra header word to every object, the space cost of such an implementation may not be acceptable.
Accordingly, there is a need for a system that provides a satisfactory, space-efficient hashCode() function for use with a compacting garbage collector in a system that does not use handles.