One of the most important resources within a data processing system is the amount of memory directly available for utilization by tasks during execution. Accordingly, much interest has been directed to efficient utilization of memory and memory management strategies. An important concept in memory management is the manner in which memory is allocated to a task, deallocated and then reclaimed.
Memory deallocation and reclamation may be explicit and controlled by an executing program, or may be carried out by another special-purpose program that locates and reclaims memory that is unused but has not been explicitly deallocated. "Garbage collection" is the term used in technical literature and the relevant arts to refer to a class of algorithms utilized to carry out storage management, specifically automatic memory reclamation. There are many known garbage collection algorithms, including reference counting, mark-sweep, and generational garbage collection algorithms. These, and other garbage collection techniques, are described in detail in a book entitled "Garbage Collection: Algorithms for Automatic Dynamic Memory Management" by Richard Jones and Rafael Lins, John Wiley & Sons, 1996. Unfortunately, many of the described techniques for garbage collection have specific requirements that cause implementation problems, as described herein.
A data structure may be located by a "reference", that is, a small amount of information that can be used to access the data structure. One way to implement a reference is by means of a "pointer" or "machine address", which uses multiple bits of information; however, other implementations are possible. General-purpose programming languages and other programmed systems often use references to locate and access data structures. Such structures can themselves contain data, such as integers or floating-point numbers, and references to yet other structures. In this manner, a chain of references can be created, each reference pointing to a structure which, in turn, points to another structure.
Garbage collection techniques determine when a data structure is no longer reachable by an executing program, either directly or through a chain of references. When a data structure is no longer reachable, the memory that the data structure occupies can be reclaimed and reused even if it has not been explicitly deallocated by the program. To be effective, garbage collection techniques should be able to, first, identify references that are directly accessible to the executing program, and, second, given a reference to a data structure, identify references contained within that structure, thereby allowing the garbage collector to transitively trace chains of references.
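The two capabilities just described can be illustrated with a minimal sketch. It assumes a toy heap modeled as a dictionary from object identifiers to the identifiers each object references; the names `reachable`, `heap`, and `roots` are hypothetical and chosen only for illustration.

```python
from collections import deque


def reachable(roots, heap):
    """Return the set of object ids reachable from the root references.

    `roots` plays the role of the references directly accessible to the
    program; `heap[obj]` lists the references contained within `obj`.
    Both are illustrative stand-ins, not part of any real collector.
    """
    seen = set()
    worklist = deque(roots)
    while worklist:
        obj = worklist.popleft()
        if obj in seen:
            continue  # already traced; also guards against cycles
        seen.add(obj)
        worklist.extend(heap[obj])  # follow the references inside obj
    return seen


heap = {
    "a": ["b"],
    "b": ["c"],
    "c": [],
    "d": ["c"],  # nothing refers to d, so d is unreachable
}
live = reachable(["a"], heap)
garbage = set(heap) - live  # memory that could be reclaimed
```

Here `d` is reclaimable even though it was never explicitly deallocated, because no chain of references starting at the root leads to it.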
A subclass of garbage collectors, known as "relocating" garbage collectors, relocates data structures that are still reachable by the executing program. Relocation of a data structure is accomplished by making a copy of the data structure in another region of memory, then replacing all reachable references to the original data structure with references to the new copy. The memory occupied by the original data structure may then be reclaimed and reused. Relocating garbage collectors have the desirable property that they compact the memory used by the executing program and thereby reduce memory fragmentation.
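As a rough sketch of relocation, the following assumes a toy memory in which each object is a list of fields and each field is explicitly labeled as a reference or an integer. The labels, the function name `relocate`, and the addresses are hypothetical; a real collector would work on raw memory.

```python
def relocate(roots, from_space):
    """Copy every reachable object into a new space and rewrite references.

    `from_space` maps an address to a list of (kind, value) fields, where
    kind is "ref" or "int". Returns (new_roots, to_space).
    """
    forwarding = {}  # old address -> new address
    to_space = {}
    next_addr = 0

    def copy(old):
        nonlocal next_addr
        if old in forwarding:  # already copied (handles cycles too)
            return forwarding[old]
        new = next_addr
        next_addr += 1
        forwarding[old] = new
        # Only fields labeled "ref" are followed and rewritten.
        to_space[new] = [
            (kind, copy(value)) if kind == "ref" else (kind, value)
            for kind, value in from_space[old]
        ]
        return new

    new_roots = [copy(root) for root in roots]
    return new_roots, to_space


from_space = {
    100: [("int", 7), ("ref", 300)],
    200: [("int", 100)],  # unreachable; its integer merely looks like an address
    300: [("ref", 100)],
}
new_roots, to_space = relocate([100], from_space)
```

After the call, the two reachable objects occupy the adjacent addresses 0 and 1, and the object at 200 is simply not copied. The integer field holding 100 is left alone because it is not labeled as a reference.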
Because relocating garbage collectors modify references during the garbage collection process, it is important that references be identified and distinguished from non-reference information, such as data, which cannot be modified for garbage collection purposes. Consequently, fully relocating garbage collectors belong to a subclass of garbage collection methods, known as "exact" garbage collectors, which require knowledge of the location of references or "live" pointers so that these can be modified or followed during the garbage collection process.
In order to positively identify references, some computing systems use a "tagged" representation for all memory locations. In such systems, references and primitive data, such as integers and floating-point numbers, are represented in memory in such a manner that a reference always has a different bit pattern than a primitive value. This is generally done by including tag bits in each memory location in addition to the bits holding the memory location value. The tag bits for a memory location holding a reference value are always different from the tag bits for a memory location holding a datum value.
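One possible tagging scheme can be sketched as follows. It assumes 32-bit words, two low-order tag bits, and word-aligned addresses; the tag values and function names are illustrative assumptions rather than the layout of any particular system.

```python
WORD_MASK = 0xFFFFFFFF            # assume 32-bit memory words
TAG_BITS = 2                      # assumed number of tag bits
TAG_MASK = (1 << TAG_BITS) - 1
TAG_INT = 0b00                    # hypothetical tag for integers
TAG_REF = 0b01                    # hypothetical tag for references


def tag_int(value):
    """Pack a signed integer into a word; only 30 payload bits remain."""
    if not -(1 << 29) <= value < (1 << 29):
        raise OverflowError("integer does not fit in 30 bits")
    return ((value << TAG_BITS) | TAG_INT) & WORD_MASK


def tag_ref(address):
    """Pack a 4-byte-aligned address into a word with the reference tag."""
    if address & TAG_MASK:
        raise ValueError("address must be 4-byte aligned")
    return (address | TAG_REF) & WORD_MASK


def is_ref(word):
    """A collector can classify any word by inspecting its tag bits."""
    return (word & TAG_MASK) == TAG_REF


def untag_int(word):
    """Recover the signed 30-bit integer stored in a tagged word."""
    payload = word >> TAG_BITS
    return payload - (1 << 30) if payload & (1 << 29) else payload
```

The cost of this design is visible in `tag_int`: the tag bits are taken from the word itself, so the integer range shrinks.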
Other computer systems use an "untagged" data representation in which the entire memory word is devoted to representing the datum value. In such systems, the same bit pattern might represent a reference or a primitive value. The distinction between references and primitive values can sometimes be made from external considerations or representations, such as the instruction that is to operate on the data, or the position of the data within an object. However, the use of external considerations to make this distinction is not possible in all systems.
For example, the Java programming language was originally designed for use in systems using untagged data representation. The Java programming language is described in detail in the text entitled "The Java Language Specification" by James Gosling, Bill Joy and Guy Steele, Addison-Wesley, 1996. The Java language was designed to run on computing systems with characteristics that are specified by the Java Virtual Machine Specification, which is described in detail in a text entitled "The Java Virtual Machine Specification" by Tim Lindholm and Frank Yellin, Addison-Wesley, 1996.
According to the Java Virtual Machine (JVM) Specification, a local variable or stack slot in a computing system using 32-bit memory words may contain either a 32-bit integer, a 32-bit floating-point number, or a 32-bit reference. Consequently, tagged data representation cannot be used in all cases (programming languages that use tagged data representation on 32-bit computer architectures typically restrict the size of integers to 30 bits). Further, in many cases, it is not possible to distinguish references from data by examining the Java instructions, because many instructions operate indiscriminately on references and data. Therefore, other methods must be used to locate the live pointer information on the program stack.
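The ambiguity of an untagged 32-bit slot can be shown with a small sketch that reads one bit pattern three ways. The function name `interpretations` and the sample word are illustrative choices only.

```python
import struct


def interpretations(word):
    """Show three readings of the same untagged 32-bit word."""
    raw = struct.pack("<I", word)  # the 32 bits as stored in the slot
    return {
        "int": struct.unpack("<i", raw)[0],    # signed 32-bit integer
        "float": struct.unpack("<f", raw)[0],  # IEEE 754 single precision
        "reference": hex(word),                # a plausible machine address
    }


views = interpretations(0x42280000)
# The same bits are the integer 1109917696, the float 42.0,
# and a value that could equally be the address 0x42280000.
```

Nothing in the word itself says which reading is intended, which is why a collector examining such a slot needs information from elsewhere.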
To further complicate the process of locating live pointer information, many garbage collection algorithms, such as mark-sweep, relocating, and generational collectors, operate by halting the ongoing computation, running a specialized garbage collection program, and then resuming the ongoing computation. With these collectors, it is necessary to obtain the live pointer information on the program stack at the program code boundary at which the ongoing computation is stopped to perform garbage collection. In the following discussion, the term "bytecode" will be used to describe a program code word. This corresponds to the case where program code operands are one byte long; however, the invention applies to systems where the program codes have other lengths as well, and the term "bytecode" is not intended to be limiting. In a JVM, the ongoing computation may be stopped at many bytecode boundaries, so the problem of determining the live pointer information is complex.
Since the change in the live pointer information on the stack frame due to the operation of a particular bytecode can be calculated in many instances, one method of obtaining the live pointer information when garbage collection is needed is to start with the stack configuration at the beginning of a method and calculate, bytecode by bytecode, the change in the live pointer information for each method until the bytecode boundary at which garbage collection is to take place is reached. However, at best, such an approach is time-consuming and will lead to a large time delay at the beginning of garbage collection. In some cases, such a calculation may not be possible, either after the bytecode has been executed or because bytecode substitutions have been made. For example, some implementations of the Java virtual machine substitute "quick" bytecodes for some instructions under certain circumstances, and it may not be possible to compute the live pointer changes with some quick bytecodes.
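The on-demand calculation can be sketched with a deliberately tiny instruction set. The four operation names below are hypothetical and are not real JVM opcodes, and the sketch assumes straight-line code with no branches.

```python
def stack_map_at(code, boundary):
    """Simulate `code` from method entry up to `boundary`.

    Returns one boolean per operand-stack slot, True where the slot
    holds a reference. Work grows with the distance from method entry.
    """
    stack = []
    for op in code[:boundary]:
        if op == "push_int":
            stack.append(False)
        elif op == "push_ref":
            stack.append(True)
        elif op == "pop":
            stack.pop()
        elif op == "dup":
            stack.append(stack[-1])  # the copy has the same kind as the original
        else:
            # Mirrors the case where an instruction's effect is unknown.
            raise ValueError(f"cannot compute stack effect of {op!r}")
    return stack


code = ["push_ref", "push_int", "dup", "pop", "push_ref"]
at_three = stack_map_at(code, 3)
at_end = stack_map_at(code, len(code))
```

Every request replays the method from its first instruction, which illustrates the delay described above. An instruction with an unknown effect stops the calculation altogether.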
Another method for generating the required live pointer information is to precompute the live pointer information for each possible bytecode in a program in advance of program operation and store a map or "mask" indicating the location of the live pointers on the program stack for each bytecode. This mask computation might be performed during compilation or program load before the program is actually executed. Then, when garbage collection is initiated, the stored mask information corresponding to the selected bytecode boundary can be retrieved and used to determine the location of live pointers.
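A minimal sketch of the lookup side of this approach follows. It assumes the masks have already been computed and stored as integers, with bit i set when stack slot i holds a reference; the table contents and the names `MASKS` and `live_pointers` are made up for illustration.

```python
# Hypothetical precomputed table: bytecode index -> bit-vector mask.
# Bit i of a mask is 1 when stack slot i holds a live pointer.
MASKS = {
    0: 0b000,
    1: 0b001,
    2: 0b001,
    3: 0b101,
}


def live_pointers(stack_words, bytecode_index, masks=MASKS):
    """Select the stack words that the stored mask marks as references."""
    mask = masks[bytecode_index]
    return [
        word
        for slot, word in enumerate(stack_words)
        if (mask >> slot) & 1
    ]


# Slots 0 and 1 hold the same bit pattern, but only slot 0 is a reference.
stack_words = [0x1000, 0x1000, 0x2000]
pointers = live_pointers(stack_words, 3)
```

The lookup itself is a single table access, but the table needs one entry for every bytecode boundary at which the computation might be stopped.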
The aforementioned technique eliminates the delay required to compute the live pointer information on demand, but requires a large space overhead. For example, the live pointer information can be represented as a bit vector with one bit for each stack item. Using this representation, as a test, the live pointer information was precomputed for a sample program run. During this run, 206,034 program bytecodes were loaded and 693,288 bits were required to store the precomputed live pointer masks (the latter figure does not include data structures that would be necessary to store and retrieve the bits). In many contexts, such a space overhead would be prohibitive.
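The figures quoted above can be put in per-bytecode terms with a short calculation. The variable names are illustrative. The final ratio rests on an additional assumption not stated in the text, namely that each loaded bytecode occupies one byte.

```python
bytecodes_loaded = 206_034   # from the sample run described above
mask_bits = 693_288          # bits of precomputed live pointer masks

bits_per_bytecode = mask_bits / bytecodes_loaded  # about 3.36 bits
mask_bytes = mask_bits // 8                       # 86,661 bytes of raw masks

# Assumption: one byte per bytecode. Under that assumption the raw masks
# alone add roughly 42% to the size of the loaded code, before counting
# any structures needed to store and retrieve them.
overhead_ratio = mask_bytes / bytecodes_loaded
```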
Accordingly, there is a need for a technique to locate live pointers in the active stack frame on computer systems that do not accommodate tagged data representations, without requiring an on-demand computation of the live pointer information and without requiring live pointer information to be stored for all bytecodes.