Conventional computer program validation techniques contemplate checking the "type" of the access requested against the "type" of the data item being accessed. (A data item is often referred to as an "object" or a "data object".) An analogy might be to describe the attempted access's type as a key, and the object's type as a lock. The computer program compiler and/or run-time support could check the lock (as defined by the object type of the data object) to prevent accesses to that data object by any attempted access whose key (as defined by the access type of the access) does not meet the requirements of that lock. For example, the C programming language defines many different object types, some examples being integer (int), floating point (float), or pointer (*). When an expression in the program attempts to access the object, the compiler checks the access type of the access against the object type of the object. An attempt to access an integer object with a floating point access should be detected as an error by the compiler, and flagged as such.
In the following discussion, a pointer is a program variable which contains the address of another variable and also possibly contains attributes describing the variable pointed to; a pointer can be used to derive an address (called a pointer value) to be used to access a data object located in computer memory. Generally, a pointer is also located in computer memory (either in storage or in a register). A pointer provides one level of indirection, in that rather than addressing the data item itself, a program can address the pointer, which in turn provides the pointer value used to address the data item.
The referent of a pointer is the variable (also called the data object or the object) whose memory address is contained in the pointer. The contents of a pointer can be copied into other pointers, and thus several different pointers to a single referent may exist at the same time.
The type of a pointer specifies certain attributes of the referent (e.g., specifying to the compiler that the referent is "integer" type as opposed to "floating point" type).
A memory access is, e.g., a read or a write to a referent.
The term dereference is used as a blanket term for any indirect memory access (i.e., a memory access through use of a pointer) of a referent--either through application of the dereference operator (e.g., `*` or `->` in C) to a pointer, or through indexing an array or pointer variable (e.g., `[]` in C).
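The terms defined above can be illustrated with a short C sketch. The function name sum_via_dereferences and the struct point type are assumptions introduced only for this illustration; the sketch shows two pointers sharing one referent, a write and several reads through those pointers, and each dereference form mentioned above.

```c
#include <assert.h>

struct point { int x; int y; };

/* Accesses several referents using each dereference form named in the text. */
int sum_via_dereferences(void)
{
    int value = 7;                 /* a referent */
    int *p = &value;               /* a pointer to that referent */
    int *q = p;                    /* a copy: a second pointer to the same referent */
    struct point pt = { 3, 4 };
    struct point *pp = &pt;
    int arr[3] = { 10, 20, 30 };

    assert(p == q);                /* both hold the same pointer value */
    *p = 8;                        /* memory access (write) through p */

    return *q                      /* dereference operator, reads 8 */
         + pp->y                   /* member access through a pointer, reads 4 */
         + arr[2]                  /* indexing an array, reads 30 */
         + q[0];                   /* indexing a pointer variable, reads 8 */
}
```

Because p and q refer to the same referent, the write through p is observed by the reads through q.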
Programming errors are costly, both in terms of time and money. Memory access errors are particularly troublesome. A memory access error is any dereference of a pointer or subscripted array reference which attempts to read or write storage outside of the referent. This attempted access can either be outside of the address bounds (also called the address space) of the referent, causing a spatial access error, or outside of the lifetime of the referent, causing a temporal access error. Indexing past the end of an array is a typical example of a spatial access error. A typical temporal access error is assigning to a heap allocation after it has been freed.
Memory access errors are possible in programming languages with arrays, pointers, local references, or explicit dynamic storage management and are an important class of errors to reliably detect. For example, in {Miller:90}, Miller et al. injected random inputs (a.k.a. "fuzz") into a number of UNIX utilities. On systems from six different vendors, nearly all of the seemingly mature programs could be coaxed into dumping core. The most prevalent errors detected were memory access errors. In {Sullivan:91}, Sullivan and Chillarege examined IBM MVS software error reports over a four year period. Nearly 50% of all reported software errors examined were due to pointer and array access errors. Furthermore, of these errors, 25% were temporal access errors.
Memory access errors are difficult to detect and fix because the effects of a memory access error may not manifest themselves except under exceptional conditions and, when they do occur, they may be difficult to reproduce. In addition, once the error is reproduced, it may be very difficult to correlate the program error to the memory access error.
Consider the following C function:
______________________________________
int FindToken(int *data, int count, int token)
{
    int i = 0, *p = data;
    while ((i < count) && (*p != token)) {
        p++;
        i++;
    }
    return (*p == token);  // error: this tests beyond data if no token is found
}
______________________________________
This function contains a latent memory access error in the return statement expression. In operation, the function will reference the word immediately following the array referenced by the pointer data if the array does not contain the token; if the word immediately following the array then does contain the token, the wrong value will be returned by return (*p==token);. To avoid this error, the expression return (*p==token); should be changed to return (i<count);.
This function illustrates the three difficulties in finding and fixing memory access errors. First, FindToken() will only produce an incorrect result if the word following the array referenced by data contains the same value as token (or is inaccessible storage). This event is unlikely if the word contains an arbitrary value. Second, if FindToken() creates an incorrect result, it will be difficult to recreate during debugging. The programmer will have to condition the inputs of the program such that the word following the array referenced by data once again contains the same value as token. If the value of the illegally accessed word is independent of the value of token, the probability of success will be very low. Third, correlating the visible errors of the program to the incorrect actions of FindToken() may be very difficult. This connection may be very subtle and may not be visible for a long period of time.
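For comparison, the correction suggested above for FindToken() can be sketched as follows. The name FindTokenFixed is an assumption used only to distinguish the sketch from the original function, and the assert() precondition is an addition that is not part of the original.

```c
#include <assert.h>
#include <stddef.h>

/* Corrected variant of FindToken(); the name FindTokenFixed is illustrative. */
int FindTokenFixed(int *data, int count, int token)
{
    int i = 0, *p = data;

    /* Added precondition (an assumption): a non-empty search needs an array. */
    assert(data != NULL || count <= 0);

    while ((i < count) && (*p != token)) {
        p++;
        i++;
    }
    /* Tests the index instead of *p, so the word past the array is never read. */
    return (i < count);
}
```

When no token is found, p ends one element past the array, which C permits as long as the pointer is not dereferenced; the return expression no longer dereferences it.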
Debugging can be viewed as an attempt to correlate a program fault to a program error. A program error is defined as an output of a program that is incorrect with respect to the specification of that program--this effect is what the users see. The program fault, on the other hand, is the initial incorrect condition (possibly many) that ultimately caused the error condition to occur. The primary goal of any good debugging environment is to detect errors and provide a good correlation between errors and faults. It is preferable to detect memory access errors immediately, thus creating perfect correlation between the error and the fault.
Many execution environments do provide some level of protection against memory access errors. For example, in most UNIX based systems, a store to the program text will cause the operating system to terminate execution of the program (usually with a core dump). UNIX typically provides storage protection on a segment granularity--the segments are the program text, data, and stack. Other, more hostile environments such as MS-DOS, do not offer such luxuries, and stores to the program text may or may not manifest themselves as a program error. If a program error does occur, correlating it to a fault may be difficult, if not impossible.
As programs become larger and more complex, there is a need for more sophisticated and comprehensive development tools to help the programmer "debug" these programs. In particular, there is a movement in the programming community towards "safe programming" techniques, languages and tools. Unfortunately, many of these safe programming techniques sacrifice the expressiveness otherwise available to the programmer using a programming language like C or C++.
One safe programming technique is to check the "spatial validity" of a particular pointer value used to access a particular data object, e.g., checking that the access goes to an address which is within the address bounds defined for that object. Any program which incorporates such spatial validity checks is said to exhibit "spatial safety". Ideally, the tools used by the programmer would check each attempted access for spatial validity, and would detect, flag, and identify any spatial error to the programmer so the error could be corrected.
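A spatial validity check of this kind might be sketched as follows. The struct checked_ptr layout and the function names are assumptions made for illustration, not the mechanism of any particular tool discussed here; the sketch records the referent's bounds alongside the pointer and refuses any read that falls outside them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "checked pointer": a position plus the bounds of its referent. */
struct checked_ptr {
    int      *base;    /* first element of the referent */
    size_t    count;   /* number of elements in the referent */
    ptrdiff_t index;   /* current position, relative to base */
};

/* Returns 1 if an access at the current position stays inside the referent. */
int spatially_valid(const struct checked_ptr *cp)
{
    return cp->index >= 0 && (size_t)cp->index < cp->count;
}

/* Checked read: stores the value in *out and returns 1, or returns 0
   (without touching memory) when a spatial access error is detected. */
int checked_read(const struct checked_ptr *cp, int *out)
{
    assert(cp != NULL && out != NULL);
    if (!spatially_valid(cp))
        return 0;
    *out = cp->base[cp->index];
    return 1;
}
```

Keeping the position as an index rather than a raw address lets the check itself avoid forming or comparing out-of-bounds pointers.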
Another safe programming technique is to check the "temporal validity" of using a particular pointer value to access a particular data object, e.g., checking that the data object is indeed allocated before it is written to, initialized before it is read, and is neither read nor written to after it has been freed or destroyed. Any program which incorporates such temporal validity checks is said to exhibit "temporal safety". Again, ideally, the programmer's tools would detect any temporal access error, and would flag the temporal error to the programmer so it could be corrected.
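The temporal conditions listed above can be pictured as a small state machine attached to each object. The following is a minimal sketch under stated assumptions: the enum obj_state values, the struct tracked_int descriptor, and both function names are hypothetical, and the descriptor is assumed to outlive the object it describes so that accesses after a free can still be examined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lifetime states for one tracked object. */
enum obj_state { OBJ_UNALLOCATED, OBJ_ALLOCATED, OBJ_INITIALIZED, OBJ_FREED };

struct tracked_int {
    enum obj_state state;
    int            value;
};

/* A write is valid only while the object is allocated (initialized or not). */
int tracked_write(struct tracked_int *obj, int v)
{
    assert(obj != NULL);
    if (obj->state != OBJ_ALLOCATED && obj->state != OBJ_INITIALIZED)
        return 0;                  /* temporal access error */
    obj->value = v;
    obj->state = OBJ_INITIALIZED;
    return 1;
}

/* A read is valid only after the object has been initialized and before
   it has been freed. */
int tracked_read(const struct tracked_int *obj, int *out)
{
    assert(obj != NULL && out != NULL);
    if (obj->state != OBJ_INITIALIZED)
        return 0;                  /* uninitialized read or use after free */
    *out = obj->value;
    return 1;
}
```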
There are times when a programmer would like to be notified at the moment when a data object is accessed, or informed of how many times a data object is accessed or of which pointer(s) were used to access a particular data object. One technique for providing this object-access information to the programmer is to instrument a program by adding watchpoints. Instrumenting a program inserts additional code into a program in order to perform some auxiliary task. However, providing such watchpoints to a flexible and expressive language such as C/C++ can be difficult and cumbersome, and can significantly slow the execution speed of the program.
One technique for creating a safe programming environment for C is to employ a reference-chaining technique. This technique is similar to that used by many "smart pointer" implementations {Edelson:91,Ginter:92}. The reference-chaining technique creates a reference chain for each data object in the computer system and "roots" (or otherwise associates) each reference chain with its data object. This technique then inserts, into the reference chain rooted at the referent, any pointer to that referent which is generated through the use of an explicit memory allocation (e.g., the malloc() function in the C language), a reference operator (e.g., the `&` operator in the C language), or an assignment (e.g., the `=` operator in the C language). When a pointer is later destroyed (e.g., through memory deallocation, assignment, or return from a procedure), this technique then removes the pointer from the reference chain.
The reference-chaining technique has a number of useful properties. First, it is possible to ensure temporal safety by destroying all pointer values on a referent's chain when a referent is freed (i.e., when the memory for that referent is deallocated)--simply by stepping down the reference chain of that referent, and assigning NULL to all pointer values. Second, if a "destructed" pointer value is the last value in the reference chain of a referent that is still allocated, a storage-leak error has occurred, and it can thus be detected immediately. (A storage leak is any area in storage to which the program can no longer generate a valid pointer (generating such a pointer is also called generating a "name" for the area). Storage-leak errors occur when the last accessible valid pointer to a heap object is overwritten. Without the ability to generate a name to the heap object, the heap object cannot be freed; hence it has "leaked" out of the heap.)
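A reference chain with these two properties might be sketched as follows. This is a simplified illustration rather than the implementation of the cited smart-pointer systems: the fixed-capacity array (MAX_REFS), the struct referent layout, and the chain_add, chain_remove, and chain_free names are all assumptions, and the referent is modeled as an ordinary struct instead of heap storage.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_REFS 8   /* illustrative fixed capacity */

struct referent {
    int    live;                          /* 1 while the object is allocated */
    size_t nrefs;                         /* pointers currently on the chain */
    int    payload;
    struct referent **chain[MAX_REFS];    /* addresses of those pointers */
};

/* Makes *slot point at obj and records it on obj's chain (0 if chain full). */
int chain_add(struct referent *obj, struct referent **slot)
{
    assert(obj != NULL && slot != NULL);
    if (obj->nrefs == MAX_REFS)
        return 0;
    obj->chain[obj->nrefs++] = slot;
    *slot = obj;
    return 1;
}

/* The pointer *slot is being destroyed. Returns 1 if it was the last pointer
   to a live referent (a storage-leak error), otherwise 0. */
int chain_remove(struct referent *obj, struct referent **slot)
{
    size_t i;
    assert(obj != NULL && slot != NULL);
    for (i = 0; i < obj->nrefs; i++) {
        if (obj->chain[i] == slot) {
            obj->chain[i] = obj->chain[--obj->nrefs];
            *slot = NULL;
            return obj->live && obj->nrefs == 0;
        }
    }
    return 0;   /* slot was not on this chain */
}

/* The referent is freed: step down the chain and assign NULL to every pointer. */
void chain_free(struct referent *obj)
{
    size_t i;
    assert(obj != NULL);
    for (i = 0; i < obj->nrefs; i++)
        *obj->chain[i] = NULL;
    obj->nrefs = 0;
    obj->live = 0;
}
```

After chain_free(), any later dereference of a chained pointer meets NULL instead of stale storage, which is the source of the temporal safety described above.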
Unfortunately, the reference-chaining technique cannot be made to work reliably in C. It is relatively easy for the programmer to subvert the checking mechanism through recasting and type-less calls to free(), the memory deallocation function. Detection of storage-leak errors also fails in the presence of circular references, where a chain of pointers-to-pointers eventually refers back to an earlier pointer. Additionally, the reference-chaining technique can be unreliable because it depends on tracking pointer values.
Researchers have recently proposed providing complete program safety through limiting the constructs allowed in the programming language. The main thrust of this work is to design programming languages that support garbage collection reliably and portably (i.e., in a manner in which an implementation can be re-used across several different programming languages or computer architecture platforms). For example, in "Safe:GC" {Safe:GC}, a safe subset of C++ is defined. The safe subset does not permit any invalid pointers to be created. For example, pointers cannot be created via explicit pointer arithmetic. If requested, the compiler can enforce safety within a module by ensuring that the programmer does not use any intrinsically unsafe operations. The safe subset requires that some amount of checking be performed.
In addition, languages which can easily be made totally safe have existed for a long time. For example, many FORTRAN implementations provide complete safety through range checking (e.g., {MIPS:F77}). However, as in Safe:GC, these languages tend to be less expressive than intrinsically unsafe languages such as C or C++.
A number of memory access checking tools are commercially available. For instance, Hastings and Joyce's "Purify" {Purify:92} uses a safe programming technique which is particularly easy to use because it does not require program source--all semantic changes to the program are applied to the object code. Purify supports both spatial- and temporal-access error checking to heap storage, through the use of a memory state map which is consulted at each load and each store that the program executes. Purify also provides uninitialized read detection and storage-leak error detection through a "conservative collector" {Boehm:93,Boehm:88} (described in more detail below). Certain heap spatial access errors are detected by bracketing both ends of any heap allocation with a "red zone". These zones are marked in the memory state map as inaccessible. If a load or store touches a red zone, then a memory access error is flagged. Temporal access errors are detected by setting the memory state of freed storage to "inaccessible".
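The memory state map and red zones described above can be approximated in a few lines of C. The sketch below is not Purify's implementation, which rewrites object code; it assumes a small static arena addressed by offset, a hypothetical bump allocator (rz_alloc), and hypothetical rz_store and rz_load functions standing in for the checks inserted at each store and load.

```c
#include <assert.h>
#include <stddef.h>

/* Per-byte states kept in the memory state map (zero means inaccessible). */
enum { ST_INACCESSIBLE = 0, ST_UNINITIALIZED, ST_INITIALIZED };

#define ARENA_SIZE 64
#define RED_ZONE   4
#define RZ_FAIL    ((size_t)-1)

static unsigned char arena[ARENA_SIZE];
static unsigned char state_map[ARENA_SIZE];  /* one state per arena byte */
static size_t next_free;

/* Bump allocator: leaves an inaccessible red zone on both sides of the block.
   Returns the offset of the first usable byte, or RZ_FAIL. */
size_t rz_alloc(size_t n)
{
    size_t start = next_free + RED_ZONE, i;

    if (n == 0 || n > ARENA_SIZE || start + n + RED_ZONE > ARENA_SIZE)
        return RZ_FAIL;
    for (i = 0; i < n; i++)
        state_map[start + i] = ST_UNINITIALIZED;
    next_free = start + n + RED_ZONE;
    return start;
}

/* Freed storage becomes inaccessible; this sketch never reuses it. */
void rz_free(size_t off, size_t n)
{
    size_t i;
    if (off >= ARENA_SIZE)
        return;
    for (i = 0; i < n && off + i < ARENA_SIZE; i++)
        state_map[off + i] = ST_INACCESSIBLE;
}

/* Checked store: returns 0 if the byte is in a red zone, freed, or unallocated. */
int rz_store(size_t off, unsigned char v)
{
    if (off >= ARENA_SIZE || state_map[off] == ST_INACCESSIBLE)
        return 0;
    arena[off] = v;
    state_map[off] = ST_INITIALIZED;
    return 1;
}

/* Checked load: additionally rejects reads of uninitialized bytes. */
int rz_load(size_t off, unsigned char *out)
{
    assert(out != NULL);
    if (off >= ARENA_SIZE || state_map[off] != ST_INITIALIZED)
        return 0;
    *out = arena[off];
    return 1;
}
```

The check only asks whether the touched byte is accessible, not which object the access was meant for, so an access that skips over a red zone into a neighboring live block would pass.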
Purify cannot detect all memory access errors. For example, errors caused by accessing past the end of an array into the storage region of the next variable cannot be detected, nor can errors caused by accessing storage that has been freed and then reallocated. These limitations occur because Purify does not determine the intended referent of memory accesses--it can only verify whether the accessed storage is "active". To increase the effectiveness of temporal access error checking, Purify "ages" the heap, i.e., holds freed storage in the "heap free list" longer than needed. This aging increases the storage requirements of programs that use the heap. In addition, although Purify is portable across programming languages (as long as each language is available on the given computer architecture platform for which Purify is implemented), it is not portable across platforms, and must be re-written for each platform on which it is desired.
Hastings' U.S. Pat. No. 5,193,180, issued Mar. 9, 1993 and assigned to Pure Software Inc., describes an implementation of the Purify technique. An object-code expansion program inserts new instructions and data between preexisting instructions and data of an object-code file; offsets are modified to reflect new positions of the preexisting instructions and data. The added instructions monitor substantially all memory accesses to check for the errors of writing to unallocated memory, and reading from unallocated or uninitialized memory. Dummy entries are added to the data section to aid in the detection of array-bounds violations and similar data errors. Furthermore, watchpoints can be established for more comprehensive monitoring.
Another safe programming technique is used in Steffen's "RTCC" {Steffen:92}. RTCC extends the functionality of the AT&T C language compiler "PCC" by adding spatial-error checking. RTCC attaches spatial object attributes to pointers and performs spatial access error checking. It does not, however, detect temporal access errors. In the implementation of RTCC, the issue of interfacing to library and system calls is addressed through "encapsulation"; Steffen also describes augmenting "sdb" (the UNIX system debugger) to provide users with transparent debugging support.
Another safe programming technique is used by "CodeCenter" {Kaufer:88}. CodeCenter is an interpreted C language environment. The checking provided is very rich--it detects many memory access errors, and also provides dynamic type-checking (i.e., the type of the last store to memory must match the type of subsequent loads from memory), uninitialized read detection, errant free detection, and other useful checks. (The following heap-deallocation actions are called errant frees: freeing storage which has been previously freed; freeing non-heap (global or stack) storage; freeing an invalid address (one that does not refer to valid storage); freeing heap storage using an interior pointer (a pointer that points inside the allocation, rather than to the start of the allocation).) Object attributes (namely, type and size) are attached to each data object in storage when it is initialized and, when a reference is made to storage, the base and size attributes that are associated with the referent storage are also attached to the pointer value. Using this information, CodeCenter provides complete coverage for spatial access errors. The method used for temporal access checking cannot, however, detect all attempts to access freed storage after it has been reallocated for another purpose nor can it detect errors when pointer references are made to local variables. In addition, CodeCenter has large resource requirements; since CodeCenter programs run in an interpreter, the slow execution speed may discourage its use, and in the case of long-running programs, may preclude its use entirely.
Another safe programming technique is used by "Integral C" {Ross:87}. Integral C is an integrated programming environment for the C language. The user interface is very similar to CodeCenter. Internally, however, it does not employ an interpreter. Instead, as the programmer/user updates the C code, the C code is incrementally compiled (at function granularity) into machine code. Like RTCC, Integral C attaches only base and bound attributes to pointer values, and thus it can only detect spatial access errors.
Yet another safe programming technique is used by Fischer and LeBlanc's "UW-Pascal" compiler {Fischer:80}. UW-Pascal supports both temporal and spatial access error checking, but while UW-Pascal detects all spatial access errors, certain temporal access errors may not be detected if storage is reallocated. Because UW-Pascal lacks mutable pointers (pointers which may be used as terms in arithmetic expressions, thus allowing their value to be arbitrarily manipulated by the program) and dynamically-sized arrays, however, its access checking is much easier to implement than the error checking of other techniques which handle these more expressive and flexible programming-language features.
This paragraph briefly summarizes properties of the above described commercially-available systems. The technique used in Purify operates on object-code files, performs an object-code translation, provides spatial checks limited to heap spatial access errors, provides temporal checks limited to heap temporal access errors, and has extensions that can detect errant frees, uninitialized reads, and storage-leak errors. The technique used in RTCC operates on C files, uses a safe compiler, provides spatial checks for all spatial access errors, but provides no temporal checks. The technique used in CodeCenter operates on C or C++ files, uses an interpreter, provides spatial checks for all spatial access errors, provides temporal checks for some temporal access errors, and has extensions that can detect errant frees, uninitialized reads, type errors, arithmetic errors, etc. The technique used in Integral C operates on C files, uses a safe compiler, provides spatial checks for all spatial access errors, but provides no temporal checks. The technique used in UW-Pascal operates on Pascal files, uses a safe compiler, provides spatial checks for all spatial access errors, provides temporal checks for some temporal access errors, and has extensions that can detect errant frees, union type errors, arithmetic faults, etc.
A closely related area of work, which can benefit from the safe programming technique described in the invention, is storage-leak error detection. For languages like C and C++, storage-leak error detection is commonly implemented with a "conservative collector" {Boehm:93,Boehm:88}. A conservative collector sweeps memory looking for unreferenced storage. Because it is difficult to know where all pointers are located, the collector makes the conservative assumption that all program-accessible (non-heap) storage contains pointers. It then uses a traditional mark-and-sweep collection method.
While effective, this method has some drawbacks. First, storage-leak error detection is not immediate--it is usually applied only when the programmer demands it or when the program completes execution. Thus, for it to be useful, some dynamic information (for instance, a partial call-chain) must be kept with allocations in order for the programmer to deduce the circumstances under which the storage-leak error occurred. Second, the conservative assumption (that all program-accessible (non-heap) storage contains pointers) can cause "false hits". (False hits occur when "random" non-pointer values, which seem to reference heap storage, are mistaken for pointer values.) False hits can hide an actual storage-leak error. For instance, it may appear to the conservative collector as though some random number is a pointer to an area of storage on the heap; in actuality the storage pointed to by the random number leaked from the heap when the last valid pointer to it was destroyed in error. This problem is aggravated by large storage allocations. In such allocations it is more likely that non-pointer values may randomly and inadvertently reference the allocated storage; unfortunately, it is these large storage-leak errors that the programmer would most like to find. Third, if the program hides pointers (for example, by encoding type information in the upper bits of the address in a pointer) or does not keep all pointers within the address bounds of memory allocations, then the conservative collector may not recognize a valid pointer, and thus may erroneously regard a piece of heap storage as having leaked from the heap, when actually it is still in use.
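The "false hit" drawback can be made concrete with a small sketch of the conservative marking step. It is an illustration only and not the cited collector: heap blocks are represented by simulated address ranges, only a single pass over the root words is modeled (a real mark phase would also follow pointers found inside marked blocks), and the struct heap_block, conservative_mark, and count_unmarked names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A heap allocation described by a (simulated) address range. */
struct heap_block {
    uintptr_t start;
    size_t    size;
    int       marked;   /* set when some root word appears to reference it */
};

/* Conservative assumption: every root word might be a pointer, so any word
   whose value falls inside a block marks that block as still referenced. */
void conservative_mark(const uintptr_t *roots, size_t nroots,
                       struct heap_block *blocks, size_t nblocks)
{
    size_t r, b;
    assert(roots != NULL || nroots == 0);
    for (r = 0; r < nroots; r++)
        for (b = 0; b < nblocks; b++)
            if (roots[r] >= blocks[b].start &&
                roots[r] - blocks[b].start < blocks[b].size)
                blocks[b].marked = 1;
}

/* Blocks left unmarked are reported as leaked. */
size_t count_unmarked(const struct heap_block *blocks, size_t nblocks)
{
    size_t b, n = 0;
    for (b = 0; b < nblocks; b++)
        if (!blocks[b].marked)
            n++;
    return n;
}
```

If a root word holding an ordinary integer happens to equal an address inside a leaked block, that block is marked and its leak goes unreported; the larger the block, the more integer values can collide with it.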
(A call-chain is the state of the stack at some point in a program's execution; it is composed of a sequence of function names; functions higher in the call-chain call (possibly indirectly) the functions lower in the call chain; neighbors in the call-chain share a direct caller-callee relationship. A partial call-chain is a subset of the current complete call-chain, usually taken from the bottom of the complete call chain; partial call-chains are usually employed to reduce storage requirements.)
Zorn and Hilfinger's "mprof" takes a notably different approach to detecting storage-leak errors {Zorn:88}. During the analyzed program's execution, mprof maintains a table of partial call-chains, with each table entry containing a count of how many malloc()'s and free()'s have occurred to storage whose call-chains terminated with that sequence. Detecting storage-leak errors then involves adjusting the appropriate counts at calls to malloc() and free(). At a malloc(), the current call-chain is used to increment the appropriate malloc() counter. At a free(), a hidden pointer in the header of the freed allocation is used to increment the corresponding free() counter. At program termination, detection of storage-leak errors involves reporting the partial call-chains whose malloc() and free() counts differ.
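The counting scheme described above might be sketched as follows. This is not mprof's code: the call chain is assumed to be supplied by the caller as a string literal (mprof derives it from the stack), the table has a fixed illustrative capacity, and prof_malloc, prof_free, and prof_leaky_sites are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SITES 16   /* illustrative table capacity */

struct site {
    const char   *chain;    /* partial call-chain, e.g. "main>parse" */
    unsigned long mallocs;
    unsigned long frees;
};

/* Hidden header placed in front of every allocation. The extra members only
   force an alignment suitable for the user data that follows. */
union header {
    size_t      site;
    long double pad_ld;
    long long   pad_ll;
    void       *pad_p;
};

static struct site sites[MAX_SITES];
static size_t nsites;

/* Finds or creates the table entry for a call chain. */
static size_t site_index(const char *chain)
{
    size_t i;
    for (i = 0; i < nsites; i++)
        if (strcmp(sites[i].chain, chain) == 0)
            return i;
    assert(nsites < MAX_SITES);
    sites[nsites].chain = chain;
    return nsites++;
}

void *prof_malloc(size_t n, const char *chain)
{
    union header *h;
    if (n > SIZE_MAX - sizeof *h)
        return NULL;
    h = malloc(sizeof *h + n);
    if (h == NULL)
        return NULL;
    h->site = site_index(chain);
    sites[h->site].mallocs++;        /* count against the allocating chain */
    return h + 1;
}

void prof_free(void *p)
{
    union header *h;
    if (p == NULL)
        return;
    h = (union header *)p - 1;       /* recover the hidden header */
    sites[h->site].frees++;
    free(h);
}

/* Number of call chains whose malloc and free counts differ. */
size_t prof_leaky_sites(void)
{
    size_t i, n = 0;
    for (i = 0; i < nsites; i++)
        if (sites[i].mallocs != sites[i].frees)
            n++;
    return n;
}
```

Because the free() count is charged to the chain that allocated the storage, a mismatch identifies where leaked storage was allocated rather than where it was lost.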
Unlike conservative collection, the mprof technique does not suffer from false hits; that is, a true storage-leak error will always be detected. In addition, mprof provides a wealth of other information useful for optimizing a program's memory usage. The primary disadvantage of the mprof technique (compared to conservative collection) is that storage-leak diagnostics may only be gathered after execution completes, and many programs do not deallocate storage until program termination (e.g., in C, the call to exit() will ensure that all the program's resources are deallocated/reclaimed). This behavior can yield many (arguably) false indications of storage-leak errors.
None of the above methods provide the detection of temporal and spatial errors needed in the sophisticated programming environments of today without either impacting the flexibility of the programming language or overlaying an oppressive amount of overhead. What is needed is a method of detecting memory access errors which can operate over a variety of programming languages while having minimal impact on program execution.