1. Field of the Invention
The present invention generally relates to data processing systems. In particular, methods and systems in accordance with the present invention generally relate to the checking of array bounds in a programming environment such as C or C++.
2. Background
Computers are increasingly important in today's society, and software used to control these computers is typically written in a programming language. C, C++ and other similar variations are examples of widely used programming languages. The programming language C is described, for example, in detail in Al Kelley et al., “A Book on C,” Addison-Wesley, 1997, which is incorporated herein by reference. In developing software, typically a software developer writes code, referred to as “source code,” in a programming language, and the source code is compiled by a compiler into “object code” that can be run by a machine.
A common source of programming errors in many programming languages arises from accessing memory outside of a valid range. A common error of this kind involves over-indexing or under-indexing an array, i.e., attempting to access an array outside of its range. An array is a data structure that is commonly allocated in memory in programming environments such as C or C++. An array is a collection of typically identically-typed data items distinguished by their indices. Each item in an array is called an “array element.” For example, there may be an array of integers, characters or anything that has a defined data type. Typical exemplary characteristics of an array may include (but are not required to include): (1) each element having the same data type (although they may have different values), and (2) the array being stored contiguously in memory. Arrays are generally appropriate for storing data to be accessed in an unpredictable order, in contrast to lists, which are best when accessed sequentially.
Additionally, arrays may have more than one dimension. The number of dimensions an array can have depends on the programming language. A one-dimensional array is called a “vector;” a two-dimensional array is called a “matrix.” A single ordinary variable (a “scalar”) could be considered as a zero-dimensional array. A reference to an array element may typically be written in the form A[i][j][k] where A is the array name and i, j and k are the indices.
The problem of accessing memory outside of a valid array range may apply to many programming languages which enable a user to dynamically allocate memory for arrays during run-time such that dynamic (run-time) or static (compile-time) error checking is difficult, as the dynamic allocation is separated from the semantics of use. After the memory space for an array has been allocated to the array, it may be difficult to determine whether an access or reference to an element of the array is within the valid memory range allocated to the array, especially when the array is allocated dynamically at run-time. Dynamic allocation is allocation at run-time and may not necessarily be determined before the running of the program, such as when the allocation for the array is based on a variable determined during the running of the program.
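By way of illustration only, the following C sketch shows an array whose size is determined during the running of the program. The names make_array and read_element are hypothetical and are not taken from any particular system. The sketch assumes that the caller keeps the element count alongside the pointer, because the pointer returned by the allocator does not itself record how many elements were allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocates an int array whose length n is only
   known at run time (for example, computed from program input).
   calloc zero-initializes the storage and returns NULL on failure. */
int *make_array(size_t n)
{
    return calloc(n, sizeof(int));
}

/* The type int* carries no record of n, so a range check is possible
   only if the length travels with the pointer. Returns 1 and stores
   the element in *out when i is in range; returns 0 otherwise. */
int read_element(const int *a, size_t n, size_t i, int *out)
{
    assert(out != NULL);
    if (a == NULL || i >= n)
        return 0;            /* out of range: refuse the access */
    *out = a[i];
    return 1;
}
```

Without the separately tracked length n, nothing at the point of use distinguishes a valid index from an invalid one, which is the separation of allocation from use noted above.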
As an example, in conventional compilers, consider a static array definition and reference in C or C++:
int x, array1[100];
. . .
x=array1[50];
Upon recognizing the definition of array1, the compiler generates code to establish storage for 100 elements having an appropriate size for integer elements, storing in a table information regarding the array name, starting address, type and size of each element, the number of elements, and the allowed range of index values for accessing elements of array1. Upon recognizing the reference to “array1[50],” the compiler typically calculates an address for the referenced element by first calculating an offset value from the initial storage address for the array, and then adding the offset value to the starting address. For this example, the reference to “array1[50]” is within the bounds of array1 as defined, and thus should result in an appropriate access to the desired array element.
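The address calculation described above may be sketched in C as follows. The function name element_address is hypothetical. The sketch assumes a one-dimensional array with a lowest index of 0, so that the offset is the index multiplied by the element size.

```c
#include <assert.h>
#include <stddef.h>

/* Computes the address of element `index`:
   offset = index * element size, address = starting address + offset.
   Byte-wise arithmetic is done through a char pointer. */
const void *element_address(const void *start, size_t elem_size, size_t index)
{
    assert(start != NULL);
    size_t offset = index * elem_size;
    return (const char *)start + offset;
}
```

For an array declared as int array1[100], calling element_address(array1, sizeof array1[0], 50) yields the same address as &array1[50].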
For the example discussed above, consider a reference:
x=array1[200];
Since the possible index values for array1 range from 0 to 99, i.e., 100 elements, an index value of 200 is outside the bounds of array1 as defined, and if code for this reference is generated and executed, accessing the information stored at the address so generated may lead to undesirable results.
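As a sketch only, the range test implied by the example above can be expressed in C using the kind of per-array information the compiler is described as storing in a table. The structure array_info, its field names, and the function index_in_bounds are hypothetical and are introduced purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of per-array information of the kind described
   above: name, element size, and the allowed range of index values. */
struct array_info {
    const char *name;
    size_t      elem_size;
    long        min_index;   /* lowest valid index, e.g. 0   */
    long        max_index;   /* highest valid index, e.g. 99 */
};

/* Returns 1 when index lies within the allowed range, 0 otherwise. */
int index_in_bounds(const struct array_info *info, long index)
{
    assert(info != NULL);
    return index >= info->min_index && index <= info->max_index;
}
```

With a record of { "array1", sizeof(int), 0, 99 }, the index 50 passes this test and the index 200 does not.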
Conventional approaches to array bounds checking catch invalid memory references in general, but typically do not check a reference against the memory space of the particular array being accessed. One class of programs that catch invalid memory references are referred to as “malloc” debuggers, which catch bugs related to memory allocated on the heap through the function “malloc( ).” An exemplary malloc debugger is “Electric Fence” by Bruce Perens, and more information on Electric Fence is currently available at the URL http://perens.com/FreeSoftware.
Electric Fence replaces the default implementation of malloc with a version that allocates data in a way that helps catch overrun or underrun errors in a program. Electric Fence works by aligning allocated memory so that it is immediately followed by unmapped memory. Unmapped memory is ordinary memory to which the operating system has been instructed to deny all access. When a program reads or writes unmapped memory, the operating system sends a signal to the program that typically results in the program being terminated.
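The general technique described above may be sketched as follows for a POSIX system. This is not the source code of Electric Fence; the function name guarded_alloc is hypothetical, and the sketch assumes that mmap with MAP_ANONYMOUS and mprotect are available. The allocation is placed so that its last byte is immediately followed by a page to which all access is denied.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical allocator: returns `size` writable bytes whose end abuts
   an inaccessible guard page, or NULL on failure. The returned pointer
   may not be suitably aligned when size is not a multiple of the
   required alignment; this sketch does not address that trade-off. */
void *guarded_alloc(size_t size)
{
    long page_l = sysconf(_SC_PAGESIZE);
    if (size == 0 || page_l <= 0)
        return NULL;
    size_t page = (size_t)page_l;

    /* Whole pages needed for the data, plus one page for the guard. */
    size_t data_pages = (size + page - 1) / page;
    size_t total = (data_pages + 1) * page;

    char *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* The guard page is the last page of the mapping. */
    char *guard = base + data_pages * page;
    assert((uintptr_t)guard % page == 0);

    /* Deny all access to the guard page. */
    if (mprotect(guard, page, PROT_NONE) != 0) {
        munmap(base, total);
        return NULL;
    }

    /* Place the allocation so that it ends exactly where the guard begins. */
    return guard - size;
}
```

Under these assumptions, a read or write one byte past the end of the returned block touches the guard page, and the operating system delivers a signal to the program. An overrun that lands on some other accessible memory is not caught by this scheme.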
Another conventional approach that overcomes some of the previously mentioned problems is binary instrumentation. In this approach, the executable program is modified so that loads and stores are replaced with instructions that cause a “trap,” also known as an “interrupt.” A trap or interrupt may be a signal informing a system that an event has occurred. When a system receives an interrupt signal, it may take a specified action (which can be to ignore the signal). Interrupt signals can cause a program to suspend itself temporarily to service the interrupt. In the course of processing the trap or interrupt, the system determines the address that the load or store would have used, validates that the address is legal, and then emulates the load or store if the address is legal. Two exemplary products that take this approach are “Purify” from Rational and “RTC” from Sun Microsystems, Inc. See also, Sun Microsystems, “Debugging a Program With dbx,” March 2004, Rev A, part number 817-5063-10. Furthermore, the CHECK command activates the RTC feature in Sun Microsystems' debugger.
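The validation step of such an approach may be sketched in C as follows. This is an assumption-laden illustration and not the implementation of Purify or RTC: the structure region and the function address_is_legal are hypothetical, and addresses are modeled as plain integers. The check asks only whether an access falls inside some known valid block of memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one valid block of memory. */
struct region {
    uintptr_t start;   /* first valid address        */
    size_t    len;     /* number of valid bytes      */
};

/* Returns 1 when the access [addr, addr + size) lies entirely inside
   any one of the known regions, 0 otherwise. The check does not know
   which program element originated the access. */
int address_is_legal(const struct region *regions, size_t count,
                     uintptr_t addr, size_t size)
{
    assert(regions != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        uintptr_t s = regions[i].start;
        if (addr >= s && size <= regions[i].len &&
            addr - s <= regions[i].len - size)
            return 1;
    }
    return 0;
}
```

Because the test is made against all valid blocks rather than against the block belonging to the referenced variable, an address just past the end of one block is accepted whenever another valid block begins there.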
This approach is typically superior to the malloc debuggers in several respects. First, it can perform protection at arbitrary granularity. Second, it can validate any memory reference and not just those referring to addresses on the heap. However, it does not associate memory references with the program element that originated the reference, and so it is still susceptible to the type of memory overrun that happens to hit a valid block of memory. For example, in the following situation:
REAL x(10), y(10)
a=x(11)
The reference to x(11) is out of bounds for x, but hits a legal memory address (the address for y(1) because y immediately follows x in memory). Because binary instrumentation does not associate memory references with the program elements that generated them, it does not detect that the reference to x(11) is illegal. If it associated memory references with the program element that generated them, then it would associate the reference to x(11) with x and notice that the memory reference is out of bounds for x. However, as it is, the reference to x(11) hits a legal place in memory so the overrun is not detected.
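A C analog of this situation may be sketched as follows. The structure pair and the function overrun_lands_on_y are hypothetical. The sketch assumes, as is the case on typical compilers, that no padding is inserted between two adjacent members of the same array type. Because C indices start at 0, the Fortran reference x(11) corresponds to the C index 10. The sketch compares addresses only and never dereferences the out-of-bounds location.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical analog of REAL x(10), y(10): y is declared
   immediately after x within the same structure. */
struct pair {
    float x[10];
    float y[10];
};

/* Returns 1 when the address one element past the end of x is the
   address at which y begins, i.e. when an overrun of x by one element
   would land on valid storage belonging to y. */
int overrun_lands_on_y(const struct pair *p)
{
    assert(p != NULL);
    const char *base    = (const char *)p;
    const char *past_x  = base + offsetof(struct pair, x) + sizeof p->x;
    const char *start_y = base + offsetof(struct pair, y);
    return past_x == start_y;
}
```

When the function returns 1, a checker that validates only whether an address is legal has no basis for rejecting the overrun, since the address belongs to y.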
Some programming languages permit pointers to serve both as pointers and arrays, sometimes called “overloading” the pointer. In these languages, pointers may be referenced in code as an array using array syntax. For example, a pointer ab may be referenced as a pointer *ab or an array ab[j]. Other languages do not permit pointers to be accessed or referenced as an array in such a manner. In these languages, arrays are treated as arrays, and pointers are treated as pointers, and each is accessed and referenced as such accordingly. In these languages, array syntax is not used to access a pointer. Programming environments that permit pointers to serve as pointers and arrays may create additional difficulties related to array bounds checking.
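In C, for example, the two forms of reference can be sketched as follows. The function names read_as_pointer and read_as_array are hypothetical. In C, the expression ab[j] is defined to be equivalent to *(ab + j), so both functions read the same element.

```c
#include <assert.h>
#include <stddef.h>

/* Reads element j through the pointer using pointer syntax. */
int read_as_pointer(const int *ab, size_t j)
{
    assert(ab != NULL);
    return *(ab + j);
}

/* Reads element j through the same pointer using array syntax. */
int read_as_array(const int *ab, size_t j)
{
    assert(ab != NULL);
    return ab[j];
}
```

In either form the function receives only a pointer, with no indication of how many elements it refers to, which illustrates one source of the additional difficulty noted above.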
Therefore, a need has long existed for a method and system that overcome the problems noted above and others previously experienced.