Most computer users have experienced times when their computer seems to have “lost its mind” and starts behaving in unexplainable ways. For example, sometimes we command the computer to do something, but instead of doing what we ask, the computer “stops responding” and needs to be “rebooted” (i.e., turned off and back on again). This process can waste significant time while the computer restarts. Work product is sometimes lost, frustrating users to no end.
Ultimately, most such problems are caused by programming errors (sometimes called “bugs”). As computer programs become increasingly complex, it is more difficult for the people writing the computer code to take into account every possible condition that the computer program may encounter. Unfortunately, a computer program will “break” if the code encounters an undefined condition it does not “know” how to handle. This can cause serious problems. Consider, for example, software that controls an airplane autopilot, a missile guidance system, or a hospital life support system.
Another range of problems relates to attackers taking advantage of undefined computer program behavior to do harm. Several of the undefined behaviors of C and C++ have received much attention in the popular press as well as technical journals, because their effects have inflicted billions of dollars of damage in the USA and worldwide. In particular, the “buffer overflow” (also known as “buffer overrun”) and “null pointer indirection” behaviors have created vulnerabilities in widely-used software from many different vendors. This problem of buffer overflows is no longer an obscure technical topic. This is the vulnerability through which most worms and viruses attack. The worldwide total costs due to malicious hacker attacks during 2002 have been estimated to be between 40 and 50 billion USD; costs for 2003 were estimated between 120 and 150 billion USD. See e.g., David Berlind, “Ex-cybersecurity czar Clarke issues gloomy report card” (ZDNet TechUpdate Oct. 22, 2003).
An international standard has been developed for the programming language C, which is designated ISO/IEC 9899:2002(E). Similarly, an international standard has been developed for the programming language C++, which is designated ISO/IEC 14882:2003(E). Each of these standards defines certain situations using the category of “undefined behavior”. The C Standard contains the following definition: “3.4.3 undefined behavior: behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements. NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).” The C++ Standard contains a similar definition: “1.3.12 undefined behavior: behavior, such as might arise upon use of an erroneous program construct or erroneous data, for which this International Standard imposes no requirements. Undefined behavior may also be expected when this International Standard omits the description of any explicit definition of behavior. [Note: permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message). Many erroneous program constructs do not engender undefined behavior; they are required to be diagnosed.]”
Some undefined behaviors can be eliminated by using techniques already known in the current art. The next sections describe some exemplary techniques of this kind.
Design-time Choices [dt]
Several undefined behaviors can be addressed by design choices; these undefined behaviors are marked with “dt” in column one of the table below. In general, the guiding principle behind these design choices is that non-portable behavior is generally not as bad as undefined (unsafe) behavior. For example, byte-ordering affects the numeric value of results, but so long as address bounds are not exceeded, byte-ordered integer values produce something well-defined on each hardware platform.
        a. The representation of a null pointer can be all-bits-zero.
        b. The representation of pointers can be binary two's-complement with non-signaling wraparound.
        c. Every possible binary value can be interpreted as a valid data element. Every data value can be fetched safely; in that sense, there are no “trap representations”. A “trap” can result if fetch or store of an invalid pointer is attempted, but not upon calculation or comparison of addresses. Therefore, uninitialized memory can be fetched safely. An incompletely-read buffer after a read error (such as in Standard C subclauses 7.19.7.2, 7.19.7.7, 7.24.3.2, etc.) still contains data bytes which will not cause traps upon fetch.
        d. A request to the allocation functions malloc and calloc to allocate zero bytes can cause the allocation of the smallest non-zero allocation.
        e. If the number-of-elements argument is zero, string and wide-string and sorting and searching functions can gracefully do nothing.
        f. The sorting and searching functions can be limited to no more than an implementation-defined maximum number of iterations.
        g. The algorithms for converting between wide characters and (narrow) characters can produce deterministic results for all inputs, in either direction. Therefore, when a stream was written wide-oriented and read byte-oriented, the behavior can be implementation-defined and not undefined, and similarly for a stream written byte-oriented and read wide-oriented.
        h. The wcstok function can be implemented so that, if it is invoked with a null pointer, the pointer argument need not be equal to the pointer argument of the previous call; the function can require only that the “saved” pointer designate some non-const, null-terminated array of characters.
        i. The wcstok and strtok functions can be implemented so that, if the first invocation passes a null pointer, the function can ignore it and return a null pointer; alternatively, the function can invoke a safe termination such as ss_unwind (see below).
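Items d and e above can be illustrated with thin wrappers around the standard library. This is a minimal sketch under stated assumptions, not the implementation described herein: the names ss_malloc, ss_bsearch, and ss_cmp_int are hypothetical, and the precondition check with assert stands in for whatever diagnostic a real implementation would choose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper for item d: a zero-byte request is served as the
   smallest non-zero allocation. */
void *ss_malloc(size_t n)
{
    return malloc(n == 0 ? 1 : n);
}

/* Hypothetical wrapper for item e: with zero elements the search does
   nothing and reports "not found" without examining key or base. */
void *ss_bsearch(const void *key, const void *base, size_t nmemb, size_t size,
                 int (*compar)(const void *, const void *))
{
    if (nmemb == 0)
        return NULL;
    /* Sketch-level precondition check for the non-empty case. */
    assert(key != NULL && base != NULL && compar != NULL);
    return bsearch(key, base, nmemb, size, compar);
}

/* Comparison helper, present only so that ss_bsearch can be exercised. */
int ss_cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}
```

With these wrappers, a call such as ss_bsearch(&key, NULL, 0, sizeof(int), ss_cmp_int) returns a null pointer instead of reaching the undefined case of an invalid pointer argument with zero elements.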
The methods shown in this section can be used to eliminate the following undefined behaviors:
SSM# | C-Std# | Description
dt | c7.20.4.6 | A command is executed through the system function in a way that is documented as causing termination or some other form of undefined behavior
dt | c7.20.5 | A searching or sorting utility function is called with an invalid pointer argument, even if the number of elements is zero
dt | c7.20.5 | The comparison function called by a searching or sorting utility function alters the contents of the array being searched or sorted, or returns ordering values inconsistently
dt | c7.20.5.1 | The array being searched by the bsearch function does not have its elements in proper order
dt | c7.20.6.1 | The abs, labs (or llabs, in C99) function would return a value that cannot be represented (absolute value of the most-negative twos-complement value)
dt | c7.20.6.2 | The div, ldiv (or lldiv, in C99) function would return a value that cannot be represented
dt | c7.20.7 | The current conversion state is used by a multibyte/wide character conversion function after changing the LC_CTYPE category
dt | c7.21.1, c7.24.4 | A string or wide string utility function is called with an invalid pointer argument, even if the length is zero
dt | c7.21.4.5, c7.23.3.5, c7.24.4.4.4, c7.24.5.1 | The contents of the destination array are used after a call to the strxfrm, strftime, wcsxfrm, or wcsftime function in which the specified length was too small to hold the entire null-terminated result
dt | c7.21.5.8, c7.24.4.5.7 | The first argument in the very first call to the strtok or wcstok function is a null pointer
dt | c7.24.2.11 | The argument corresponding to an s specifier without an l qualifier in a call to the fwprintf function does not point to a valid multibyte character sequence that begins in the initial shift state
dt | c7.24.4.5.7 | In a call to the wcstok function, the object pointed to by ptr does not have the value stored by the previous call for the same wide string
dt | c7.24.6 | An mbstate_t object is used inappropriately
dt | c7.25.1 | The value of an argument of type wint_t to a wide character classification or case mapping function is neither equal to the value of WEOF nor representable as a wchar_t
dt | c7.25.2.2.1 | The iswctype function is called using a different LC_CTYPE category from the one in effect for the call to the wctype function that returned the description
dt | c7.25.3.2.1 | The towctrans function is called using a different LC_CTYPE category from the one in effect for the call to the wctrans function that returned the description
dt | c7.4 | The value of an argument to a character handling function is neither equal to the value of EOF nor representable as an unsigned char
dt | c7.19.2 | A byte input/output function is applied to a wide-oriented stream, or a wide character input/output function is applied to a byte-oriented stream
dt | c7.13.2.1 | After a longjmp, there is an attempt to access the value of an object of automatic storage class with non-volatile-qualified type, . . .
dt | c7.13.2.1 | . . . local to the function containing the invocation of the corresponding setjmp macro, that was changed between the setjmp invocation and longjmp call
dt | c6.5.16.1 | An object containing no pointers is assigned to an inexactly overlapping object or to an exactly overlapping object with incompatible type
dt | c6.5.16.1 | An object containing pointers is assigned to an inexactly overlapping object or to an exactly overlapping object with incompatible type
dt | c7.14.1.1 | A signal occurs other than as the result of calling the abort or raise function, and the signal handler refers to an object with static storage duration other than by assigning a value to an object declared as volatile sig_atomic_t, or . . .
dt | c7.14.1.1 | . . . calls any function in the standard library other than the abort function, the _Exit function, or the signal function (for the same signal number)
dt | c6.2.6.1 | A trap representation is produced by a side effect that modifies any part of the object using an lvalue expression that does not have character type
dt | c6.2.6.1 | A trap representation is read by an lvalue expression that does not have character type
dt | c6.3.1.4 | Conversion to or from an integer type produces a value outside the range that can be represented
dt | c6.3.1.5 | Demotion of one real floating type to another produces a value outside the range that can be represented
dt | c6.4.5 | The program attempts to modify a string literal
dt | c6.5 | Between two sequence points, an object is modified more than once, or is modified and the prior value is read other than to determine the value to be stored
dt | c6.5.6 | The result of subtracting two pointers is not representable in an object of type ptrdiff_t
dt | c6.5.7 | An expression having signed promoted type is left-shifted and either the value of the expression is negative or the result of shifting would not be representable in the promoted type
dt | c6.5.7 | An expression is shifted by a negative number or by an amount greater than or equal to the width of the promoted expression
dt | c6.5acc | An object has its stored value accessed other than by an lvalue of an allowable type
dt | c6.7.3 | An attempt is made to modify an object defined with a const-qualified type through use of an lvalue with non-const-qualified type
dt | c6.7.3 | An attempt is made to refer to an object defined with a volatile-qualified type through use of an lvalue with non-volatile-qualified type
dt | c6.7.8 | The value of an unnamed member of a structure or union is used
dt | c6.9.1 | The } that terminates a function is reached, and the value of the function call is used by the caller
dt | c7.11.1.1 | The program modifies the string pointed to by the value returned by the setlocale function
dt | c7.11.2.1 | The program modifies the structure pointed to by the value returned by the localeconv function
dt | c7.13.2.1 | The longjmp function is invoked to restore a nonexistent environment
dt | c7.14.1.1 | A signal handler returns when the signal corresponded to a computational exception
dt | c7.14.1.1 | A signal is generated by an asynchronous signal handler
dt | c7.14.1.1 | A signal occurs as the result of calling the abort or raise function, and the signal handler calls the raise function
dt | c7.14.1.1 | The value of errno is referred to after a signal occurred other than as the result of calling the abort or raise function and the corresponding signal handler obtained a SIG_ERR return from a call to the signal function
dt | c7.19.5.2 | The stream for the fflush function points to an input stream or to an update stream in which the most recent operation was input
dt | c7.19.6.1, c7.19.6.2, c7.24.2.1, c7.24.2.2 | A % conversion specifier is encountered by one of the formatted input/output functions, but the complete conversion specification is not exactly %%
dt | c7.19.6.2, c7.24.2.2 | A c, s, or [ conversion specifier with an l qualifier is encountered by one of the formatted input functions, but the input is not a valid multibyte character sequence that begins in the initial shift state
dt | c7.19.7.2, c7.19.7.7, c7.24.3.2 | The contents of the array supplied in a call to the fgets, gets, or fgetws function are used after a read error occurred
dt | c7.19.8.1 | A partial element read by a call to the fread function is used
dt | c7.19.8.1, c7.19.8.2 | The file position indicator for a stream is used after an error occurred during a call to the fread or fwrite function
dt | c7.20.3 | A non-null pointer returned by a call to the calloc, malloc, or realloc function with a zero requested size is used to access an object
dt | c7.20.3.3 | The value of the object allocated by the malloc function is used
dt | c7.20.3.4 | The value of any bytes in a new object allocated by the realloc function beyond the size of the old object are used
dt | c7.20.4.5, c7.21.6.2 | The string set up by the getenv or strerror function is modified by the program

Text Streams and Character Representations [code]
An exemplary implementation can use a specific choice among the Unix/POSIX/Linux encoding of text files (with LF line terminators), the Macintosh encoding of text files (with CR line terminators), or the Microsoft Windows encoding of text files (with CR/LF line terminators). All mbstate_t conversions can produce implementation-defined results, even after changing the LC_CTYPE category.
An implementation can make truncated-result behavior well-defined in strxfrm, strftime, wcsxfrm, or wcsftime.
The multibyte functions can behave gracefully when given a sequence not in the initial shift state, or when given any mbstate_t object.
The wide-character classifying and conversion functions can be well-defined for any wint_t input and for any LC_CTYPE setting.
The methods shown in this section can be used to eliminate the following undefined behaviors:
SSM# | C-Std# | Description
code | c7.19.2 | Use is made of any portion of a file beyond the most recent wide character written to a wide-oriented stream
code | c7.19.6.1, c7.19.6.2, c7.23.3.5, c7.24.2.1, c7.24.2.2, c7.24.5.1 | The format in a call to one of the formatted input/output functions or to the strftime or wcsftime function is not a valid multibyte character sequence that begins and ends in its initial shift state

Secure Library [slib]
The secure library enhancements proposed to ISO/IEC JTC 1 SC22/WG14 will eliminate many opportunities for undefined behavior. Furthermore, if a formatted I/O function produces more than INT_MAX chars of output, then it can return INT_MAX.
The methods shown in this section can be used to eliminate the following undefined behaviors:
SSM# | C-Std# | Description
slib | c7.19.6.1, c7.19.6.3, c7.19.6.8, c7.19.6.10 | The number of characters transmitted by a formatted output function is greater than INT_MAX

Ss_unwind [longj]
The longjmp function (and any other functions which “unwind” the stack) can check whether execution of atexit-registered functions has started. If so, one of the following implementation-defined actions can be performed: cause a return from the function that invoked the unwind or longjmp function; invoke an “extreme exit” cleanup function; or invoke the abort function. Optionally, at the point of catching the ss_unwind, a system sanity check can be performed before continuing or re-starting.
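The check described above might look roughly like the following sketch. It is an illustration only: the flag ss_exit_started, the wrapper ss_longjmp, and the status value SS_UNWIND_REFUSED are hypothetical names, and the sketch assumes that the implementation's exit function sets the flag before running the first atexit-registered function. Of the listed actions it shows only the first one, a return to the caller.

```c
#include <assert.h>
#include <setjmp.h>

/* Hypothetical flag: assumed to be set by the implementation's exit()
   before it runs the first atexit-registered function. */
volatile int ss_exit_started = 0;

/* Hypothetical status reported when the unwind is not performed. */
enum { SS_UNWIND_REFUSED = -1 };

/* Hypothetical checked longjmp. While atexit-registered functions are
   running, it returns to its caller instead of unwinding the stack;
   otherwise it behaves like longjmp and does not return. */
int ss_longjmp(jmp_buf env, int val)
{
    if (ss_exit_started)
        return SS_UNWIND_REFUSED;
    longjmp(env, val);
}
```

A caller inside an atexit-registered function can test the return value and fall back to an ordinary return, so the registered function is never terminated by the jump.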
The methods shown in this section can be used to eliminate the following undefined behavior:
SSM# | C-Std# | Description
longj | c7.20.4.3 | During the call to a function registered with the atexit function, a call is made to the longjmp function that would terminate the call to the registered function

Special Behavior of atexit functions [atex]
The exit function can check whether execution of the exit function has previously started. If so, one of the following implementation-defined actions can be performed: invoke an “extreme exit” cleanup function; or invoke the abort function.
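A minimal sketch of this re-entry check follows. The names ss_exit_enter and ss_exit are hypothetical, the sketch chooses the abort action from the two listed, and it ignores concurrency (a real implementation would need an atomic flag if exit can be called from more than one thread).

```c
#include <assert.h>
#include <stdlib.h>

enum ss_exit_action { SS_EXIT_PROCEED, SS_EXIT_ABORT };

/* Hypothetical helper: records that exit processing has begun and
   reports what a call to exit should do. Only the first call proceeds. */
enum ss_exit_action ss_exit_enter(void)
{
    static int started = 0;
    if (started)
        return SS_EXIT_ABORT;
    started = 1;
    return SS_EXIT_PROCEED;
}

/* Hypothetical checked exit: a second call, for example from inside an
   atexit-registered function, invokes abort instead of re-entering exit. */
void ss_exit(int status)
{
    if (ss_exit_enter() == SS_EXIT_ABORT)
        abort();
    exit(status);
}
```

Separating the decision (ss_exit_enter) from the action makes the policy easy to swap for the “extreme exit” cleanup alternative.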
The methods shown in this section can be used to eliminate the following undefined behavior:
SSM# | C-Std# | Description
atex | c7.20.4.3 | The program executes more than one call to the exit function

Arithmetic Exceptions [exc]
If at compile-time the right operand of division or remainder is zero, a fatal diagnostic message can be produced. In Debug mode, if at run-time the right operand of division or remainder is zero, an “unwind” (such as ss_unwind) can be invoked, and the implementation may throw an exception of an implementation-defined type. In non-Debug mode, if at run-time the right operand of division or remainder is zero, the result can be the maximum value of the result type, which for a floating-point type may be an infinity.
If at compile-time the left operand of division or remainder is the maximum negative value of its type and the right operand is −1, a fatal diagnostic message can be produced. In Debug mode, if at run-time the left operand of division or remainder is the maximum negative value of its type and the right operand is −1, an “unwind” (such as ss_unwind) can be invoked, and the implementation may throw an exception of an implementation-defined type. In non-Debug mode, if at run-time the left operand of division or remainder is the maximum negative value of its type and the right operand is −1, the result can be the maximum value of the result type.
If at compile-time the result of an integral arithmetic operation is too large for its type, a fatal diagnostic message can be produced. In Debug mode, if at run-time the result of an integral arithmetic operation is too large for its type, an “unwind” (such as ss_unwind) can be invoked, and the implementation may throw an exception of an implementation-defined type. In non-Debug mode, if at run-time the result of an integral arithmetic operation is too large for its type, the result can be the value of the twos-complement operation with wrap-around.
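The non-Debug run-time rules in the three paragraphs above can be sketched for the int type as follows. The function names are hypothetical. The wrap-around addition relies on unsigned arithmetic, which is defined modulo 2^N; converting the out-of-range result back to int is implementation-defined in Standard C, and the sketch assumes a twos-complement compiler that wraps on that conversion (as GCC and Clang document).

```c
#include <assert.h>
#include <limits.h>

/* Non-Debug division: a zero divisor, or INT_MIN / -1, yields the
   maximum value of the result type instead of undefined behavior. */
int ss_div_int(int a, int b)
{
    if (b == 0 || (a == INT_MIN && b == -1))
        return INT_MAX;
    return a / b;
}

/* Non-Debug remainder, with the same two special cases. */
int ss_rem_int(int a, int b)
{
    if (b == 0 || (a == INT_MIN && b == -1))
        return INT_MAX;
    return a % b;
}

/* Non-Debug addition: twos-complement result with wrap-around.
   Assumption: unsigned-to-int conversion wraps on this compiler. */
int ss_add_wrap(int a, int b)
{
    return (int)((unsigned int)a + (unsigned int)b);
}
```

In Debug mode the same tests would instead lead to an unwind such as ss_unwind rather than a substituted result.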
The methods shown in this section can be used to eliminate the following undefined behaviors:
SSM# | C-Std# | Description
exc | c6.5.5 | The value of the second operand of the / or % operator is zero
exc | c6.5exc | An exceptional condition occurs during the evaluation of an expression

Control of Dangling Pointers [dang]
One category of undefined behavior arises from accessing freed storage. Furthermore, each freed pointer must previously have been allocated.
These undefined behaviors can be eliminated by use of garbage collection, either conservative (see, e.g., Hans-J. Boehm, “A Garbage Collector for C and C++” (http://www.hpl.hp.com/personal/Hans_Boehm/gc/)) or accurate (see, e.g., Fergus Henderson, “Accurate Garbage Collection in an Uncooperative Environment”, ISMM'02, June 20-21, 2002, Berlin, Germany, ACM 1-58113-539-4/02/0006), supplemented with the following special treatment of pointers to terminated stack frames. Directly assigning an address in the current function's stack frame to a longer-life pointer can be prohibited. Define a pointer-retainer function as a function which stores a pointer argument in heap or static storage. Passing a pointer to stack to a pointer-retainer function can be prohibited. (Whatever data resides in the stack can be copied to heap or to static, to avoid the prohibition.)
Memory that could contain pointers can be initialized to zeroes. Therefore, as in Boehm conservative garbage collection, malloc allocates space that might have pointers in it, so the space is zero-filled. There can be a new attribute to describe a state named, e.g., “not_ptrs” for any storage which is guaranteed not to contain pointers, and a different version of malloc can be used for such storage (equivalent to GC_malloc_atomic in the Boehm library):
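The stack-frame rule can be illustrated with a small sketch. All names here (retain, get_retained, retain_copy, record_name) are hypothetical. The sketch shows a pointer-retainer function, the call that the rule would prohibit (left as a comment), and the permitted alternative of copying the stack data to the heap first. The heap copy is never freed because the text assumes a garbage collector reclaims it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Static storage that outlives every stack frame. */
static const char *retained;

/* A pointer-retainer function: it stores its argument in static storage. */
void retain(const char *p)
{
    assert(p != NULL);
    retained = p;
}

const char *get_retained(void)
{
    return retained;
}

/* Permitted pattern: copy the stack-resident data to the heap, then
   retain the heap pointer. Returns 0 on success, -1 if allocation fails. */
int retain_copy(const char *stack_data)
{
    size_t n = strlen(stack_data) + 1;
    char *heap = malloc(n);
    if (heap == NULL)
        return -1;
    memcpy(heap, stack_data, n);
    retain(heap);
    return 0;
}

int record_name(void)
{
    char local[16] = "frame-local";
    /* retain(local); would be prohibited: local ends with this frame. */
    return retain_copy(local);
}
```

After record_name returns, the retained pointer still designates live storage, which is the property the prohibition is meant to guarantee.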
void * malloc_not_ptrs(size_t n);
If storage with the not_ptrs attribute is cast to pointer-to-anything, then a fatal diagnostic message can be produced. The not_ptrs attribute can be removed from any storage by assigning zero to the bytes of the storage; a byte-oriented alias is mandatory (char, or unsigned char, or a library function such as memset which modifies the bytes of memory).
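The run-time half of this scheme can be sketched as below. Only malloc_not_ptrs appears in the text; ss_malloc_ptrs and ss_clear_not_ptrs are hypothetical names, and plain calloc, malloc, and memset stand in for the collector's allocators. Tracking of the not_ptrs attribute itself would happen in the compiler and is not modeled here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator for storage that might hold pointers:
   the space is zero-filled, so it never contains stray pointer values. */
void *ss_malloc_ptrs(size_t n)
{
    return calloc(1, n == 0 ? 1 : n);
}

/* Allocator for storage guaranteed not to contain pointers;
   no zero-fill is needed. */
void *malloc_not_ptrs(size_t n)
{
    return malloc(n == 0 ? 1 : n);
}

/* Hypothetical helper: zeroing the bytes through a byte-oriented
   function is what allows the not_ptrs attribute to be removed. */
void ss_clear_not_ptrs(void *p, size_t n)
{
    assert(p != NULL);
    memset(p, 0, n);
}
```

Storage obtained from malloc_not_ptrs could later hold pointers only after a call such as ss_clear_not_ptrs, matching the byte-oriented zeroing rule above.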
An alternative method for prevention of dangling pointers is known (see e.g., Todd M. Austin et al., Efficient Detection of All Pointer and Array Access Errors, Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation, June 1994), which is a feasible solution for an implementation which operates entirely in BSAFE mode (see below).
The methods shown in this section can be used to eliminate the following undefined behaviors:
SSM# | C-Std# | Description
dang | c7.20.3.2, c7.20.3.4 | The pointer argument to the free or realloc function does not match a pointer earlier returned by calloc, malloc, or realloc, or the space has been deallocated by a call to free or realloc
dang | c7.20.3 | The value of a pointer that refers to space deallocated by a call to the free or realloc function is used
dang | c6.2.4 | An object is referred to outside of its lifetime
dang | c6.2.4 | The value of a pointer to an object whose lifetime has ended is used

Exclusion of C 1999 Extensions [c99]
The exemplary implementation described herein does not specifically address those extensions added in the 1999 revision of C which are not features of C++. Further refinements can be produced to address the undefined behaviors related to those extensions:
SSM# | C-Std# | Description
c99 | c6.7.2.1 | An attempt is made to access, or generate a pointer to just past, a flexible array member of a structure when the referenced object provides no elements for that array
c99 | c6.7.4 | A function with external linkage is declared with an inline function specifier, but is not also defined in the same translation unit
c99 | c6.7.5.3 | A declaration of an array parameter includes the keyword static within the [ and ] and the corresponding argument does not provide access to the first element of an array with at least the specified number of elements
c99 | c7.3.4, c7.6.1, c7.12.2 | The CX_LIMITED_RANGE, FENV_ACCESS, or FP_CONTRACT pragma is used in any context other than outside all external declarations or preceding all explicit declarations and statements inside a compound statement
c99 | c7.12.3, c7.12.14 | An argument to a floating-point classification or comparison macro is not of real floating type
c99 | c7.15.1.2 | The va_copy macro is called to initialize a va_list that was previously initialized by va_start or va_copy without an intervening invocation of the va_end macro for the same va_list
c99 | c7.22 | A complex argument is supplied for a generic parameter of a type-generic macro that has no corresponding complex function
c99 | c7.22 | The type of an argument to a type-generic macro is not compatible with the type of the corresponding parameter of the selected function
c99 | c7.6.1 | Part of the program tests floating-point status flags, sets floating-point control modes, or runs under non-default mode settings, but was translated with the state for the FENV_ACCESS pragma “off”
c99 | c7.6.2 | The exception-mask argument for one of the functions that provide access to the floating-point status flags has a nonzero value not obtained by bitwise OR of the floating-point exception macros
c99 | c7.6.2.4 | The fesetexceptflag function is used to set floating-point status flags that were not specified in the call to the fegetexceptflag function that provided the value of the corresponding fexcept_t object
c99 | c7.6.4.3, c7.6.4.4 | The argument to fesetenv or feupdateenv is neither an object set by a call to fegetenv or feholdexcept, nor is it an environment macro
c99 | c7.8.2.1, c7.8.2.2, c7.8.2.3, c7.8.2.4, c7.20.6.1, c7.20.6.2, c7.20.1 | The value of the result of an integer arithmetic or conversion function cannot be represented
c99 | c6.7.3.1 | A restrict-qualified pointer is assigned a value based on another restricted pointer whose associated block neither began execution before the block associated with this pointer, nor ended before the assignment
c99 | c6.7.3.1 | An object which has been modified is accessed through a restrict-qualified pointer to a const-qualified type, or through a restrict-qualified pointer and another pointer that are not both based on the same object

It would be desirable to eliminate, with commercially acceptable efficiency, further undefined behaviors in the execution of programs in the “intersection” of C and C++; that is, in C programs which use only the features described in the C++ standard, and in C++ programs which use only the features described in the C standard.
It would furthermore be desirable to automate (e.g., through compiler design) techniques to provide safe secure development of software, including but not limited to techniques for addressing undefined behavior in the C and C++ programming languages.
Advantageous features provided by exemplary illustrative non-limiting implementations of the technology herein include:
        A Safe Secure Compiler (“SSC”) which produces Safe Secure Object Files or fatal diagnostic messages.
        A Safe Secure Inputs Check-List (“SSICL”) which records checksum information for the inputs to the execution of a Safe Secure Compiler.
        A Safe Secure Bounds Data File (“SSBDF”) which records Requirements and Guarantees for the defined and undefined symbols in one or more corresponding object files, as well as checksum information.
        A Safe Secure Linker (“SSL”) which combines object files and the corresponding Safe Secure Bounds Data Files, producing either fatal link-time diagnostics or a Safe Secure Executable Program.
        A Safe Secure Semantic Analyzer (“SSSA”) which uses the parse tree to determine Requirements and Guarantees.
        A Safe Secure Diagnostic Generator (“SSDG”) which generates fatal diagnostic messages in situations where undefined behavior would result and generates various warning messages to call the programmer's attention to various other situations.
        A Safe Secure Code Generator (“SSCG”) which generates object code which is free from the designated sets of undefined behaviors (including “buffer overflow” and “null pointer indirection”).
        A Safe Secure Pointer Attribute Hierarchy (“SSPAH”) which controls the inference of attributes based upon other attributes.
        A Safe Secure Pointer Attribute Predicate Table (“SSPAPT”) which controls the determination of attributes resulting from predicate expressions.
        A Safe Secure Bounds Data Table (“SSBDT”) which tabulates the Guarantees and Requirements for expressions, sub-expressions, declarations, identifiers, and function prototypes.
        A Safe Secure Interface Inference Table (“SSIIT”) which controls the inference of Requirements on the interface of each externally-callable function.
        A Safe Secure Bounds Data Symbol Table (“SSBDST”) which tabulates the Requirements and Guarantees for defined and undefined symbols during the Safe Secure Linking process.
        A Safe Secure Link-Time Analyzer (“SSLTA”) which matches Requirements to Guarantees for function-call, external array, and external pointer linkage contexts.
        A Safe Secure Link Diagnostic Generator (“SSLDG”) which generates a fatal diagnostic at link-time if any Requirement is unsatisfied; this prevents the production of any executable program.