Conventionally, dynamic instrumentation mechanisms, such as profiling and tracing infrastructure, run in the exception context. Such mechanisms typically work by modifying the text stream at the desired location to induce a software exception and then trapping that exception, which allows the instrumentation code to run and gather the required data. It is especially critical that the instrumentation code be robust and free of errors, and that it not induce any subsequent exceptions, which could cause irreparable damage to the system. This is all the more important when the instrumentation code runs in kernel mode to gather relevant kernel data.
Typically, operating systems define a default exception handler for every exception. When an exception occurs, the operating system saves the current system state (specifically, the registers at the time of the exception) and passes this system state to the system's default exception handler. Under normal circumstances, the default exception handler executes and, on return from the exception handler, the system state is restored from the earlier saved system state. The operating system thus continues its normal execution after handling the exception.
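The save, handle and restore sequence described above can be sketched as a user-space simulation. This is an illustrative sketch only, not an actual operating system entry path: the structure cpu_state, the functions take_exception( ) and default_handler( ), and the vector number used in the usage note are hypothetical names chosen for illustration, and real register layouts are architecture-specific.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical register snapshot; real layouts are architecture-specific. */
struct cpu_state {
    unsigned long pc;
    unsigned long sp;
    unsigned long gpr[4];
};

typedef void (*exception_handler_fn)(int vector, const struct cpu_state *saved);

static int handled_vector = -1;

/* Stand-in for a default handler: records which exception it serviced. */
static void default_handler(int vector, const struct cpu_state *saved)
{
    (void)saved;
    handled_vector = vector;
}

/* Simulated exception entry and return: save, handle, restore. */
static void take_exception(struct cpu_state *live, int vector,
                           exception_handler_fn handler)
{
    struct cpu_state saved;

    assert(live != NULL && handler != NULL);
    saved = *live;                    /* exception entry: save the registers */

    /* Model the handler clobbering the live registers while it runs. */
    memset(live->gpr, 0xff, sizeof live->gpr);
    handler(vector, &saved);          /* saved state is passed to the handler */

    *live = saved;                    /* exception return: restore the registers */
}
```

For example, calling take_exception( ) with a live state and default_handler( ) leaves the live registers unchanged after the call, which models the operating system resuming normal execution once the exception has been handled.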
Instrumentation of the software can be done in various ways, for example:
By hooking the system exception handlers to call into the instrumentation code. In this case the exceptions are not induced, but occur as a normal consequence of program execution (such as page faults). For example, hooking the page-fault exception handler itself makes it possible to run instrumentation code; or
By inducing exceptions, which is done by inserting code in the normal program stream and/or using the platform-provided hardware debug facilities so that an exception is generated when the inserted code is executed.
When such instructions are executed, the exception handler is invoked. This exception handler in turn executes the instrumentation code, which is configured to collect the required information in the exception context.
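The second approach, inducing an exception by patching the program stream, can be illustrated with a toy interpreter. This is a minimal sketch under stated assumptions: the opcodes (OP_INC, OP_DEC, OP_HALT, OP_TRAP), the structure probe and the functions probe_arm( ), instrumentation( ) and run( ) are all hypothetical and do not correspond to any real instruction set or kernel interface. The trap opcode plays the role of the software exception, and the interpreter loop plays the role of the exception handler that calls the instrumentation code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy opcodes for a made-up machine; not a real instruction set. */
enum { OP_INC = 1, OP_DEC = 2, OP_HALT = 3, OP_TRAP = 0xCC };

struct probe {
    size_t addr;            /* patched location in the text stream */
    unsigned char saved_op; /* original opcode displaced by OP_TRAP */
    int hits;               /* data gathered by the instrumentation code */
    int last_acc;
};

/* Patch the text stream so that executing `addr` raises a trap. */
static void probe_arm(struct probe *p, unsigned char *text, size_t addr)
{
    p->addr = addr;
    p->saved_op = text[addr];
    p->hits = 0;
    p->last_acc = 0;
    text[addr] = OP_TRAP;
}

/* Instrumentation code: runs in the (simulated) exception context. */
static void instrumentation(struct probe *p, int acc)
{
    p->hits++;
    p->last_acc = acc;
}

static void execute(unsigned char op, int *acc)
{
    if (op == OP_INC)
        (*acc)++;
    else if (op == OP_DEC)
        (*acc)--;
}

/* Interpreter loop; OP_TRAP plays the role of the software exception. */
static int run(const unsigned char *text, size_t len, struct probe *p)
{
    int acc = 0;

    for (size_t pc = 0; pc < len; pc++) {
        unsigned char op = text[pc];

        if (op == OP_TRAP) {
            assert(p != NULL && pc == p->addr);
            instrumentation(p, acc);  /* handler calls the instrumentation */
            op = p->saved_op;         /* then the displaced opcode executes */
        }
        if (op == OP_HALT)
            break;
        execute(op, &acc);
    }
    return acc;
}
```

In this sketch the instrumented program computes the same result as the unpatched program, while the probe records how often the patched location was reached and the accumulator value at that point.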
FIG. 1A illustrates an embodiment of conventional exception processing 100. Every exception 110 has associated with it a default handler 120 that the operating system runs when the exception 110 occurs. In the conventional case, when an exception 110 occurs, the default exception handler 120 runs and executes specific actions to recover from the exception 110. FIG. 1B illustrates an embodiment of exception processing 101 with instrumentation code. A program text 105, such as a set of instructions, is executed on a system. Most instrumentation code 130 runs off the system exception handler 120. Given that the system state is provided as input at the entry to the exception handler 120, the instrumentation code 130 also has access to the system state. The exception entry stage and the return-from-exception stage occur immediately before and immediately after the system exception handler 120 is executed, respectively.
FIG. 1C illustrates an embodiment of exception processing 102 using setjmp( ) and/or longjmp( ) trampolines as in the prior art. Here, the function calls or trampolines setjmp( ) 125 and/or longjmp( ) 150 are used to try to recover from nested exceptions. When the program 105 is instrumented, a first exception occurs and is handled by the first system exception handler 120, which preferably represents a known system state. The trampoline setjmp( ) 125 is assigned to the first system exception and is configured to save the register context, after which the instrumentation code 130 is executed. Under normal circumstances, the instrumentation code 130 executes correctly and returns to the trampoline setjmp( ) 125, from where it is possible to return to the first system exception handler 120.
If the instrumentation code 130 generates a further exception, as is typically encountered in the case of a nested exception, the entry is recognized as being due to an error 140 in the instrumentation code 130 (caused, for example, by bugs in the instrumentation code 130) that occurred while already in the exception context of the first system exception handler 120. The trampoline longjmp( ) 150 is therefore executed to jump to the known sane system state in the first system exception handler 120. This leads to a situation where the number of exception entries is greater than the number of exception returns, so that the exception stack is not offset correctly because of the unbalanced entry/return from exceptions. The resulting stack corruption and/or return-from-interrupt exceptions may lead to incorrect system operation down the line.
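The imbalance between exception entries and exception returns can be made concrete with a user-space sketch built on the standard C setjmp( ) and longjmp( ) functions. This is an illustration only and not kernel code: the counters exception_entries and exception_returns and the functions first_handler( ), nested_exception( ), faulty_probe( ) and clean_probe( ) are hypothetical names, and exception entry and return are modelled simply by incrementing counters.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf trampoline_ctx;   /* context saved by the setjmp() trampoline */
static int exception_entries;    /* simulated exception entries */
static int exception_returns;    /* simulated exception returns */

/* Simulated nested exception raised from inside the instrumentation code. */
static void nested_exception(void)
{
    exception_entries++;          /* second entry, taken in exception context */
    longjmp(trampoline_ctx, 1);   /* jump back to the saved state; this call
                                     never returns, so the return that would
                                     match this entry is skipped */
}

static void faulty_probe(void) { nested_exception(); }
static void clean_probe(void)  { }

/* First exception handler with a setjmp() trampoline around the probe.
 * Returns 1 if the probe faulted and was abandoned, 0 otherwise. */
static int first_handler(void (*probe)(void))
{
    int faulted = 0;

    exception_entries++;          /* entry for the first exception */
    if (setjmp(trampoline_ctx) == 0)
        probe();                  /* normal path: the probe returns here */
    else
        faulted = 1;              /* error path: arrived via longjmp() */
    exception_returns++;          /* only one return is ever executed */
    return faulted;
}
```

With clean_probe( ) the counters stay balanced at one entry and one return. With faulty_probe( ) a single call to first_handler( ) records two entries but only one return, which is the unbalanced condition described above.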
Without a method for recovering the system from nested kernel exceptions and bringing it back to a sane state during instrumentation, the promise of this technology may never be fully achieved.