The present invention generally relates to fault handling in data processing systems, and more specifically to preserving the ability to obtain a valid dump printout for analysis during certain operations, most particularly after the occurrence of a fault-on-fault condition, and also to increasing the chances that a usable dump can be obtained and a full system restart avoided after processing a fault-on-fault.
In a typical data processing system, input and output completions are signaled by interrupts. This concept was extended to cover other external as well as internal events. Herein, a distinction will be made between responding to external events, herein termed “interrupts”, and responding to internal events, herein termed “exceptions” or “faults”. It should be noted that the distinction between interrupts and exceptions or faults is somewhat arbitrary, as some architectures do not make such a distinction.
An exception then is the happening of an internal event within a computer within a data processing system. Exception handling is the action taken by a computer processor to respond to the exception. Some typical exceptions are page faults, zero divide, supervisory call, illegal instruction, privileged instruction (when not in a mode allowing execution of such), security violations, timer or decrementer expiration, and traps. Other exceptions are within the ambit of this disclosure.
Typically, exception handling or exception processing involves diverting control or instruction flow from where the computer processor was executing prior to the exception to an exception handling routine. Typically again, there will be a different exception handling routine for each exception type and even subtype. The exception handling routines are typically a portion of the operating system controlling each computer processor in the data processing system. The exception handling routine for a given exception will typically be programmed to determine how to handle a particular exception type. For example, a task that attempts to execute a privileged instruction, commits a security violation, or performs a zero divide will typically be aborted by the operating system, after providing for the possibility of dumping the job containing the task. On the other hand, in the case of a page fault, the operating system will typically suspend the task causing the page fault, initiate reading the requested page of memory from disk, and dispatch another task to execute. The task causing the page fault will be re-dispatched later, after the missing page has been retrieved from disk. In the case of expiration of a timer, the executing task is placed on a dispatch queue, and another task is dispatched.
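The three policies described above (abort, suspend pending a page read, and requeue on timer expiration) can be sketched as a small dispatcher. This is only an illustration: the class `ToyOS`, the enum `Fault`, and the queue names are hypothetical and do not come from any operating system named in this disclosure.

```python
from collections import deque
from enum import Enum, auto


class Fault(Enum):
    PRIVILEGED_INSTRUCTION = auto()
    SECURITY_VIOLATION = auto()
    ZERO_DIVIDE = auto()
    PAGE_FAULT = auto()
    TIMER_EXPIRED = auto()


class ToyOS:
    """Hypothetical scheduler state, just enough to show the three policies."""

    def __init__(self):
        self.ready = deque()   # dispatch queue of runnable tasks
        self.waiting = set()   # tasks suspended until their page arrives
        self.aborted = []      # tasks terminated by the operating system

    def handle(self, fault, task):
        """Apply the policy for `fault` to `task`; return the next task to run, or None."""
        if fault in (Fault.PRIVILEGED_INSTRUCTION,
                     Fault.SECURITY_VIOLATION,
                     Fault.ZERO_DIVIDE):
            self.aborted.append(task)   # a dump of the job could be taken here
        elif fault is Fault.PAGE_FAULT:
            self.waiting.add(task)      # the page read from disk would start here
        elif fault is Fault.TIMER_EXPIRED:
            self.ready.append(task)     # back of the dispatch queue
        return self.ready.popleft() if self.ready else None

    def page_arrived(self, task):
        """The missing page is in memory: make the task dispatchable again."""
        self.waiting.discard(task)
        self.ready.append(task)
```

In this sketch the faulting task is never the one returned unless it has been requeued and reaches the front of the dispatch queue, which mirrors the suspend-and-dispatch behavior described for page faults and timer expiration.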
It should be noted here that the above mechanisms require that the exception handler save the current execution environment in the computer processor so that it can be returned to at some later time. Upon completion of exception processing for a given exception, control is returned to the saved environment, typically at either the instruction causing the exception (for example in responding to a page fault), or at the next instruction after that instruction (for example in responding to a supervisor request). Indeed, this mechanism is the fundamental method used by the dispatcher in a modern operating system to accomplish dispatching of tasks. Partly this is done through the fairly complete control over the information in the saved environment of a task that the operating system has.
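The choice between resuming at the faulting instruction and resuming at the following one can be illustrated with a minimal sketch. The names `SavedEnvironment` and `resume_environment`, and the fixed instruction length of one word, are assumptions made for the example only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SavedEnvironment:
    """Hypothetical snapshot of a task's state taken on entry to a handler."""
    instruction_address: int
    registers: tuple


def resume_environment(saved, retry_instruction, instruction_length=1):
    """Return the environment to reload when the handler finishes.

    retry_instruction=True  -> re-execute the faulting instruction (e.g. page fault)
    retry_instruction=False -> continue at the next instruction (e.g. supervisor request)
    """
    if retry_instruction:
        return saved
    return replace(
        saved,
        instruction_address=saved.instruction_address + instruction_length,
    )
```

Because the operating system owns the saved environment, a dispatcher could just as well reload a different task's saved environment at this point, which is the sense in which the same mechanism underlies task dispatching.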
Since exception handling is typically part of the operating system controlling a data processing system, and since exception handling routines typically require almost full control of the computer processor, including the ability to execute privileged instructions, and to read and write almost all memory, exception handling routines will typically be entered with the highest possible privilege level. Typically this means that exception handling will be entered in a pre-specified maximum security mode.
In order for a computer processor to respond to an exception, it must be aware of the location of the appropriate exception processing routine. In some data processing systems, such as GCOS® 8 from the assignee of this invention, the entry descriptor for a general exception or fault handling routine is retrieved from a specified location (octal 032) in memory and evaluated. The entry descriptor specifies the environment for the exception processing routine, including which segments are visible, the routine starting address, and what privileges to enable. It is treated by the computer processor almost like an ICLIMB subroutine call, laying down a Safe Store Stack Frame containing the saved environment. An OCLIMB instruction can be later executed to return control back to the location of the exception or fault. Within the fault handling routine (titled “Fault”), a determination is made as to the fault (or exception) code causing the exception. This then is used to invoke the appropriate exception processing routine for that type of fault, again with an “ICLIMB” instruction.
Other mechanisms are typically used in less secure data processing systems. For example, in the Intel X86 architecture, there is a fault or exception vector stored at a specified location in memory containing a number of exception handling routine addresses. When an exception occurs, control is transferred to the address at the specified location in the exception vector corresponding to that exception type. As noted above, the environment of the exception handling is automatically set to a pre-specified maximum security state. Most of the environmental saving and restoring required is done by general purpose instructions that store and later load processor registers.
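A vector-based lookup of this kind can be sketched as follows. The layout is deliberately simplified and is an assumption for illustration: a flat table of 4-byte little-endian handler addresses starting at `VECTOR_BASE`, which is not the exact descriptor format of any particular X86 operating mode.

```python
VECTOR_BASE = 0x0000   # assumed fixed, low physical address of the vector
ENTRY_SIZE = 4         # assumed size in bytes of one handler address


def vector_slot(exception_number):
    """Address of the vector entry for the given exception number."""
    return VECTOR_BASE + exception_number * ENTRY_SIZE


def handler_address(memory, exception_number):
    """Read the handler address that control would be transferred to."""
    slot = vector_slot(exception_number)
    return int.from_bytes(memory[slot:slot + ENTRY_SIZE], "little")


def install_handler(memory, exception_number, address):
    """Store a handler address into the vector (done by the OS at startup)."""
    slot = vector_slot(exception_number)
    memory[slot:slot + ENTRY_SIZE] = address.to_bytes(ENTRY_SIZE, "little")
```

The sketch also makes the fragility of the scheme visible: any stray write into the first few bytes of `memory` silently changes where the processor goes on the next exception.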
Somewhat more sophisticated is the exception processing in a Motorola or IBM PowerPC® processor environment. Instead of having an exception (or fault) vector containing addresses of exception handling routines, the exception handling routine for each exception type begins execution, in response to the occurrence of that exception, at the first word of a block of memory at a specified location in memory. Each exception type has its own block of memory starting at its own specified location. The PowerPC architecture contains two enhancements in sophistication over the X86 architecture discussed above. First, instead of one set of exception handling routines or one exception vector, there are two. The selection of which of the two to utilize is determined by a static bit in a reserved status register in each computer processor. Typically, one set of exception routines is utilized at system startup. The bit is then toggled, and the other set of exception routines is utilized thereafter. Second, instead of always initiating exception processing with the same high security environment, the PowerPC architecture specifies slightly different processing environments for the start of exception processing for different exception types.
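The selection between the two sets of exception routines can be sketched as a base address chosen by one status bit, plus a fixed per-exception offset. The two base addresses and the offsets below are modelled on classic 32-bit PowerPC implementations and are included only as illustrative assumptions; the text above does not specify them.

```python
# Assumed base addresses for the two sets of exception routines.
PREFIX_LOW = 0x0000_0000
PREFIX_HIGH = 0xFFF0_0000

# Assumed fixed offsets of each exception type's block of memory.
BLOCK_OFFSETS = {
    "machine_check": 0x200,
    "program": 0x700,
    "decrementer": 0x900,
    "system_call": 0xC00,
}


def handler_entry(exception, prefix_bit):
    """Address of the first word executed for `exception`.

    `prefix_bit` stands for the static status-register bit that selects
    which of the two sets of exception routines is in use.
    """
    base = PREFIX_HIGH if prefix_bit else PREFIX_LOW
    return base + BLOCK_OFFSETS[exception]
```

Unlike a vector of addresses, nothing is read from memory to find the handler here: the entry point is computed, so only the handler code itself, not a table of pointers, is exposed to being overlaid.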
Other data processing system architectures utilize similar mechanisms to the above.
There are problems with all of the above mechanisms. One problem with the GCOS 8 mechanism disclosed above is that it requires the equivalent to two ICLIMB instructions to enter the appropriate fault or exception handling routine, and two OCLIMB instructions to return. These are some of the most expensive instructions in the GCOS 8 processor instruction repertoire to execute in terms of computer instruction cycles, typically taking over 100 cycles each to execute. Thus, it would be preferable to be able to perform fault processing more efficiently, with the expenditure of fewer instruction cycles.
Both the X86 and PowerPC approaches suffer from being unable to automatically fine tune the processor environment to the exception type being processed. Thus, with the minor exceptions noted above for the PowerPC architecture, all exception handling in both architectures begins execution in the identical processor environment. This means that the same memory is visible to all fault handling routines, and that most (PowerPC) or all (X86) of the same processor privileges are in effect.
One problem that is common to all three approaches or mechanisms is that in certain instances, the exception vector or exception handling routines are mistakenly overlaid by other data. This is compounded because these are typically in physical memory with low fixed addresses. In the X86 environment, given its minimal security, this overlaying happens frequently. However, even in the most secure operating system, such as GCOS 8, it still happens. One major cause of this is issuance of erroneous input/output (I/O) requests.
The problem that this causes is that exception processing will thereafter fail: either the processor is unable to find the required exception processing routines or, if it can find them, it cannot execute them because they have been overwritten. This sort of problem is often hard to diagnose, since one of the functions that can result from exception processing is the generation of a dump of the processor and its memory. No exception processing typically means no dump. One advantage of the higher security GCOS 8 architecture is that overlaying of the entry descriptor for the fault handler is easily detected, as the overlaid data typically no longer forms a valid entry descriptor.
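Why an overlaid entry descriptor tends to be detectable can be shown with a minimal sketch. The 8-byte layout used here (a type code, a privilege byte, two reserved bytes, and a 32-bit entry offset) and the constant `ENTRY_TYPE` are hypothetical and are not the actual GCOS 8 descriptor format.

```python
import struct

ENTRY_TYPE = 0x0B                 # hypothetical type code for an entry descriptor
DESCRIPTOR_FORMAT = ">BBxxI"      # type, privileges, 2 reserved bytes, entry offset


def parse_entry_descriptor(raw, segment_size):
    """Return (privileges, entry_offset), or None if `raw` is not a valid descriptor."""
    if len(raw) != struct.calcsize(DESCRIPTOR_FORMAT):
        return None
    type_code, privileges, entry_offset = struct.unpack(DESCRIPTOR_FORMAT, raw)
    if type_code != ENTRY_TYPE:
        return None               # arbitrary data rarely carries the right type code
    if entry_offset >= segment_size:
        return None               # entry point must lie inside the handler segment
    return privileges, entry_offset
```

Arbitrary I/O data written over the descriptor will usually fail one of these checks, so the processor can recognize that the descriptor is no longer usable rather than transferring control to a meaningless address.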
When a computer processor causes an exception or fault while processing an exception or fault, it is termed here “fault-on-fault”. In the prior art, this typically ultimately resulted in halting the computer processor, if not explicitly, at least implicitly. In the above scenario, when either the exception vector, or the exception processing routines, are overlaid, even when exceptions are prioritized, the processor will ultimately end up attempting to process some exception while in the process of processing that very same exception. For example, if the exception handling routines have been overlaid, then the processor will (hopefully) recognize an illegal instruction exception while executing code in the overlaid area. If this in turn results in attempting to execute code in the overlaid area, recovery is impossible.
The GCOS 8 architecture does provide a partial solution to the “fault-on-fault” problem outlined above. When a program fault or exception is detected during fault processing, a second fault or exception handling routine is invoked, instead of the first one described above. It is entered by loading and evaluating a second entry descriptor located at another specified location in memory. However, this is not a complete solution, since it sometimes happens that the same situation that resulted in the second fault (the “fault within fault”) also resulted in either the entry descriptor for the second fault handler being overlaid, or the code for the second fault handler itself being overlaid.
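The two-level arrangement, and the reason it is only a partial solution, can be sketched by modelling faults as Python exceptions. The names `deliver_fault` and `ProcessorHalt` are hypothetical, and the handlers are ordinary callables standing in for the routines reached through the first and second entry descriptors.

```python
class ProcessorHalt(Exception):
    """Stands in for the processor halting because no handler could run."""


def deliver_fault(fault, primary_handler, backup_handler):
    """Run the primary handler; on a fault-on-fault, fall back to the backup."""
    try:
        return primary_handler(fault)
    except Exception as nested_fault:        # fault raised during fault processing
        try:
            return backup_handler(nested_fault)
        except Exception as final_fault:     # the backup handler is damaged as well
            raise ProcessorHalt("backup fault handler also faulted") from final_fault
```

If the event that damaged the primary handler also damaged the backup handler or its entry descriptor, the final branch is taken and no dump is produced, which is the remaining gap described above.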
The fault handling procedures set forth in the above-identified related patent applications provide significant improvements in the art of fault handling in fault tolerant data processing systems. However, conditions remained in which it was impossible to obtain a valid dump to provide insight into a system failure, particularly those caused by software errors. The present invention serves to significantly enhance the chances that a valid dump can be obtained when a fault-on-fault condition occurs with the additional facility that the dump can be rendered automatic and can lead to an operating system restart rather than the need for a full system boot requiring direct operator intervention.