As personal computers and workstations have become more and more powerful, makers of mainframe computers have undertaken, in order to stay viable in the marketplace, to provide features which cannot readily be matched by these smaller machines. One such feature may be broadly referred to as fault tolerance, which means the ability to withstand and promptly recover from hardware faults and other faults without the loss of crucial information. The central processing units (CPUs) of mainframe computers typically have error and fault detection circuitry, and sometimes error recovery circuitry, built in at numerous information transfer points in the logic to detect and characterize any fault which might occur.
The CPU(s) of a given mainframe computer comprise many registers logically interconnected to execute the repertoire of instructions characteristic of the CPU(s). In this environment, the achievement of genuinely fault tolerant operation, in which recovery from a detected fault can be instituted at a point in a program immediately preceding the faulting instruction/operation, requires that one or more recent copies of all the software visible registers (and supporting information also subject to change) be maintained and constantly updated. This procedure is typically carried out by reiteratively sending copies of the registers and supporting information (safestore information) to a special, dedicated memory or memory section.
When a fault occurs and analysis determines that recovery is possible, the safestore information is used to reestablish the software visible registers in the CPU with the contents they held shortly before the fault occurred, so that restart can be instituted or attempted from the corresponding place in program execution.
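The safestore and restart cycle described above can be illustrated with a minimal sketch. It is a software toy model under stated assumptions, not the hardware mechanism itself: the register names (A, Q, IC), the Safestore class, the run function, and the fault_at parameter are all hypothetical, and only a single safestore frame is kept, whereas one or more recent copies may be maintained in practice.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Cpu:
    """Toy CPU holding only software visible registers (names are hypothetical)."""
    registers: dict = field(default_factory=lambda: {"A": 0, "Q": 0, "IC": 0})


class Safestore:
    """Stand-in for the dedicated memory that holds the latest register copy."""

    def __init__(self):
        self._frame = None

    def save(self, cpu):
        # Copy rather than alias, so later register changes cannot alter the frame.
        self._frame = deepcopy(cpu.registers)

    def restore(self, cpu):
        if self._frame is None:
            raise RuntimeError("no safestore frame available")
        cpu.registers = deepcopy(self._frame)


def run(cpu, program, safestore, fault_at=None):
    """Execute (register, amount) steps, safestoring before each one.

    fault_at simulates a single transient fault at that instruction index.
    """
    faulted = False
    while cpu.registers["IC"] < len(program):
        safestore.save(cpu)                  # state immediately preceding the step
        index = cpu.registers["IC"]
        reg, amount = program[index]
        cpu.registers[reg] += amount
        if index == fault_at and not faulted:
            faulted = True
            cpu.registers[reg] = -1          # simulate a corrupted result
            safestore.restore(cpu)           # reestablish the pre-fault registers
            continue                         # retry the faulting instruction
        cpu.registers["IC"] = index + 1
    return cpu.registers


result = run(Cpu(), [("A", 5), ("Q", 2), ("A", 1)], Safestore(), fault_at=1)
```

In this sketch the frame is taken before every instruction, so the retry resumes at the faulting instruction with the register contents that preceded it, and the final register state matches a run in which no fault occurred.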
The logical design of modern CPUs, particularly mainframes, is enormously complex. Inevitably, logic design errors are introduced as the design process proceeds. If the specific hardware in which a design error is discovered is still in development, the error can simply be corrected, sometimes with appropriate changes in firmware. However, if the faulting condition occurs so rarely and is so elusive that it is only discovered after systems have been installed for commercial and/or other field operation, the correction of the hardware/firmware (for example, by replacing an integrated circuit having the design error with one in which the error has been corrected) can be time consuming. Similarly, if a rarely occurring hardware fault is discovered during development, there may be good reason, such as meeting delivery schedules, to forgo any immediate attempt to effect a definitive hardware/firmware correction. In both instances, a conventional, and generally effective, prior art approach has been to set up the CPU firmware to detect faults and refer them to a fault processing module written into the operating system.
There are, however, drawbacks to this approach. When design errors are discovered, the resolution process for the resulting fault must be incorporated into the fault handling module of the operating system itself. Not only can this be a formidable task, but the revisions to the operating system in all the systems in existence can be disruptive of normal operation. Further, some mainframe CPUs are configured to run under a plurality of operating systems. This requires changes to the fault processing module of each operating system which can be accommodated by the CPUs. Still further, certain system design errors are often worked out, even after commercialization, as very large scale integrated circuits are modified and the chips changed out in individual installations. As a result, a feature introduced into the operating system(s) to handle a problem which no longer exists may adversely affect performance and certainly increases the amount of code in the operating system. It is to the solution of these related problems that the present invention is directed.