a. Field of The Invention
This invention relates to data processing systems having provision for executing instructions in an overlapped or pipelined fashion, and for detecting errors associated with that execution. More particularly, this invention relates to data processing systems which must isolate the effects of errors to a specific portion of the instruction stream being executed.
b. Related Art
As data processing systems have increased in speed and complexity, the extent to which the execution of multiple instructions is overlapped has increased. One consequence of this is that a random error occurring in such a system may not be easily isolated to one instruction, and may in fact affect the execution of several instructions. It is known in the art to provide a computer system with a mechanism which will discard and retry all instructions which may have been affected by a given error, and thus to ensure that the error is not reflected in the results of program execution. Such mechanisms are described, for example, in U.S. Pat. No. 4,912,707 to Kogge et al. and U.S. Pat. No. 4,044,337 to Hicks et al. (both assigned to International Business Machines Corporation).
Such retry mechanisms cannot hide the effects of every error, however. For example, a particular problem is posed by errors which occur after an instruction has passed the point at which it can be retried. For this and other cases where retry is not successful in completely hiding the effects of the error from the program, the system will conventionally alert the program about the error via a "machine check" interruption.
When a machine check is presented to it, the program must determine what corrective action to take. Vital to this effort is information provided by the data processing system as to the extent of the damage which may have occurred, with respect to architected facilities and with respect to the instruction stream. In IBM's Enterprise Systems Architecture/390 (ESA/390) this information is communicated via the Machine Check Interruption Code (MCIC) in storage. The ESA/390 Architecture is described in more detail in the document Enterprise Systems Architecture/390--Principles of Operation (document SA22-7201-0, available from International Business Machines Corporation, Armonk N.Y.) which is incorporated by reference herein in its entirety.
Various MCIC bits reflect the validity of different architected registers and facilities, while other bits indicate the severity of the error. Of particular interest to this invention is the distinction between the two most severe types of machine check interrupt:
Instruction-Processing Damage (IPD)
System Damage (SD).
Both the IPD and SD machine check types indicate that the error may have affected the results of instructions which have been completed, as indicated by the instruction address reported with the machine check. In other words, both IPD and SD machine checks indicate that the machine state (storage, registers, other architected facilities) may have been corrupted and thus could be inconsistent with the instructions which have been executed. In certain situations it is impossible or undesirable to continue operation when a machine error occurs. In these cases, the processor is placed in a check-stop state and no machine check interrupt is presented.
The primary difference between the IPD and SD machine check types is in the scope of the failure with respect to the instruction stream. An IPD machine check indicates that the effects of the failure are confined to instructions executed since a particular point in the instruction stream, whereas the effects of an error causing an SD machine check are not confined to any particular instruction stream interval. Thus the program, upon receiving an IPD machine check interruption, can be sure that only the currently executing segment of the instruction stream is affected, and specific corrective action can be taken. On the other hand, when the program sees an SD machine check interruption, it cannot depend upon any of the instruction stream or architected state being correct, and more drastic action will be required, often resulting in the termination of the entire program (i.e. the system "crashes"). It is thus advantageous to isolate the effects of errors in as many cases as possible, to maximize the overall availability of the system (both hardware and software) to its users.
The ESA/390 architecture establishes the degree of isolation required for IPD machine checks by requiring that a checkpoint synchronization operation take place at certain points in the instruction stream. A checkpoint synchronization operation essentially consists of ensuring that all operations associated with instructions from before that point in the instruction stream (hereafter referred to simply as the "checkpoint") are complete prior to proceeding with instructions from beyond the checkpoint. In this context, "complete" means that these operations are no longer capable of causing an IPD machine check. Once the processor has proceeded past the checkpoint, any error associated with an operation from before the checkpoint must be reported as an SD machine check or must cause the processor to checkstop.
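The reporting rule just described can be illustrated with a minimal Python sketch. The names `Report` and `classify_error`, and the use of integer indices for checkpoint intervals, are hypothetical and are not taken from the ESA/390 architecture. The sketch encodes only the rule that an error belonging to an operation from an earlier checkpoint interval can no longer be reported as IPD.

```python
from enum import Enum


class Report(Enum):
    """Possible outcomes for an unrecoverable error (illustrative names)."""
    IPD = "instruction-processing damage"
    SD_OR_CHECKSTOP = "system damage or check-stop"


def classify_error(op_interval: int, current_interval: int) -> Report:
    """Classify an error by comparing checkpoint intervals.

    op_interval:      index of the checkpoint interval that issued the
                      failing operation (assumed numbering).
    current_interval: index of the interval the processor is now in.
    """
    if op_interval > current_interval:
        raise ValueError("operation cannot belong to a future interval")
    if op_interval == current_interval:
        # The processor has not yet passed the next checkpoint, so the
        # damage is confined to the current interval.
        return Report.IPD
    # The processor already passed a checkpoint after the operation.
    return Report.SD_OR_CHECKSTOP
```

For example, `classify_error(3, 3)` yields `Report.IPD`, while `classify_error(2, 3)` yields `Report.SD_OR_CHECKSTOP`.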
These checkpoints are generally defined to occur immediately before and immediately after certain instructions and all interruptions. In some cases the ESA/390 architecture requires checkpoint synchronization only before or after (not both) a particular instruction, but in the preferred embodiment of this invention it is done both before and after the instruction in all cases, and that implementation will be assumed throughout.
Another characteristic of these checkpoints is that they are all also defined in the ESA/390 architecture as serialization points. The serialization architecture relates to interactions between processors with respect to resources shared by those processors, and thus is distinct in purpose from the checkpoint synchronization architecture. However, the commonality of points requiring these operations leads naturally to commonality in an implementation of the architecture. In a conventional implementation of the ESA/390 architecture (and its ESA/370 predecessors), serialization required a complete cessation of overlapped instruction processing, so that no processing for instructions beyond the serialization point was allowed until all processing from before the serialization point had been completed, including all operand stores. Since checkpoints are a subset of serialization points, the same was true for all checkpoints, and the isolation of instruction processing errors to an interval between two checkpoints was generally straightforward. However, in copending U.S. patent application Ser. No. 07/641,987, now U.S. Pat. No. 5,251,354, entitled "Overlapped Serialization" (to Comfort et al.; filed Jan. 16, 1991; assigned to International Business Machines Corporation), a mechanism is described whereby overlapped instruction processing is made possible even around serialization points, consistent with the architectural definition. Given such an implementation of serialization, the fact that all checkpoints are serialization points no longer ensures that there is no instruction processing overlap around checkpoints, and a different mechanism is called for to provide error isolation for checkpoint synchronization.
In many systems conforming to ESA/390 Architecture, the processor implementation is such that the detection and isolation of errors is possible, in most cases, on an instruction basis. That is, most instruction processing errors are detected such that no instructions which may have been affected by a given error are allowed to be completed, and changes to architected facilities are managed such that these may be undone if the instruction making a given change is not completed. Thus, when an error is detected all instructions which have not yet been completed are blocked from being completed, all changes to architected facilities which were to have been made by those instructions are discarded, and those instructions can be refetched and reprocessed with no effect from the initial error. In this way, with respect to errors which occur prior to completion, every instruction (not just those required by the architecture) establishes a checkpoint.
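The discard-and-reprocess behavior described above can be sketched as a toy model in Python. This is an illustration under stated assumptions, not a description of any actual processor: the function `run_with_retry`, the representation of a program as `(destination, function)` pairs, and the `fault_at` parameter for injecting one transient error per listed instruction index are all hypothetical.

```python
def run_with_retry(program, registers, fault_at=None):
    """Execute a program with instruction-level retry.

    program:   list of (dest_register, fn) pairs; fn maps the current
               register dict to the value to be written (assumed format).
    registers: initial architected register values.
    fault_at:  indices of instructions that hit one transient error
               before completion (hypothetical fault injection).
    Returns the final registers and the number of retries performed.
    """
    regs = dict(registers)          # architected state
    faults = set(fault_at or [])
    completed = 0                   # index of next instruction to complete
    retries = 0
    while completed < len(program):
        dest, fn = program[completed]
        staged = fn(regs)           # result is held aside, not yet architected
        if completed in faults:
            # Error detected before completion: drop the staged result,
            # leave architected state untouched, and reprocess.
            faults.discard(completed)
            retries += 1
            continue
        regs[dest] = staged         # completion makes the change visible
        completed += 1
    return regs, retries
```

Because a staged result only becomes architected at completion, an error detected earlier leaves no trace in `regs`, which is the sense in which every instruction acts as a checkpoint for such errors.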
Even with such an implementation, however, there are some errors associated with the processing of a particular instruction which cannot block the completion of that instruction. Specifically, errors occurring during the act of completion itself and during the processing of operand stores done by that instruction fall into this category. Errors during the act of completion itself are generally unrecoverable (i.e. a machine check interrupt will be required), but so long as no instructions are completed beyond a checkpoint which is beyond the point of error, that error will be isolated to the current checkpoint interval as required by the architecture and an IPD machine check is allowed. This may be accomplished by blocking subsequent instruction completion when such an error is detected.
Errors during the processing of operand stores, however, pose a more difficult problem with respect to checkpoint synchronization. This is due to the fact that operand stores are held in a buffer until the storing instruction is completed (to allow the store to be discarded if the instruction is never completed), and only after instruction completion are the stores for an instruction written into the storage arrays, including any storage caches. Thus there may be a considerable delay between completion of an instruction and the processing of its operand stores. During that time, it is desirable to be able to complete subsequent instructions in order to maintain the high degree of overlapped instruction processing needed for a high performance processor. At the same time, it is desirable that the processor not proceed past a subsequent checkpoint while stores from before the checkpoint are still being processed, since that would imply that errors during the processing of stores would not be reliably isolated to a checkpoint interval, and thus that such errors would require an SD machine check or processor checkstop. This invention describes a mechanism for providing the required isolation of errors on operand stores while maintaining a high degree of instruction overlap in such a processor.
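The constraint stated above, namely that ordinary instructions may complete while earlier stores are still pending but the processor should not pass a checkpoint until those stores have been processed, can be sketched in Python as follows. This is a deliberately naive model of the constraint only, and is not the mechanism of the invention. The class `StoreBufferModel`, its method names, and the representation of a store as an `(address, value)` pair are assumptions made for illustration.

```python
from collections import deque


class StoreBufferModel:
    """Toy model of post-completion store processing (illustrative only)."""

    def __init__(self):
        self.pending = deque()   # stores of completed instructions, not yet written
        self.storage = {}        # stands in for the storage arrays and caches

    def complete_instruction(self, stores, is_checkpoint=False):
        """Try to complete an instruction; return False if it must wait.

        A checkpoint may not be passed while earlier stores are pending,
        because an error on one of them could then no longer be isolated
        to its checkpoint interval.
        """
        if is_checkpoint and self.pending:
            return False
        self.pending.extend(stores)
        return True

    def drain_one(self):
        """Write the oldest pending store into storage, if any."""
        if self.pending:
            addr, value = self.pending.popleft()
            self.storage[addr] = value
```

In this model a non-checkpoint instruction always completes immediately, preserving overlap, while a checkpoint waits until `drain_one` has emptied the buffer.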
With any mechanism for checkpoint synchronization, there is a trade-off to be made between system performance and system reliability and availability. Specifically, the point at which the processing of operand stores is deemed "complete" is variable. If this point is chosen earlier in the processing of stores, checkpoints may be deemed "cleared" earlier in time, usually resulting in an improvement in processor performance, but with a concomitant increase in the probability of an error which will require an SD machine check. Conversely, if a point later in the processing of stores is chosen, there will be a longer delay in clearing the checkpoint, with an associated performance penalty, but with a reduced risk of an error which would affect system reliability and availability. This trade-off is independent of the present invention, except insofar as the invention reduces the overall performance cost of checkpoint synchronization.