The occurrence of random errors in data processing systems has a substantial adverse effect on the reliability of a system. If a random error, i.e., one which is not due to a continuing fault in the system, occurs near the end of the execution of a lengthy program, it is necessary to restart the program and run it again from the very beginning. Because this was often an unacceptable solution, programs were segmented into distinct parts, and the occurrence of an error in one of the parts required that the program be rerun only from the beginning of the part in which the failure occurred. Even this approach became inadequate as computing speed increased and data processing systems became more complex and time dependent.
One of the aspects which complicates error recovery techniques is the use of memory cache systems. The speed of processors has increased to the point where economically feasible main memory systems are too slow to keep up. To improve memory performance without the cost of having all memory circuits match the speed of the processor, intermediate, high speed, storage systems have been introduced. Such systems, called caches, operate at close to the speed of the processor and have very limited storage capacity. Because of the limited capacity, it is necessary to carefully allocate the capacity to the data which is most likely to be used by the processor.
Various techniques have been developed to allocate the cache and control the data which is stored therein in accordance with general requirements for data processing systems. Cache systems can be classed as store-through systems, in which revised data placed in the cache is also placed in the main storage unit; as store-in systems, in which revised data placed in the cache is not placed in main storage until it is cast out, and in which a cache miss requires the data to be read from storage into the cache, from which it is accessible to the processor; and as hybrid systems, which incorporate some of the aspects of each. In some cases, specialized adaptations have been made to accommodate particular requirements for more advanced systems. In particular, attempts have been made to accommodate the requirements of data processing systems which utilize more than one processing unit. While these approaches have been effective from the standpoint of improving cache efficiency, they have not enhanced, and in some cases have even reduced, the ability of the system to recover from a soft error. Since an improvement in system performance in one area may be illusory if it is accompanied by a degradation of performance in another area, there has been a continuing need to integrate cache management and operation with the error recovery techniques in multiple processor systems, in particular with that class of error recovery termed checkpoint retry.
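The difference between the store-through and store-in policies can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from any particular system: the class names, the one-value-per-line cache model, the capacity of two lines, and the oldest-first cast-out rule are all hypothetical.

```python
class MainStorage:
    """Slow backing store; counts writes so the two policies can be compared."""
    def __init__(self):
        self.data = {}
        self.writes = 0

    def read(self, addr):
        return self.data.get(addr, 0)

    def write(self, addr, value):
        self.data[addr] = value
        self.writes += 1


class StoreThroughCache:
    """Every processor store updates the cache and main storage together."""
    def __init__(self, storage):
        self.storage = storage
        self.lines = {}

    def load(self, addr):
        if addr not in self.lines:            # cache miss: fetch from storage
            self.lines[addr] = self.storage.read(addr)
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.storage.write(addr, value)       # storage is always current


class StoreInCache:
    """Stores stay in the cache; storage is updated only on cast-out."""
    def __init__(self, storage, capacity=2):  # capacity is an illustrative value
        self.storage = storage
        self.capacity = capacity
        self.lines = {}                       # insertion order stands in for age
        self.dirty = set()                    # lines changed since they were fetched

    def _make_room(self):
        if len(self.lines) >= self.capacity:
            victim = next(iter(self.lines))   # oldest line is cast out
            value = self.lines.pop(victim)
            if victim in self.dirty:          # only changed lines are written back
                self.storage.write(victim, value)
                self.dirty.discard(victim)

    def load(self, addr):
        if addr not in self.lines:            # cache miss: fetch from storage
            self._make_room()
            self.lines[addr] = self.storage.read(addr)
        return self.lines[addr]

    def store(self, addr, value):
        if addr not in self.lines:
            self._make_room()
        self.lines[addr] = value
        self.dirty.add(addr)                  # storage is now stale for this line
```

In this sketch, main storage behind the store-in cache holds stale data for any changed line until that line is cast out, which is the property that makes error recovery differ between the two cache types.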
While so-called multiprocessor systems include multiple processing capability, and to this extent may have discrete processors which may be assigned to different tasks or to different portions of the same task to achieve a parallel processing capability, not all elements of the system are replicated for each processor. For example, since the different processors may be required to operate on common data, each processor must have access to a common storage unit. For the same reason, each processor must be able to reach various input/output devices. Little-used features of the system, or those which can be shared without substantial loss of performance, are usually not replicated for each processor. An example of this type of feature is the error recovery facility. Random errors are not likely to occur in more than one processor at a time. Since recovery from a random error does not require a substantial period of time, the recovery can be completed before an error occurs in another of the processors, and a single error recovery facility is therefore sufficient.
All of this would be quite simple if all the processors were the same type, particularly with regard to the type of cache used. Existing error recovery techniques work quite well with multiprocessing systems in which the processors utilize store-in cache. Typically, the cache data is saved in a backup array before a cache write operation. In a checkpoint retry operation, the cache data at the retry point is restored from the backup array. One drawback of this approach is that every write to the cache is delayed by the need to first write the existing cache data to the backup array. Additionally, and perhaps more significantly, in a multiprocessor system a cache locking mechanism must be provided to maintain exclusive status of the cache data during a checkpoint period.
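The backup-array approach described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the mechanism of any specific machine: the class and method names are hypothetical, the cache is modeled as a dictionary, the old value is saved only on the first write to a line after a checkpoint, and the multiprocessor locking requirement is not modeled at all.

```python
class CheckpointedCache:
    """Store-in cache model that can roll back to the last checkpoint."""

    _ABSENT = object()   # marks a line that was not in the cache at the checkpoint

    def __init__(self):
        self.lines = {}    # addr -> current cache data
        self.backup = {}   # addr -> data held at the last checkpoint

    def checkpoint(self):
        """Start a new checkpoint period; earlier backups are no longer needed."""
        self.backup.clear()

    def store(self, addr, value):
        # The old data must be saved BEFORE the cache is written. This extra
        # step is the write delay named as a drawback in the text.
        if addr not in self.backup:
            self.backup[addr] = self.lines.get(addr, self._ABSENT)
        self.lines[addr] = value

    def retry(self):
        """Restore the cache contents as they were at the checkpoint."""
        for addr, old in self.backup.items():
            if old is self._ABSENT:
                self.lines.pop(addr, None)
            else:
                self.lines[addr] = old
        self.backup.clear()
```

A short usage example: after `store(1, 'a')`, `checkpoint()`, `store(1, 'b')` and `store(2, 'c')`, a call to `retry()` leaves the cache holding only line 1 with the value `'a'`.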
Other forms of existing error recovery techniques are satisfactory for use with multiprocessing systems in which the processors utilize store-through cache. However, data processing systems are commonly designed as a "family" and can be configured with various types of storage, I/O devices and even different types of processors. For example, one "member" of the family can use a multiprocessing system in which processors use store-in cache. Other "members" of the same family can be configured to use processors which incorporate store-through cache. Regardless of the election to use store-in or store-through cache, it is desirable from the standpoints of cost, manufacturing, compatibility, and maintenance to have the other elements of the system be the same throughout the entire family.
It is desirable to have a family of data processing systems which incorporates models having processors with store-in cache, as well as models with processors utilizing store-through cache. For the reasons mentioned above, it is desirable to have a single error recovery facility for the entire family.
Recovery from an error in a data processing system requires that the system be restarted from a point in the instruction sequence which preceded the error, and at which the data is therefore not affected by the error. Ideally, this would be the instruction which immediately preceded the instruction in which the error occurred. This would require the system to store and retain all the relevant data for each instruction, an intolerable overhead burden. In practice, the overhead is reduced by retaining the data only at selected points which may be spaced several instructions apart. The data saved at each selected point will usually consist primarily of the program status word (PSW), the content of the floating point registers (FPRs), the access registers (ARs), and the general purpose registers (GPRs).
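A minimal sketch of saving processor state at spaced checkpoints is given below. The names `ProcessorState` and `CheckpointRetry` are hypothetical, the register counts are illustrative only, and a fixed interval of four instructions is assumed purely to keep the example small; the text says only that the selected points may be several instructions apart.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ProcessorState:
    """State named in the text; register counts here are illustrative only."""
    psw: int = 0
    gprs: list = field(default_factory=lambda: [0] * 16)
    ars: list = field(default_factory=lambda: [0] * 16)
    fprs: list = field(default_factory=lambda: [0.0] * 4)


class CheckpointRetry:
    """Saves the processor state every `interval` instructions (assumed policy)."""

    def __init__(self, state, interval=4):
        self.state = state
        self.interval = interval
        self.since_checkpoint = 0
        self.saved = copy.deepcopy(state)    # initial checkpoint

    def instruction_completed(self):
        """Call after each instruction; takes a checkpoint when one is due."""
        self.since_checkpoint += 1
        if self.since_checkpoint >= self.interval:
            self.saved = copy.deepcopy(self.state)
            self.since_checkpoint = 0

    def recover(self):
        """Replace the live state with a copy of the last checkpoint."""
        self.state = copy.deepcopy(self.saved)
        self.since_checkpoint = 0
        return self.state
```

A larger interval lowers the cost of saving state but means more instructions must be re-executed after an error, which is the trade-off the paragraph above describes. After `recover()`, callers should read the state through the `state` attribute, since the earlier object is no longer the live one.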
Another troublesome problem with cache systems involves the ultimate matching to the speed of the main memory. The high speed cache satisfies the need for data at a speed comparable to that of the processing unit. It does not, however, solve the problem of reading data from and writing data to memory at relatively slow speeds. The prior art solution has been to provide a store buffer, which can hold a single line of cache data, between the cache and main storage. By transferring data to and from memory through this buffer, the cache is freed up and can work at a higher speed.
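The role of a single-line store buffer can be sketched as follows. The sketch covers only the store direction and does not model timing; the class name, the dictionary used for main storage, and the `stalled` return value are assumptions introduced for illustration.

```python
class SingleLineStoreBuffer:
    """Holds at most one cache line on its way to main storage."""

    def __init__(self, storage):
        self.storage = storage    # dict: line address -> line data
        self.pending = None       # at most one (address, data) pair

    def accept(self, line_addr, data):
        """Take a line from the cache.

        Returns True when the buffer was still occupied and had to be drained
        first, which is the case in which the cache would have to wait.
        """
        stalled = self.pending is not None
        if stalled:
            self.drain()
        self.pending = (line_addr, data)
        return stalled

    def drain(self):
        """Complete the slow write of the buffered line to main storage."""
        if self.pending is not None:
            addr, data = self.pending
            self.storage[addr] = data
            self.pending = None
```

Because the buffer holds only one line, a second cast-out that arrives before the first has been written still makes the cache wait, which this model reports through the return value of `accept`.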