1. Field of the Invention
The present invention generally relates to data processors having error detection capability and, more particularly, to the provision of automatic recovery from transient errors, including error correction, in data processors generally, including mainframes as well as data processors suitable for use in personal and notebook computers and work stations.
2. Description of the Prior Art
Due to the sheer numbers of components present in computers and data processors, the possibility of errors arising in signals representing either data or control commands, including processor status data, has been of great concern to designers of computer systems. An undetected error in processing of data can result in propagation of erroneous data each time a further operation is performed on it or any processed data derived from the erroneous signal. An error in a command or status code can result in even more rapid propagation of both corrupted data and corruption of good data by incorrect processing. Efforts to minimize or contain the effects of malfunctions have included efforts to reduce their frequency of occurrence, failure detection techniques and computer designs which allow the computer to continue operation after a transient failure has occurred or even while a failure is continuing. While the frequency of failures has been reduced by several orders of magnitude over the last few decades, the miniaturization of many components such as dynamic random access memory (DRAM) cells, which can be discharged or erased by energetic particles, continues to make error detection and recovery a major concern to computer system designers.
A variety of techniques in logic design have been used to deal with data processor failures and the problems which may be engendered thereby. An early technique for detection of dropped bits was the provision of parity coding of data in which a bit was set at logical "1" or "0" in dependence on the number of bits in a data string of fixed length (e.g. a "byte") which were of a particular logical value.
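By way of illustration only, the parity coding described above may be sketched as follows. The sketch assumes an 8-bit byte and an even-parity convention; the function names `parity_bit` and `passes_parity_check` are hypothetical and are not drawn from any particular prior art system.

```python
def parity_bit(byte: int) -> int:
    """Return the even-parity bit for an 8-bit value.

    The bit is 1 when the byte holds an odd number of 1 bits, so that
    the byte plus its parity bit always holds an even number of 1 bits.
    (Even parity is an assumption; odd parity works the same way.)
    """
    return bin(byte & 0xFF).count("1") % 2


def passes_parity_check(byte: int, stored_parity: int) -> bool:
    """Recompute parity and compare it with the stored parity bit."""
    return parity_bit(byte) == stored_parity


data = 0b10110010                # four 1 bits
stored = parity_bit(data)        # parity bit stored alongside the data

single_error = data ^ 0b00001000  # one bit changed in storage or transit
double_error = data ^ 0b00000011  # two bits changed

print(passes_parity_check(data, stored))          # uncorrupted data
print(passes_parity_check(single_error, stored))  # single-bit error
print(passes_parity_check(double_error, stored))  # double-bit error
```

As the last case suggests, a single parity bit detects any odd number of changed bits but cannot detect an even number of changed bits, and it cannot identify which bit is in error.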
Another technique which is currently familiar in correcting data errors (but not control code errors, due to the number of bits required) is the use of error correcting codes (ECC), which employ a plurality of additional bits in order to allow determination of the particular bit in a code segment which has erroneously changed. Essentially, error correcting codes use a parity bit for each of a plurality of different groups of bits in the code segment or byte so that a particular bit location can be isolated and corrected. These kinds of coding allow error detection and make it possible for processor operation to continue even while the failure exists since the data can be corrected. However, in view of the amount of storage which may be required for parity bits and error correction codes as well as the circuitry to implement error detection and correction within the processor, full utilization of these techniques has generally been limited to mainframe computers.
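The principle of using a parity bit for each of several overlapping groups of bits may be illustrated by the following minimal sketch. It assumes a Hamming(7,4) code, chosen here only because it is a compact, well-known single-error-correcting code; the text above does not name a specific code, and the function names are hypothetical.

```python
def hamming74_encode(nibble: int) -> list[int]:
    """Encode 4 data bits as a 7-bit codeword (positions 1..7).

    Parity bits sit at positions 1, 2 and 4; data bits at 3, 5, 6, 7.
    Each parity bit covers the positions whose index has that bit set.
    """
    code = [0] * 8                                   # index 0 is unused
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]            # group 1: positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]            # group 2: positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]            # group 4: positions 4,5,6,7
    return code[1:]


def hamming74_correct(bits: list[int]) -> tuple[int, int]:
    """Return (corrected data nibble, position of the flipped bit or 0)."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # failing parity groups spell out the position
    if syndrome:
        code[syndrome] ^= 1          # flip the isolated bit back
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome


word = hamming74_encode(0b1011)
word[4] ^= 1                         # corrupt the bit at position 5
recovered, position = hamming74_correct(word)
print(recovered == 0b1011, position)
```

The sketch also makes the storage cost visible: three check bits accompany every four data bits in this particular code, although wider codes used in practice have a lower proportional overhead.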
Conversely, in personal computers, at most, only parity checking is usually provided because of the cost of additional logic to implement error correction. The increased reliability of components has allowed personal computers and work stations to use processors which have no error checking facility at all. It has been considered sufficient in these systems to provide back-up storage arrangements in software to limit the amount of data lost when a malfunction results in loss of or the need to discard a corrupted file. Utility applications are also known which may allow reconstruction of a corrupted file. However, very recently, to increase the processing power available to a plurality of users, it has become the practice to interconnect many personal computers and/or terminals with a network. This approach also limits the costs resulting from obsolescence of data processing equipment since individual units can often be replaced on a piecemeal basis.
However, in such a network application, the problems which may be caused by a malfunction or transient error may approach the gravity of a malfunction in a mainframe computer while fewer, if any, precautions have been taken to detect errors and none may be available for error recovery. On the other hand, such systems and other multi-processor environments do present the recovery possibility of isolating or disconnecting a malfunctioning processor and continuing processing with another processor. Such arrangements are known in the art as alternate processor recovery but require the cooperation of the operating system in order to be implemented.
Even in mainframe computers, the provision of error detection and error recovery has been costly in terms of both hardware and software expense and processor performance. By some estimates, as much as 30% of the data processing circuits in a mainframe computer may be used for error checking. These circuits take up space in the central processing unit, making it physically larger with longer signal paths and consequent limitations on cycle times. Some error check circuits are undoubtedly present, possibly serially, in some so-called critical paths within the processor and may limit cycle times further. The existence of these error detection and correction circuits also makes central processing units more difficult to design and to debug. Nevertheless, these costs have been considered justified because an entire business or enterprise may rely on the operation of a computer.
It has been long recognized that a highly effective error checking technique could be accomplished by running two identical processors in parallel, performing the identical operation on identical data and comparing the results. Extension of the technique to three processors can also allow error recovery by choosing a coincident result of two of the three processors as the correct result. However, the cost of processors has historically limited such triple redundancy arrangements to applications where high reliability was of extreme importance and technical intervention was limited or impossible, such as in space vehicles. Redundant processors, however, continue to be used as an error detection technique, particularly as the cost of processors has decreased, although the use of two processors does not inherently provide for error recovery.
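The distinction between two-processor comparison (detection only) and three-processor voting (detection and recovery) may be sketched as follows. This is an illustrative software model only; the function names `duplex_agrees` and `majority_vote` are hypothetical, and an actual redundant processor arrangement would perform the comparison in hardware.

```python
from collections import Counter


def duplex_agrees(result_a: int, result_b: int) -> bool:
    """Two processors: a mismatch reveals an error but not which result is wrong."""
    return result_a == result_b


def majority_vote(results: list[int]) -> int:
    """Three processors: accept the result on which at least two coincide."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two processors agree; recovery is not possible")
    return value


correct = 6 * 7
faulty = correct ^ 0b100            # one processor suffers a transient bit error

print(duplex_agrees(correct, faulty))             # error detected, not corrected
print(majority_vote([correct, faulty, correct]))  # error outvoted
```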
In addition to the parity checking (which was originally inapplicable to control codes but has recently been extended thereto) and error correcting code techniques applied to data as mentioned above, check circuits have been provided to detect certain illogical combinations of operations and/or control codes. However, no comprehensive or systematic technique for doing so has been devised in view of the flexibility for combinations of operations and controls which must be maintained in computers. Further, circuitry adapted to sense particular conditions or related types of conditions often requires significant space, and numerous such circuits are often required for differing types of conditions.
The concept of a "retry" mechanism has been used for error recovery in which sufficient status information is retained by the processor or its peripheral memory to allow the processor to return to an earlier point and resume operation from that point in the event of a failure. This allows the processor to continue operation even when a transient failure occurs or even in limited circumstances when a permanent failure occurs. The retention of data and status information also allows the "resumption" of processing to be transferred to another processor, if available, as may be possible in redundant processor systems and networks. However, retry mechanisms have also required substantial space and can require significant structure to maintain synchronism with the remainder of the processing system because of the time required for retry to occur.
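A retry mechanism of the general kind described above may be modeled, purely for illustration, as the retention of a checkpoint of processor status to which operation returns when a failure is detected. In the following sketch, the state dictionary, the `TransientFault` exception, the `run_with_retry` function and the retry limit are all assumptions introduced for the example and do not describe any particular prior art implementation.

```python
import copy


class TransientFault(Exception):
    """Stands in for an error signalled by a hardware check circuit."""


def run_with_retry(state: dict, step, max_retries: int = 3):
    """Run `step` on a copy of `state`, returning to the checkpoint on a fault.

    Returns the resulting state and the number of retries that were needed.
    """
    checkpoint = copy.deepcopy(state)          # retained status information
    for attempt in range(max_retries + 1):
        working = copy.deepcopy(checkpoint)    # resume from the earlier point
        try:
            step(working)
            return working, attempt
        except TransientFault:
            continue                           # discard partial results, retry
    raise RuntimeError("failure persists; hand the checkpoint to another processor")


calls = {"count": 0}


def flaky_step(s: dict) -> None:
    s["acc"] += 5                              # partial update before the fault
    calls["count"] += 1
    if calls["count"] == 1:
        raise TransientFault("simulated transient error")
    s["pc"] += 1


final, retries = run_with_retry({"pc": 10, "acc": 0}, flaky_step)
print(final, retries)
```

Because the checkpoint is a self-contained copy of the status information, the same record could in principle be passed to another processor when the retry limit is exhausted, which corresponds to the transfer of "resumption" noted above.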
Therefore, in summary, it is seen that many different approaches to error detection and recovery have been attempted, but all have imposed substantial costs in terms of design time, hardware and processor performance, which has led to the omission of such provisions in small computers. However, as networking of small computers and work stations has led to larger installations, the integrity and reliability of the installations have become greatly more important, if not critical, to users and the enterprises which rely on them. It is therefore clear that inclusion of error checking and correction may shortly become a requisite feature of even the smallest of computers. Further, there has heretofore been no integration of the capabilities of the above mentioned techniques, particularly at small size and without severe processor design complications.