This invention relates generally to computer systems and more particularly to a method and apparatus for prioritizing and handling hardware errors in a computer system.
In recent years, computer systems have progressively become larger and more complex. The larger a computer system is, the more components it contains, and the more components there are, the greater the chances of hardware failure. As a result, for very large and complex computer systems, hardware failures are practically inevitable. Since hardware failure is almost a given, the important issue in large-scale computer systems becomes the manner in which hardware failures or errors are handled.
Hardware failures fall into several different categories. A first category is that of correctable failure. For this type of failure, operation of the computer system need not be immediately interrupted since the error can be corrected. A second category is that of non-correctable error. With this type of failure, system operation is immediately interrupted in order to prevent the system from using corrupted data or executing a corrupted instruction. This type of hardware failure typically causes the system to re-execute an instruction or to repeat a particular process. A third category is that of failure from which there is no possibility of recovery. With this type of failure, the system needs to be shut down and restarted. As can be seen from this discussion, the different categories of hardware failures require different handling. In order to maximize system efficiency, hardware failures should be prioritized and handled accordingly. Currently, however, there is no system believed to be available which carries out this function satisfactorily and efficiently.
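The three categories above, and the handling each one calls for, can be illustrated with a minimal sketch. The names `ErrorCategory`, `HANDLING`, and `handling_for`, and the action strings, are hypothetical labels chosen for illustration and do not appear in the description itself.

```python
from enum import Enum


class ErrorCategory(Enum):
    """Hypothetical labels for the three failure categories described above."""
    CORRECTABLE = 1        # error can be corrected; no immediate interruption
    NON_CORRECTABLE = 2    # operation interrupted; instruction or process repeated
    UNRECOVERABLE = 3      # no recovery possible; shut down and restart


# Assumed action names; the text describes the behavior, not these strings.
HANDLING = {
    ErrorCategory.CORRECTABLE: "continue",
    ErrorCategory.NON_CORRECTABLE: "interrupt_and_retry",
    ErrorCategory.UNRECOVERABLE: "shutdown_and_restart",
}


def handling_for(category: ErrorCategory) -> str:
    """Return the handling action associated with a failure category."""
    return HANDLING[category]
```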
In accordance with the present invention, there is provided a computer system wherein hardware failures are efficiently prioritized and handled. In the preferred embodiment, the computer system comprises a central processing unit (CPU), at least one cache, and a memory management unit (MMU) wherein a plurality of low priority and high priority error queues are maintained. Each queue is associated with a selected unit of the MMU. Whenever a low priority error (e.g. a correctable error) is detected in one of the MMU units, an entry is loaded into the low priority queue associated with that MMU unit. Once loaded with an entry, the low priority queue sends out a control signal indicating that a low priority error has occurred. In response, the MMU sends an interrupt request signal to the CPU. Depending on the level of the interrupt request (which may be set by a user) and the status of a mask register within the CPU (which may also be set by a user), the interrupt may either be serviced by the CPU or it may be ignored for the time being. Regardless of which action is taken by the CPU, system operation continues because the error is correctable. Primarily, entries in the low priority error queues are used for purposes of logging the hardware failure for subsequent analysis.
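The low priority path described above may be sketched in software as follows. This is only an illustrative model of what the description treats as hardware. The class names, the unit name used in the usage note, and the choice to model the mask register as one bit per interrupt level are all assumptions, not details taken from the description.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LowPriorityQueue:
    """Hypothetical model of a low priority error queue for one MMU unit."""
    unit: str
    entries: deque = field(default_factory=deque)

    def load(self, detail: str) -> bool:
        """Load an entry and return the 'low priority error' control signal."""
        self.entries.append(detail)
        return True


@dataclass
class Cpu:
    """Hypothetical CPU model with a user-settable mask register."""
    mask_register: int = 0  # assumption: bit n set means level n is masked
    error_log: list = field(default_factory=list)

    def interrupt_request(self, level: int, queue: LowPriorityQueue) -> bool:
        """Service the request unless its level is masked.

        Returns True if serviced, False if ignored for the time being.
        Either way the system keeps running, since the error is correctable.
        """
        if self.mask_register & (1 << level):
            return False  # ignored for now; the entry stays in the queue
        while queue.entries:
            # Servicing here means logging the failure for later analysis.
            self.error_log.append((queue.unit, queue.entries.popleft()))
        return True
```

As a usage example, a queue created with `LowPriorityQueue("tlb")` and a CPU created with `Cpu(mask_register=0b100)` would leave a level 2 request pending until the mask bit is cleared, after which the queued entry is moved into `error_log`.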
On the other hand, if a high priority error (e.g. a non-correctable error) is encountered by one of the MMU units, then an entry is loaded into the high priority error queue associated with that MMU unit. Once that is done, the high priority queue sends out a control signal indicating that a non-correctable error has been detected. In response, the MMU sends a RED ALERT control signal to the CPU to cause the CPU to give immediate attention to the error. Thus, a non-correctable error is given much higher priority than a correctable error. In general, a non-correctable error may cause termination of the currently executing instruction or program, but it usually does not necessitate halting the whole system.
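The difference in priority between the two paths can be illustrated by a small dispatch sketch. The class `Mmu`, the method `report`, and the returned signal strings are hypothetical names; the description specifies only that a correctable error leads to an interrupt request and a non-correctable error leads to a RED ALERT signal.

```python
from collections import deque


class Mmu:
    """Hypothetical MMU model holding one queue pair per unit."""

    def __init__(self, units):
        # One low priority and one high priority queue for each MMU unit.
        self.low = {unit: deque() for unit in units}
        self.high = {unit: deque() for unit in units}

    def report(self, unit: str, detail: str, correctable: bool) -> str:
        """Queue an error for a unit and return the signal sent to the CPU."""
        if correctable:
            self.low[unit].append(detail)
            return "INTERRUPT_REQUEST"  # maskable; may be deferred by the CPU
        self.high[unit].append(detail)
        return "RED_ALERT"  # demands immediate attention from the CPU
```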
Finally, one or more of the high priority error queues may overflow, thereby indicating that more non-correctable errors have been detected than the system can handle. If this happens, then each overflowing high priority queue will issue an overflow signal. In response to this overflow signal, the MMU will issue a control signal to stop the system clock. This serves to freeze the system at its current state. Thereafter, the contents of the system are scanned out to ascertain the internal states of the system. This process is preferably carried out only when it becomes clear that recovery from non-correctable errors or failures is not possible, i.e. when one or more of the high priority queues overflow.
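The overflow behavior may be sketched as a bounded queue whose overflow signal stops a clock. The capacity value, the class names, and the returned strings are assumptions made for illustration; the description does not state a queue depth or how the overflow signal is encoded.

```python
class SystemClock:
    """Hypothetical stand-in for the system clock."""

    def __init__(self):
        self.running = True

    def stop(self):
        # Freezes the system at its current state for a later scan-out.
        self.running = False


class HighPriorityQueue:
    """Hypothetical bounded high priority error queue."""

    def __init__(self, capacity: int):
        self.capacity = capacity  # assumed depth; not given in the text
        self.entries = []

    def load(self, detail: str) -> bool:
        """Load an entry; return True if the queue overflows instead."""
        if len(self.entries) >= self.capacity:
            return True  # overflow signal
        self.entries.append(detail)
        return False


def report_non_correctable(queue, clock, detail: str) -> str:
    """Return the action taken by the MMU for a non-correctable error."""
    if queue.load(detail):
        clock.stop()
        return "CLOCK_STOPPED"
    return "RED_ALERT"
```

With a capacity of 2, the first two errors each produce a RED ALERT, and the third overflows the queue and stops the clock.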
As shown by the above discussion, the present invention prioritizes hardware failures based on the type of hardware error. In addition, each type of failure is handled in an efficient manner suitable for the type of error. Overall, the present invention provides an efficient and effective means for prioritizing and handling hardware failures.