1. Field of the Invention
The invention relates to computer failure handling systems, and more particularly, to a failure handling system that logs failures before a hardware reset occurs.
2. Description of Related Art
The microcomputer world has become a world of interdependence. No longer does a microcomputer system sit on an executive's desk insulated from the outside world. The art has seen the development of massively networked environments in which microcomputers act as both workstations and servers, in which networks connect multiple servers, and in which various telecommunication services connect networks to networks.
A microcomputer that acts as a network server has become particularly important. It demands high reliability, because failures of such a server will typically shut down the network. Such failures will always occur, however, even in the most fault-tolerant of systems. So when a server fails, the ease of repair becomes a critical factor. The first, and often most time-consuming, step in repair is diagnosis. The more quickly and easily a technician can diagnose the cause of the failure, the sooner the network will again be on line. When a server fails for a quickly identifiable reason, that server can typically be brought back on line in a relatively short amount of time. A failure for an unknown reason, however, can lead to extensive debugging and troubleshooting time, leaving the network without its key component.
Therefore, any advances that enhance the ability to diagnose the cause of a computer system failure, especially in a network, would be greatly desirable.
Previous advances in the art dealt with computer failure recovery and alert systems that determined when and whether a computer had in fact failed. Such systems included an automatic system recovery (ASR) timer, which would time out if the operating system did not periodically reload that timer. Under normal operating conditions, the operating system would continuously reload the timer so that it would never time out. When the computer failed, however, the operating system would be unable to reload the ASR timer, so the timer would time out, signalling a system failure and causing a system hardware reset. After the reset and subsequent restart, the computer system would determine the source of the problem as well as it was able, such as by checking for bad memory blocks or by executing diagnostic routines.
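The reload-or-reset behavior of such a timer can be illustrated with a short simulation. The following is only a sketch under stated assumptions: the class name ASRTimer, the function run, the tick-based timing, and the specific timeout and reload intervals are all hypothetical and are not taken from the referenced system, which implements the timer in hardware.

```python
class ASRTimer:
    """Simulated automatic system recovery (ASR) timer.

    Hypothetical model: time is counted in abstract ticks, and a
    "reset" is just a flag rather than a real hardware reset line.
    """

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.reset_asserted = False

    def reload(self):
        """Called periodically by a healthy operating system."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Count down one tick; assert reset when the count reaches zero."""
        if self.reset_asserted:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.reset_asserted = True


def run(timer, total_ticks, hang_at=None, reload_every=2):
    """Simulate an operating system that reloads the timer until it hangs.

    hang_at is the tick at which the operating system stops responding
    (None means it never hangs). Returns the tick at which the reset
    was asserted, or None if the timer never timed out.
    """
    for t in range(total_ticks):
        os_alive = hang_at is None or t < hang_at
        if os_alive and t % reload_every == 0:
            timer.reload()
        timer.tick()
        if timer.reset_asserted:
            return t
    return None


# A healthy system keeps reloading the timer, so no reset occurs.
healthy = run(ASRTimer(timeout_ticks=5), total_ticks=100)

# A system that hangs at tick 10 stops reloading, so the timer times out.
hung = run(ASRTimer(timeout_ticks=5), total_ticks=100, hang_at=10)
```

In this sketch the timer records only that a timeout occurred. Nothing about what the processor was executing when the reloads stopped survives the reset.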
Such a computer system is fully described in U.S. patent application Ser. No. 07/955,849 to Burckhartt, filed Oct. 2, 1992, now U.S. Pat. No. 5,390,324, and entitled "Computer Failure Recovery and Alert System." That application, which has been assigned to the assignee of the present application, is hereby incorporated by reference. That application describes the details of a system using an ASR timer.
Such a system, however, is limited in its ability to determine the cause of the system failure leading to a reset. Typically, such a system failure occurs when an application program or the operating system becomes caught in an infinite loop while interrupts are disabled. If an application program is at fault, it is generally impossible to return control to the operating system to terminate the offending program. If the failure is within the operating system itself, it would typically be a catastrophic error, and it would be undesirable to continue execution within the failed operating system.
In either case, a hardware reset would occur upon timeout of the ASR timer. On rebooting, the operating system then would have no way of determining where the operating system or application program had become stuck in such an infinite loop. As noted above, this inability to identify the source of the problem could lead to aggravating debugging difficulties for a technician.
Therefore, it would be greatly desirable to provide the capability of logging the cause of a hardware reset resulting from an ASR timer timeout for later diagnostic purposes.