The present invention relates generally to computer operating systems and, more particularly, to a method, system, and storage medium for preventing the recurrence of a system outage in a computer network.
Business enterprise networking systems and related system extensions and upgrades are growing in number and sophistication due, in part, to the increasing popularity of the Internet. Computer manufacturers and computer component manufacturers continuously strive to keep up with the challenges of ensuring consistent and reliable operation of these systems. To this end, error detection and recovery mechanisms have been devised in an effort to prevent malfunctions and system outages. Specific causes of computer system malfunctioning include corruption of memory data, corruption that is related to fixed disks or removable media, operating system errors, component errors, applications or operating systems performing illegal instructions with respect to the processor, and incompatibility between various hardware and software system components, to name a few. Existing solutions have been developed for detecting and reporting errors for subsequent analysis and repair by a system operator or by the system itself.
For example, memory data corruption can be handled by parity detection and/or error correcting code (ECC). Illegal instructions can be trapped by the processor and in some cases handled either within the processor or by the operating system. Other malfunctions may result in system “hangs.” A system is “hung” when it is no longer able to respond to user inputs and/or to system events such as incoming network traffic. Malfunctions that can result in system hangs include operating systems or hardware components entering unknown or indeterminate states, causing the operating system or hardware component to cease normal operation. In these cases, the computer user must restart the computer. Restarting the computer after a system hang can cause problems such as data loss and corruption. Recent attempts to alleviate data corruption problems occurring under these circumstances include ‘watchdog’ timers, in which a processor periodically resets a timer and, if the timer is instead allowed to reach a certain value, the computer system is reset. This solution does not cure the malfunction but only resets the computer system. Further, resetting the computer system may result in data loss and corruption as described above. Error checking processors have been developed for detecting and recovering from system hangs; however, they are costly to implement.
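By way of illustration only, the watchdog timer behavior described above can be sketched as a small tick-based simulation. This is a minimal sketch under stated assumptions, not an implementation drawn from any particular system: the names `WatchdogTimer`, `kick`, `tick`, and `simulate`, as well as the tick limit and hang time, are hypothetical and chosen solely for this example.

```python
class WatchdogTimer:
    """Simulated watchdog: counts ticks and signals a reset at a limit."""

    def __init__(self, limit):
        self.limit = limit  # hypothetical tick count at which a reset fires
        self.count = 0      # ticks elapsed since the last kick

    def kick(self):
        """Called periodically by a healthy processor to restart the count."""
        self.count = 0

    def tick(self):
        """Advance the timer by one tick; return True if a reset fires."""
        self.count += 1
        if self.count >= self.limit:
            self.count = 0
            return True
        return False


def simulate(limit, hang_at, total_ticks):
    """Kick the watchdog on every tick until the processor hangs at hang_at.

    Returns a list holding the tick at which the system reset fired,
    or an empty list if no reset occurred within total_ticks.
    """
    dog = WatchdogTimer(limit)
    reset_ticks = []
    for t in range(total_ticks):
        if t < hang_at:
            dog.kick()  # healthy processor keeps resetting the timer
        if dog.tick():
            reset_ticks.append(t)
            break       # the reset restarts the system; stop simulating
    return reset_ticks


if __name__ == "__main__":
    print(simulate(limit=3, hang_at=5, total_ticks=20))
    print(simulate(limit=3, hang_at=100, total_ticks=20))
```

In this sketch the reset fires only after the simulated processor stops kicking the timer, and nothing in the timer records why the hang occurred, which mirrors the limitation noted above: the system is reset, but the underlying malfunction is not cured.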
Particularly for mid-size and large computer network systems, one notable problem arises when an outage caused by a computer user recurs because the offending user logs back into the system and performs the same operation that caused the failure in the first place, before support personnel are able to perform debug analysis on the prior outage. Debug analysis of a system outage generally includes examination of a system storage dump taken at the time of the failure. This analysis and repair can take minutes, hours, or even days, depending upon the complexity of the networking system and the severity of the error. In the meantime, the system remains exposed to the risk of a duplicate outage as the offending user attempts to gain access to the system and perform the same operation that caused the original outage. What is needed, therefore, is a means to protect a system from multiple outages that result when a user repeats a series of events that previously tripped an integrity exposure in the operating system and resulted in a prior outage.
The above-discussed and other drawbacks and deficiencies of the prior art are overcome or alleviated by the duplicate outage prevention tool of the invention.