1. Field of the Invention
This invention relates to computer systems and software applications. Particularly, this invention relates to computer systems and methods for managing and recovering from errors occurring in the operation of computer systems and software applications.
2. Description of the Related Art
Operating systems and applications occasionally encounter errors that can result in a software outage, entailing the inability to perform work for a period of time and sometimes resulting in lost or corrupted data. Software application outages can be expensive for users and result in reduced customer satisfaction. When an error occurs that makes the system unable to perform work adequately, the typical practice is to restart the system or application, i.e., the user voluntarily incurs an outage. However, this is not a true solution, but merely a way to end the present error condition, and it is usually the only remaining option to address the situation.
In addition, when such operating system or software application problems are detected, it is desirable to have a way to relieve the problem symptoms with minimal disruption. However, fixing the root cause of a problem that affects a mission-critical software application often takes a very long time. When a field problem with a software application occurs, it can take weeks or months to determine the root cause and then more weeks or even months to develop and test corrective maintenance. Consequently, people who manage mission-critical systems and applications live in fear of the type of problem that happens without warning and requires a restart because the vendor can neither quickly fix the problem nor provide an adequate circumvention. In the meantime, business-critical systems and applications must continue to operate. Accordingly, a quick transition back to normal operation has high value.
In addition, although many modern sophisticated programming systems include error recovery functions, these functions are designed before the system is ever used. However, many unanticipated errors occur in an end user environment. Thus, these error recovery functions do not accommodate many types of errors. In addition, enhancing the recovery routines of these systems often requires a full software development cycle, yet time is a critical factor in dealing with software failures.
In view of the foregoing, there is a need in the art for systems and methods to deal with problems that lead to software outages. There is further a need for such systems and methods to reduce the likelihood of data loss or corruption resulting from a software outage. There is still further a need for such systems and methods to reduce the time that critical software is unavailable due to such errors. These and other needs are met by the present invention as detailed hereafter.