Applications may terminate because of any number of threats, program errors, software faults, attacks, or other software failures. Computer viruses, worms, trojans, hackers, key recovery attacks, malicious executables, probes, etc. are a constant menace to users of computers connected to public computer networks (such as the Internet) and/or private networks (such as corporate computer networks). In response to these threats, many computers are protected by antivirus software and firewalls. However, these preventative measures are not always adequate. For example, many services must maintain high availability when faced with remote attacks, high-volume events (such as fast-spreading worms like Slammer and Blaster), or simple application-level denial of service (DoS) attacks.
Aside from these threats, applications often contain errors that surface during operation, typically as a result of programmer mistakes. Whether an application is attacked by one of the above-mentioned threats or simply contains such errors, the resulting software faults and failures manifest as illegal memory accesses, division-by-zero errors, buffer overflows, etc. These errors cause an application to terminate its execution, or “crash.”
Various solutions have been proposed. Proactive approaches seek to make code as dependable as possible through the use of safe languages, libraries and compilers, code analysis tools, and development methodologies. Debugging aids attempt to make post-fault analysis and recovery as easy as possible for the programmer. Byzantine fault tolerance schemes use voting among a number of service instances to select the correct answer; however, these schemes assume that only a minority of the replicas will exhibit faulty behavior. Many of these approaches are proactive, yet none of them produces error-free code. They also tend to introduce problems of their own, such as reduced system performance, tedious and intrusive user interaction, and self-induced denial of service (i.e., when an overflow is detected, the only alternative is to terminate the application). Server applications pose a further difficulty: they often cannot simply be restarted, because they are typically long-running (and therefore accumulate a fair amount of state) and usually contain a number of threads that service many remote users, so restarting the server denies service to all of those users. As a result, software remains notoriously buggy and crash-prone, and these solutions are inappropriate for high-performance, high-availability environments, such as a frequently visited e-commerce web server.
In addition, these applications may be installed on a number of platforms, such as a personal digital assistant (PDA), a cellular telephone, or an automobile personal computer. For example, open platform operating systems have been used on automobile personal computers to allow users to install third-party applications designed for the platform. These applications are also vulnerable to software failures. While antivirus programs are being developed for these platforms to protect applications from such failures, they often require user interaction (e.g., downloading a patch or another application, or connecting the device to a personal computer) and reduce system performance by consuming the platform's already limited storage space, memory, and transmission bandwidth.
Therefore, there is a need in the art for methods and systems that provide a more reactive and automated approach to handling a variety of software failures, such that applications can recover from such failures without requiring user intervention or reducing system performance.
Accordingly, it is desirable to provide methods and systems that overcome these and other deficiencies of the prior art.