Known distributed computer systems typically include multiple computers such as application servers (including e-commerce servers and other web servers, database servers, etc.), firewalls, routers and/or switches. A web server interfaces to client computers via the Internet to provide some type of service to the client computers. An e-commerce server is a web server that enables advertising, presentation of information about products, and sale of products via the web. Other types of application servers interface to client computers via some type of network to make the respective applications available to the client computers. Often, a web server or other type of application server accesses a database server to obtain data such as web pages needed by the client computers. A firewall is typically connected between a server and the Internet to filter out unwanted messages, such as spam, viruses, worms, etc., attempting to enter into or exit from a network containing the server.
A failure in one computer may impact other computers in the distributed computer system. For example, if a user of a client computer cannot utilize a web application hosted by a web server, the problem can be in the firewall which separates the web server from the Internet, the web application server itself, the web server operating system, microcode or hardware, a database server used by the web server to obtain data needed by the client computer, or a submodule of the web application server. When a failure of unknown origin occurs, it was known to reboot all of the computers and their software involved in providing the service to the client computer, one-by-one, to attempt to fix the problem. It was also known to reboot the computers and their software in an order determined by an administrator, from the most likely cause of the problem to the least likely cause of the problem. It was also known to reboot the computers and their software in an order determined by an administrator, from the easiest/fastest computer and its software to reboot to the most difficult/slowest computer and its software to reboot.
It was also known to perform “micro-reboots” of separate applications of a server, as well as reboots of entire computers and other hardware devices such as routers, switches and firewalls.
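The known administrator-determined reboot orders described above can be illustrated with a minimal Python sketch. It is not taken from any of the known systems. The names `Component`, `fault_likelihood`, `reboot_seconds`, `by_likelihood`, `by_reboot_time` and `reboot_until_fixed` are hypothetical, and the likelihood and reboot-time figures are assumed to be estimates supplied by an administrator.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Component:
    """A rebootable computer, device or application (hypothetical model)."""
    name: str
    fault_likelihood: float  # administrator's estimate, higher = more suspect
    reboot_seconds: float    # administrator's estimate of reboot duration


def by_likelihood(components: Iterable[Component]) -> List[Component]:
    """Order from most likely cause of the problem to least likely."""
    return sorted(components, key=lambda c: c.fault_likelihood, reverse=True)


def by_reboot_time(components: Iterable[Component]) -> List[Component]:
    """Order from fastest to reboot to slowest to reboot."""
    return sorted(components, key=lambda c: c.reboot_seconds)


def reboot_until_fixed(
    ordered: Iterable[Component],
    reboot: Callable[[Component], None],
    service_ok: Callable[[], bool],
) -> List[str]:
    """Reboot components one-by-one in the given order.

    Stops as soon as the service check passes and returns the names
    of the components that were rebooted.
    """
    rebooted = []
    for component in ordered:
        reboot(component)
        rebooted.append(component.name)
        if service_ok():
            break
    return rebooted
```

In this sketch the `reboot` and `service_ok` callables are placeholders for whatever mechanism an installation uses to restart a component and to check whether the client-facing service has recovered.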
A document entitled “Improving Availability with Recursive Microreboots: A Soft-State System Case Study”, by George Candea, James Cutler, and Armando Fox, published by Stanford University in 2004 discloses capturing system information in an f-map, which has system components as nodes and fault-propagation paths as edges. Two phases are then used for analyzing system information and preparing a recovery map. During the first phase, a map of interactions between components is drafted, based on injecting faults into an operational system and determining the outcome. During the second phase, the system observes naturally occurring faults and the reaction of the system to them, creating a map of the impact of recovery events as observed in the system.
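The f-map described above, with system components as nodes and fault-propagation paths as edges, can be pictured as a directed graph. The following Python sketch is only an illustration under stated assumptions and is not the implementation disclosed in the cited document. The function names `build_fmap` and `reachable_from` are hypothetical, as is the observation format, in which each pair records that a fault in the first component was observed to affect the second.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def build_fmap(observations: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build a directed graph from (faulty component, affected component) pairs.

    Nodes are components; an edge A -> B records that a fault in A
    was observed to propagate to B (assumed observation format).
    """
    fmap: Dict[str, Set[str]] = {}
    for source, affected in observations:
        fmap.setdefault(source, set()).add(affected)
        fmap.setdefault(affected, set())  # make sure every component is a node
    return fmap


def reachable_from(fmap: Dict[str, Set[str]], start: str) -> Set[str]:
    """Return every component a fault in `start` can reach along the edges."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in fmap.get(node, ()):
            if neighbor != start and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

Under these assumptions, pairs gathered during fault injection and pairs gathered from naturally occurring faults could both be passed to `build_fmap`, and `reachable_from` then lists the components that a fault in a given component may affect.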
An object of the present invention is to reboot computers and other components of a distributed computer system in an optimum order to expeditiously identify and fix a problem component in the distributed computer system.