This invention relates to a computer system having a function of preventing the occurrence of a failure and recovering from a failure that has occurred, and in particular, to a method of controlling a failure prevention/recovery process.
Along with developments in computer technology, computer systems have become closely tied to various social activities. The use of advanced computer systems has encouraged the spread of online trading and e-commerce, in which transactions involving large amounts of money and commodities are processed in a short time, and this has made the computer system indispensable in the finance, distribution, and service industries. The computer system is also used in, for example, plant control at a nuclear power plant or the like, operation control of aircraft, trains, and the like, and administrative services such as e-government services, thereby playing an important role in social infrastructure.
As described above, the computer system has become an indispensable element in various social activities. Therefore, once a failure occurs in the computer system (hereinafter referred to as a system failure) and leads to a decline in function, an abnormal operation, or a service suspension in the computer system, the failure has a tremendous impact on society, resulting in massive financial loss, loss of credibility, and social confusion.
Meanwhile, the use of open systems has become widespread. An open system combines various types of hardware and software produced by different manufacturers to construct a computer system, mainly for the purpose of reducing the cost of the computer system. The spread of the open system is attributable to the increasing openness of external specifications of software and hardware, along with the performance enhancement and cost reduction of general-purpose hardware components and the technological advancement of open source software.
The open system allows a user to construct a computer system by combining products that are optimal in terms of cost, function, or the like. On the other hand, since the computer system is realized by combining a plurality of products produced by different manufacturers, trouble may occur due to incompatibilities between the products. Further, the computer system has become more complicated due to the segmentation of functions, which makes it more difficult to investigate the cause of a system failure once it occurs.
One example of the above-mentioned computer system is a Web system. The Web system is used in, for example, e-commerce to connect a plurality of suppliers and a plurality of consumers, and makes it easy to conduct rapid transactions and payments through an electronic banking system in order to deal with a flow of goods that varies from hour to hour. A system failure occurring in such a Web system leads to a disruption or the like in the flow of goods or the payment process, and has a very large influence on economic activities.
The Web system is typically constituted of a 3-tier system, which includes a Web server, an application server, and a database server. Each of the tiers is constituted of one or more computers. Each of the computers often includes a plurality of types of hardware or software produced by different manufacturers. In the Web system, the Web server accepts a request from a client. The application server performs a process corresponding to the request accepted by the Web server. In performing the process, the application server runs a query and issues a request to the database server, and returns the results obtained with respect to the query and the request to the client through the Web server, to thereby complete the process.
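The request flow through the three tiers described above can be illustrated with a minimal sketch. The class names, the `build_web_system` helper, and the in-memory dictionary standing in for a database are hypothetical and chosen only for illustration; an actual Web system would place each tier on one or more separate computers.

```python
class DatabaseServer:
    """Database tier: answers queries (an in-memory dict is assumed here)."""

    def __init__(self, rows):
        self.rows = dict(rows)

    def query(self, key):
        # Return the stored value, or None when the key is unknown.
        return self.rows.get(key)


class ApplicationServer:
    """Application tier: performs the process corresponding to a request."""

    def __init__(self, db):
        self.db = db

    def process(self, request):
        # Run a query against the database tier and package the result.
        value = self.db.query(request)
        return {"request": request, "result": value}


class WebServer:
    """Web tier: accepts the client request and returns the response."""

    def __init__(self, app):
        self.app = app

    def handle(self, request):
        # Forward the request to the application tier and relay its answer.
        return self.app.process(request)


def build_web_system(rows):
    """Wire the three tiers together (hypothetical helper)."""
    return WebServer(ApplicationServer(DatabaseServer(rows)))
```

In this sketch a single call to `handle` passes through all three tiers, which is why an error in any one tier can surface as a failure of the whole request.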
As described above, in recent computer systems, a process corresponding to one request is realized by a combination of a plurality of hardware components and a plurality of software components (hereinafter collectively referred to as system components). Accordingly, an error/failure occurring in one of the system components may affect other system components and may develop into a system failure.
However, because the computer system is realized by the combination of various system components produced by different manufacturers, it is difficult to fully clarify how the system components interact with one another. Further, different processes performed in response to a plurality of different requests may affect one another when the processes are performed in parallel. Therefore, it is difficult to identify what kind of error/failure has occurred, where it has occurred, and how large its impact is. It is also difficult to immediately detect the occurrence of the system failure itself.
Various solutions have been proposed in order to solve the above-mentioned problems.
JP 2005-216066 A discloses a method of monitoring a system in which a plurality of computer systems operate while interacting with one another, to thereby detect online an error/failure occurring in the computer system. According to the method, a service realized on the computer system by a program is stored in association with a log of the transactions performed across the plurality of system components to realize the service, and the occurrence of an abnormal transaction pattern that differs from the normal pattern is detected based on a probability model, to thereby detect the occurrence of an error/failure. However, according to the method, it is not possible to detect an error/failure when the transaction pattern exhibits no change.
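The idea of detecting an abnormal transaction pattern with a probability model can be illustrated with a minimal sketch. It is not the method of the cited publication: the model here is simply the relative frequency of each component path observed in normal logs, and the function names, the path representation, and the threshold value of 0.05 are assumptions made for illustration.

```python
from collections import Counter


def learn_path_probabilities(normal_paths):
    """Estimate how often each component path occurs in normal operation."""
    counts = Counter(tuple(path) for path in normal_paths)
    total = sum(counts.values())
    return {path: n / total for path, n in counts.items()}


def is_abnormal(path, probabilities, threshold=0.05):
    """Flag a transaction whose path is rarer than the assumed threshold.

    A path never seen during learning has probability 0.0.
    """
    return probabilities.get(tuple(path), 0.0) < threshold
```

The sketch also makes the stated limitation concrete: a failure that leaves the component path unchanged, such as a slow but otherwise ordinary request, yields the same path as a normal transaction and is therefore not flagged.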
JP 2005-216066 A also discloses a method of extracting a correlation between monitoring data, such as a monitoring value in the computer system, and an event, such as an input from an operation management tool or a user, and retrieving a past correlation based on the monitoring data to extract the event currently being generated, to thereby detect a system failure. However, according to this method, it is not possible to determine how much effect or side effect will be produced by a failure prevention process or a failure recovery process. Nor is it possible to store, in the correlation information, information for determining the effect or side effect.
JP 2005-38223 A discloses a method of detecting an operating status of a computer system, retrieving, from among conditional expressions described as rules, a plurality of rules corresponding to the current operating status of the computer system, comparing the effects expected from the countermeasures against a certain problem that are described in the retrieved rules, selecting and executing the countermeasure having the highest effect, and updating the effects described in the rules based on the actual effect obtained by executing the countermeasure. According to this method, it is not possible to perform a failure prevention process or a recovery process that is not described in the rules. Also, an update to the effect obtained by the execution of the countermeasure or to the priority of the process is applied only to the indexes included in the conditional expressions described in the rules, with no consideration given to any other effects that might be brought about on other indexes.
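The rule-based selection and update cycle described above can be illustrated with a minimal sketch. It is not the implementation of the cited publication: the `Rule` structure, the function names, and in particular the weighted-average update with `weight=0.5` are assumptions made for illustration, since the source does not state how the effect is updated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict[str, float]], bool]  # conditional expression
    countermeasure: str                             # action to execute
    expected_effect: float                          # effect recorded in the rule


def select_rule(rules: List[Rule], status: Dict[str, float]) -> Optional[Rule]:
    """Return the matching rule with the highest recorded effect, if any."""
    matching = [rule for rule in rules if rule.condition(status)]
    if not matching:
        # No rule describes this status, so no countermeasure can be chosen.
        return None
    return max(matching, key=lambda rule: rule.expected_effect)


def update_effect(rule: Rule, observed_effect: float, weight: float = 0.5) -> None:
    """Blend the observed effect into the rule (assumed weighted average)."""
    rule.expected_effect = (
        (1 - weight) * rule.expected_effect + weight * observed_effect
    )
```

The sketch shows both limitations noted above: `select_rule` returns nothing for a status that no rule covers, and `update_effect` adjusts only the single effect value stored in the rule, without recording any impact on indexes that the rule's condition does not mention.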
A conceivable example of a control method for failure prevention/recovery in a computer system, which can be attained by combining the solutions disclosed in the above-mentioned JP 2005-216066 A and JP 2005-38223 A, is a control method of detecting an event being generated in the computer system based on logs of a plurality of computer systems with reference to a probability model or a correlation history, preventing and recovering from the failure in the computer system according to preset rules, and updating the rules based on the actual effect obtained by executing the rules.