1. Technical Field
The present application relates to a distributed system, a server computer, a distributed management server, and a failure prevention method.
2. Background Art
A defect in a business application program running on an application server often causes a failure that brings down the entire business system. Examples of such failures include the following.
A first example is a deadlock. A deadlock is a state in which multiple threads executing application logic each hold exclusive control (for example, an acquired lock) over a resource that another of the threads is waiting for, so that every thread remains blocked. As a result, a client application that calls the application, such as a browser, cannot obtain a response to its request and fails to update the screen.
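As a minimal sketch of the lock-ordering deadlock described above (not taken from any cited system), the following Python code has two threads acquire two locks in opposite order. The names `worker`, `lock_a`, `lock_b`, and the 0.2-second timeout are assumptions chosen for illustration; the timeout stands in for the indefinite blocking that a real deadlock would cause, so that the example terminates.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
# Two-party barrier: guarantees that both threads hold their first lock
# before either one tries to take its second lock.
barrier = threading.Barrier(2)
results = {}


def worker(name, first, second, timeout=0.2):
    """Hold `first`, then try to take `second` (hypothetical helper)."""
    with first:
        barrier.wait()
        # In a real deadlock this call would block forever; the timeout
        # (an assumption of this sketch) lets the thread give up instead.
        got_second = second.acquire(timeout=timeout)
        if got_second:
            second.release()
        results[name] = got_second
        # Keep holding `first` until the other thread has also given up,
        # so neither thread can slip through after a release.
        barrier.wait()


t1 = threading.Thread(target=worker, args=("A", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("B", lock_b, lock_a))
t1.start()
t2.start()
t1.join()
t2.join()
print(results)
```

Neither thread obtains its second lock, which mirrors the blocked state in which the caller never receives a response. Acquiring locks in one fixed global order is a common way to avoid this pattern.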
A second example is excessive memory consumption. Excessive memory consumption is a state in which the memory available to an application server is depleted because, for example, a single execution of application logic handles a large amount of data in business-related query processing. In this state, threads executing other application logic in parallel in the same process may be delayed, or an error may occur during processing due to a memory shortage. At worst, the application server process aborts.
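The effect of handling a large query result in a single pass can be sketched as follows. This is an illustration under stated assumptions, not the behavior of any particular application server: `fetch_rows` is a hypothetical stand-in for a query cursor, and the row count is arbitrary. The sketch uses the standard `tracemalloc` module to compare peak memory when all rows are held in memory at once with peak memory when rows are processed one at a time.

```python
import tracemalloc


def fetch_rows(n):
    """Hypothetical stand-in for a query cursor that yields n rows."""
    for i in range(n):
        yield {"id": i, "amount": i % 100}


def total_materialized(n):
    # Loads every row into a list before processing: memory grows with n.
    rows = list(fetch_rows(n))
    return sum(row["amount"] for row in rows)


def total_streamed(n):
    # Processes one row at a time: memory stays roughly constant.
    return sum(row["amount"] for row in fetch_rows(n))


def peak_bytes(func, n):
    """Run func(n) and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func(n)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


N = 20_000  # arbitrary row count chosen for the sketch
total_all, peak_all = peak_bytes(total_materialized, N)
total_stream, peak_stream = peak_bytes(total_streamed, N)
print(f"materialized peak: {peak_all} bytes, streamed peak: {peak_stream} bytes")
```

Both functions compute the same total, but the materialized variant holds every row simultaneously, which is the kind of logic that can starve other threads in the same process of memory.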
A third example is excessive central processing unit (CPU) consumption, that is, excessively high CPU usage. It is a state in which the CPU is used more than necessary because of an infinite loop or redundant application logic. In this case, for example, the caller of the application cannot obtain a response to its request and fails to update the screen. In addition, threads executing other application logic in parallel in the same process may be delayed.
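A runaway loop of the kind described above can be illustrated with a simple CPU-time guard. This is only a sketch: `run_with_cpu_budget`, `CpuBudgetExceeded`, and the budget values are hypothetical names and numbers introduced here, and the guard is cooperative (the loop checks its own CPU usage) rather than a mechanism attributed to any application server.

```python
import time


class CpuBudgetExceeded(RuntimeError):
    """Raised when a loop uses more CPU time than its budget (hypothetical)."""


def run_with_cpu_budget(should_continue, budget_seconds):
    """Run a loop until should_continue(i) is false or the CPU budget is spent."""
    start = time.process_time()  # CPU time consumed by this process so far
    iterations = 0
    while should_continue(iterations):
        iterations += 1
        if time.process_time() - start > budget_seconds:
            raise CpuBudgetExceeded(f"stopped after {iterations} iterations")
    return iterations


# Well-behaved logic: the loop condition eventually becomes false.
finished = run_with_cpu_budget(lambda i: i < 1000, budget_seconds=1.0)

# Defective logic: the loop condition never becomes false (infinite loop).
try:
    run_with_cpu_budget(lambda i: True, budget_seconds=0.05)
    runaway_stopped = False
except CpuBudgetExceeded:
    runaway_stopped = True

print(finished, runaway_stopped)
```

Without such a guard, the defective loop would occupy the CPU indefinitely, delaying other threads in the same process and leaving the caller without a response.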
When such an event occurs, it can usually be cleared by restarting the failed application server. In every case, however, a fundamental solution requires modifying the application program in question. Until the application is modified, the system operator may therefore have to take temporary measures, such as periodically restarting the application server or tuning its parameters, to prevent the failure from recurring.
Japanese Unexamined Patent Application Publication No. 2010-9127 discloses, as a technology for handling a failure, a management apparatus that identifies the component causing a failure. Specifically, when an abnormality occurs in a system in which a computer including an application server and a computer including a management system are connected via a network, the management apparatus can determine whether the failure has occurred in the application server or in the management system.
Japanese Unexamined Patent Application Publication No. 2006-338069 discloses a component software operation infrastructure that, in response to a sign or occurrence of a failure, prevents an error from occurring and prevents a reduction in performance. Specifically, during execution of component software including multiple components (software components), the component software operation infrastructure replaces a component with another component.
Japanese Unexamined Patent Application Publication No. 2005-209029 discloses an application management system including multiple application server computers and a management server computer. Each application server computer refers to an application operation definition storage file stored in the management server computer to control and manage the execution of an application whose execution has been requested. When a failure occurs in an application being executed by an application server computer, the application server computer notifies the other application server computers or external computers of that fact.
Currently, system construction using cloud computing technology is becoming widespread. In this approach, a system is built from a large number of application servers distributed across many networks. Because the load is distributed in such an environment, an abnormality in a single server (caused, for example, by a failure) is less likely to bring down the entire system.
However, if the failure is caused by a problem in the application program itself, every application server is at risk of experiencing the same failure in the future. The system operator must therefore take a temporary measure as described above, unless he or she takes another measure such as downgrading the failed application to a previous version. Further, as the number of servers increases, such a failure becomes more likely to occur somewhere in the system, which increases the operating cost of taking temporary measures.
Neither Japanese Unexamined Patent Application Publication No. 2010-9127 nor No. 2006-338069 considers a situation in which multiple servers, such as application servers, are deployed as in cloud computing.
The application management system according to Japanese Unexamined Patent Application Publication No. 2005-209029 cannot, when an abnormality occurs in an application being executed by one application server computer, prevent a similar abnormality from occurring in another application server computer.