A system, method and computer program product are provided for monitoring and maintaining a distributed computing system.
Distributed computing systems include multiple hosts (e.g., application servers) that execute one or more common applications for access by remote users. Whether the hosts are separate computer servers, individual virtual machines or some other combination of distinct hardware and/or software resources, managing the hosts and the applications can be difficult.
For example, when the number of hosts reaches into the hundreds, the potential for problems increases commensurately. Such problems may involve diminishing resources (e.g., storage space, memory, communication bandwidth), conflicts between different processes for hardware and/or software resources, etc.
Simple monitoring tools generally allow monitoring of individual resources (e.g., disk space), but often do not support monitoring of specific application-level processes or activity, and especially not across tens or hundreds of hosts. Further, the information reported by such tools is generally limited to specific resource statuses, and does not provide a view of the state or status of the overall distributed system. Therefore, if a resource constraint detected by a traditional monitoring tool is actually being caused by some other condition, the tool may not be able to recognize that fact, and may not provide enough information to allow a human operator or administrator to determine the underlying problem.
Yet further, traditional monitoring tools stop at monitoring and collecting information. They do not attempt to intelligently apply possible solutions to correct a problem. Other tools that may be capable of taking remedial action in some circumstances generally do so in a “dumb” manner; that is, they apply the same action every time a particular circumstance is encountered. Even if the specified action has no effect and some other action would (or might) solve the problem, the same ineffective action will be applied the next time.