1. Field of the Invention
In general, the present invention provides a method, system and program product for detecting an operational risk of a node. Specifically, the present invention allows an operational risk of a server to be detected based on a performance of the server with respect to other similarly configured servers.
2. Background Art
As the use of computer technology becomes more prevalent, the complexity of the computer networks being implemented is increasing. Specifically, many businesses today implement computer networks (e.g., LAN, WAN, VPN, etc.) that utilize numerous servers. The roles of such servers are typical (e.g., performing computations, processing requests, serving files, etc.). In many instances, the servers are configured to perform similarly, if not identically, for a certain set of parameters. For example, a pool or set of identical servers, typically called a “server farm,” is often used to service high-volume web sites. Similarly, storage servers are often pooled.
Unfortunately, as heavily as servers have come to be relied upon, they can suffer degraded performance or even total failure for various reasons. Such reasons include, for example, software malfunctions, hardware errors, etc. Early detection of performance degradation is often vital because an administrator can avoid significant loss of productivity by implementing corrective actions in a timely fashion. Examples of typical corrective actions are migrating users or applications away from a “problem” server, restarting a software package, rebooting or replacing a server, etc.
To date, the detection of performance degradation has been a static process. Specifically, the performance of each server is monitored with respect to one or more operational aspects (parameters) and compared to some preset, external level. For example, a processor load on each server can be measured and then compared to an “acceptable” level. If the processor load (e.g., CPU load) of any of the servers exceeds the acceptable level, an alert can be generated and a corrective action implemented. Basing the detection of possible performance degradation on an external level, however, presents many problems. For example, the external level might not be an accurate indication of “normal” performance. Accordingly, unnecessary alerts can be generated and unnecessary corrective actions implemented. In many cases, the best way to determine “normal” performance would be to observe how the other similarly configured servers are performing. If all other servers were performing in a similar fashion (e.g., with a similar processing load) without problems, there might not be any reason to implement a corrective action. Unfortunately, no existing solution provides such functionality.
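The static process described above can be illustrated with a minimal sketch. The function name `static_alerts`, the server names, the load values and the 0.80 level are all hypothetical and chosen only for illustration; they do not come from any existing monitoring product.

```python
def static_alerts(cpu_loads, acceptable_level):
    """Return the servers whose CPU load exceeds a preset, external level.

    cpu_loads maps a (hypothetical) server name to its measured load as a
    fraction between 0.0 and 1.0.
    """
    return [name for name, load in cpu_loads.items() if load > acceptable_level]


# Three similarly configured servers carrying a similar load (made-up values).
loads = {"web1": 0.82, "web2": 0.85, "web3": 0.84}

# Every server is reported, although none differs meaningfully from the others.
print(static_alerts(loads, acceptable_level=0.80))
```

In this sketch every server in the pool is reported even though all of them behave alike, which is the kind of unnecessary alert that an inaccurate external level can produce.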
In view of the foregoing, there exists a need for a method, system and program product for detecting an operational risk of a node. Specifically, a need exists to detect an operational risk of a node by comparing the performance of the node to that of other, similarly (or identically) configured nodes. A further need exists for an operational risk to be detected if the performance of one node varies from the performances of the other nodes by more than a current tolerance.
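The kind of peer-relative comparison called for above might be sketched as follows. This is an illustrative assumption only, not the claimed method: the function name `peer_risks`, the use of the median of the other nodes as the reference value, and the sample loads and tolerance are all hypothetical.

```python
from statistics import median


def peer_risks(metric_by_node, tolerance):
    """Return the nodes whose metric differs from that of their peers
    by more than the given tolerance.

    Assumption: the peers are summarized by the median of the other
    nodes' values; other summaries could be used instead.
    """
    flagged = []
    for node, value in metric_by_node.items():
        peers = [v for n, v in metric_by_node.items() if n != node]
        if not peers:
            continue  # a single node has nothing to be compared against
        if abs(value - median(peers)) > tolerance:
            flagged.append(node)
    return flagged


# Uniformly busy pool: no node stands out, so nothing is reported.
busy = {"web1": 0.82, "web2": 0.85, "web3": 0.84}
print(peer_risks(busy, tolerance=0.10))

# One node deviates from its similarly configured peers and is reported.
skewed = {"web1": 0.30, "web2": 0.32, "web3": 0.31, "web4": 0.85}
print(peer_risks(skewed, tolerance=0.10))
```

With these sample values, the uniformly loaded pool yields no flagged nodes, while only the deviating node is flagged in the second pool. The median is used here because a single outlying peer has little effect on it; with very small pools any such summary becomes unreliable.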