A performance problem in a system component usually appears to the system operator as a poor response time for an application running in the system. This is because the application typically depends on many resources in the system for its execution, such as memory, storage switches, disk drives, and networks. For any one application there may be hundreds of different resources, each with the potential to cause performance problems by being unable to satisfy the demand placed on it. Over a whole system there may be many thousands of such interrelated entities.
Currently, programs may be set up to monitor the performance of the components separately. The results are gathered into a central tool for the operator's examination. A disadvantage of this approach is that it relies on the operator's understanding and experience as to how measurements and events from different components are related. With the scale of computer systems continuing to grow, it is very difficult for the operator to manage the performance of the systems and identify system problems accurately. Furthermore, the information for each component must generally be quantized into one of a few possible states.
U.S. Patent Application No. 2002/0083371A1 describes a method for monitoring the performance of a network which includes storing topology and logical relation information. The method attempts to help a user determine the main causes of network problems. A drawback of this approach is that it limits dependencies between components to the physical and logical topology of the system, where the logical topology is a subset of the physical topology. In a networked storage system there might be performance dependencies between components which are not directly connected in the physical topology. Another drawback of this method is that although the user may “drill down” to the source of a problem by navigating through “bad” states of the components, the observed problem and the actual cause might be connected through a chain of entities that are themselves not in a bad state.
U.S. Pat. No. 6,393,386 describes a system for monitoring complex distributed systems. The monitoring system builds a dynamic model of the monitored system and uses changes in the state of the monitored components to determine the entities affected by a change. It correlates events which might have been caused by the same problem to identify the source of the problem. As a result, the user of the system does not have the ability to investigate conditions which the system has not recognized as faulty or degraded. In addition, the system searches for reasons for a particular degradation or failure of a node in the system. Since only the nodes directly connected to the affected node are considered, this approach might lead to an incomplete analysis if the system model does not completely specify all relationships between the entities.
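The limitation of examining only directly connected nodes, or only chains of entities in a bad state, can be illustrated with a minimal sketch. The dependency graph, the entity names, and the functions `direct_suspects` and `transitive_suspects` below are hypothetical and are not taken from either cited reference; they only contrast a one-hop search with a search along the full dependency chain.

```python
from collections import deque

# Hypothetical dependency graph: each entity maps to the entities it depends on.
DEPENDS_ON = {
    "application": ["volume"],
    "volume": ["switch_port"],
    "switch_port": ["disk_array"],
    "disk_array": [],
}

# Hypothetical set of entities currently reported in a "bad" state.
# The intermediate entities (volume, switch_port) look healthy.
BAD = {"application", "disk_array"}


def direct_suspects(node):
    """Return bad entities among the direct dependencies of node only."""
    return [n for n in DEPENDS_ON.get(node, []) if n in BAD]


def transitive_suspects(node):
    """Return bad entities anywhere along the dependency chain of node."""
    seen = {node}
    queue = deque(DEPENDS_ON.get(node, []))
    found = []
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        if current in BAD:
            found.append(current)
        # Keep walking even through entities that are not in a bad state.
        queue.extend(DEPENDS_ON.get(current, []))
    return found
```

In this assumed example, `direct_suspects("application")` finds nothing because the only direct dependency appears healthy, whereas `transitive_suspects("application")` reaches `disk_array` through two entities that are themselves not in a bad state.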
U.S. Pat. No. 5,528,516 describes an apparatus and method for correlating events and reporting problems in a system of managed components. The invention includes a process of creating a causality matrix relating observable symptoms to likely problems. This process reduces the causality matrix into a minimal codebook, monitors the observable symptoms, and identifies the problems by comparing the observable symptoms against the minimal codebook using various best-fit approaches. However, in a complex networked storage system, there might be several causes for a single observed problem, and identifying these causes might require different approaches. In such a situation, a solution implemented by a completely automated sub-system, as described in U.S. Pat. No. 5,528,516, might not be the ideal one for the user.
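The general idea of matching observed symptoms against a codebook can be sketched as follows. This is only an illustration under stated assumptions, not the implementation described in U.S. Pat. No. 5,528,516: the problem names, the symptom names, the `CODEBOOK` contents, and the choice of Hamming distance as the best-fit measure are all hypothetical, and the codebook reduction step is omitted.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Hypothetical codebook: each problem maps to the symptoms it is expected to cause.
CODEBOOK: Dict[str, FrozenSet[str]] = {
    "disk_failure": frozenset({"io_timeout", "high_latency"}),
    "switch_congestion": frozenset({"high_latency", "packet_loss"}),
    "memory_pressure": frozenset({"high_latency", "swap_activity"}),
}


def hamming(a: FrozenSet[str], b: FrozenSet[str]) -> int:
    """Count the symptoms present in exactly one of the two sets."""
    return len(a ^ b)


def best_fit(observed: Iterable[str],
             codebook: Dict[str, FrozenSet[str]] = CODEBOOK) -> List[Tuple[str, int]]:
    """Return every problem whose symptom set is closest to the observed set."""
    observed_set = frozenset(observed)
    scored = [(problem, hamming(observed_set, symptoms))
              for problem, symptoms in codebook.items()]
    best = min(distance for _, distance in scored)
    # Several problems may tie, so all of them are returned, sorted by name.
    return sorted((p, d) for p, d in scored if d == best)
```

With this assumed codebook, observing both `io_timeout` and `high_latency` matches `disk_failure` exactly, while observing only `high_latency` leaves all three problems equally likely. The second case corresponds to the situation noted above, in which a single observed problem has several candidate causes and an automated match alone does not settle the question.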
Therefore, there remains a need for a system and method for managing the performance of a computer system that help the operator effectively track the performance of individual components and accurately identify problems affecting the system performance.