The present invention relates generally to system management, and more particularly to a system for managing faults in a distributed system.
A distributed system is difficult to manage due to complicated and dynamic component interdependencies. Managers are used in a distributed system and are responsible for obtaining information about the activities and current state of components within the system, making decisions according to an overall management policy, and performing control actions to change the behavior of the components. Generally, managers perform five functions within a distributed system, namely configuration, performance, accounting, security, and fault management.
None of these five functions is particularly suitable for diagnosing faults occurring in complex distributed systems. Diagnosing faults by manual management is time consuming and requires intimate knowledge of the distributed system. With other management techniques such as SNMP, faults are difficult to diagnose because relationships between components within the distributed system are not easily ascertained. Since the relationships are hard to ascertain, it is difficult to determine causes and effects, and thus to diagnose faults. Conventional expert systems have also been used to diagnose faults. However, conventional expert systems are too fragile, since their rules become inapplicable when the configuration of the distributed system changes. In addition, the conventional expert system is too general to enable autonomous control. For example, when an expert system attempts to analyze a distributed application, the analysis is complicated by the dynamic nature of the distributed system. Every time a process starts up, it is assigned a unique identification number that changes with each execution. Rules in the expert system that refer to the earlier identification number therefore no longer apply. Also, it is difficult to isolate faults in a distributed environment because a resource limitation on one system may cause a performance degradation in another system, and this dependency is not apparent unless one is very familiar with the architecture of the distributed application and how its components work together.
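The fragility of rules tied to process identification numbers may be illustrated by the following minimal sketch. It is offered for illustration only and does not represent any particular expert system; the names `ProcessObservation` and `pid_rule`, the process name, the identification numbers, and the 90 percent threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProcessObservation:
    """One monitoring sample for a process (hypothetical structure)."""
    pid: int            # assigned at start-up; differs on each execution
    name: str           # illustrative process name
    cpu_percent: float  # illustrative resource measurement


def pid_rule(obs: ProcessObservation,
             watched_pid: int = 4021,
             limit: float = 90.0) -> bool:
    """Fire when the process with one specific PID exceeds the CPU limit.

    The rule is written against the PID observed when the rule was
    authored, which is the assumption that makes it fragile.
    """
    return obs.pid == watched_pid and obs.cpu_percent > limit


# The same overloaded application, sampled before and after a restart.
first_run = ProcessObservation(pid=4021, name="order-service", cpu_percent=97.0)
after_restart = ProcessObservation(pid=5177, name="order-service", cpu_percent=97.0)

# The rule fires on the first run but silently stops matching once the
# process has restarted under a new identification number.
pid_rule_results = [pid_rule(first_run), pid_rule(after_restart)]
print(pid_rule_results)
```

In this sketch the fault condition is identical in both samples, yet the rule matches only the first, because the identification number it was written against no longer exists after the restart.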