The identification and tracking of dependencies between components of distributed systems is becoming increasingly important for integrated fault management, that is, problem determination, impact analysis and repair for a set of cooperating components or processes. Distributed systems can be represented as interacting service components, classified into multiple layers, where each layer provides services to the layers above. Many service components depend on other service components, so that a failure occurring in one service component affects other services and ultimately customer applications. A dependency exists when a first component requires a service performed by another component in order for the first component to execute its functions.
Dependencies can exist between the components of different services on a single system and also between the client and server components of a service distributed across multiple systems and network domains. Typically, dependencies exist between various components of a distributed system, such as end-user services, system services, applications and their logical and physical components.
Relatively frequent failures are a characteristic of most complex operational systems. Recently, attempts have been made to reduce systems' mean time to recovery (MTTR) once failures are detected. In order to reduce MTTR, it is necessary to be able to quickly determine the root cause of a problem that is detected at a higher level, and then to resolve the problem. Many problem determination applications use a component dependency graph to pinpoint the root cause.
For example, B. Gruschke, “Integrated Event Management: Event Correlation Using Dependency Graphs”, Proceedings of 9th IFIP/IEEE International workshop on Distributed Systems: Operations and Management (DSOM), 1998, discloses the use of a dependency graph for problem determination, mapping incoming alarms and events to nodes of the graph to identify dependent nodes which are a likely root cause of problems.
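The general idea of mapping alarms onto a dependency graph can be illustrated with a minimal sketch. This is not Gruschke's algorithm; it is a simplified heuristic written for illustration only. The component names, the `DEPENDS_ON` table and both function names are hypothetical. The sketch assumes that an alarmed component is a likely root cause when none of the components it depends on, directly or transitively, is itself alarmed.

```python
from collections import deque

# Hypothetical dependency graph: each key depends on the components in its list.
DEPENDS_ON = {
    "web_shop": ["app_server"],
    "app_server": ["database", "name_service"],
    "database": ["storage"],
    "name_service": [],
    "storage": [],
}


def reachable_dependencies(graph, start):
    """Return every component that `start` depends on, directly or transitively."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    seen.discard(start)
    return seen


def root_cause_candidates(graph, alarmed):
    """Return alarmed components that have no alarmed dependency beneath them.

    Illustrative heuristic (an assumption of this sketch): an alarm on a
    component whose own dependencies are all quiet is treated as a likely
    root cause; alarms above it are treated as consequences.
    """
    alarmed = set(alarmed)
    candidates = []
    for node in sorted(alarmed):
        if not (reachable_dependencies(graph, node) & alarmed):
            candidates.append(node)
    return candidates


if __name__ == "__main__":
    alarms = ["web_shop", "app_server", "database"]
    print(root_cause_candidates(DEPENDS_ON, alarms))
```

In this made-up example the alarms on the shop front end and the application server are explained by the alarm on the database, so only the database is reported. The sketch presupposes that the dependency graph is already known, which is exactly the information that is hard to obtain in practice.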
However, the discovery and recording of dependency information in a distributed system is a time-consuming and difficult task, since service components generally do not expose dependency information in a standard way. The lack of explicit dependency information makes the tasks of problem determination, isolation and resolution particularly difficult.
It is not acceptable to rely solely on dependency information within configuration files or machine-readable files provided by a software vendor, because the vendor's knowledge of dependencies may be limited and because static information within configuration files cannot provide a picture of dynamic, run-time dependencies. Emerging Web-based architectures allow the composition of applications at run-time, and an application running within a Web application server may be instantiated and then terminated within a few seconds.
Furthermore, known techniques for automatically determining dependency information rely on fairly invasive middleware instrumentation or internal instrumentation. One example is embedding code which responds to Application Response Measurement (ARM) API calls (implementing The Open Group's Technical Standard C807), which requires all components to implement the standard. In typical heterogeneous customer environments, which include a collection of hardware and software from different vendors, known approaches for instrumenting managed applications or objects to directly obtain dependency data are difficult and time-consuming to implement, and therefore costly. Such instrumentation approaches may even be unusable in heterogeneous environments and in systems with security, licensing or other technical constraints.
One approach to instrumenting the components of a managed system is disclosed by P. Hasselmeyer in “Managing Dynamic Service Dependencies”, 12th International Workshop on Distributed Systems: Operations and Management (DSOM), France, 2001. Dependencies are made accessible to management applications as properties attached to components. Dependency data is supplied directly by the component having the dependency. The components can be polled or dependency change notifications can be generated at run-time.
A technique for determining dependency information is described by S. Bagchi, G. Kar and J. Hellerstein in “Dependency Analysis in Distributed Systems using Fault Injection: Application to Problem Determination in an e-commerce Environment”, 12th International Workshop on Distributed Systems: Operations and Management (DSOM), France, 2001. The described technique involves injecting a fault into a system during a testing phase and collecting measurements of the external behaviour of the system. The measurements are then analyzed, and statistical regression analysis is used to determine the extent to which a quality of service metric depends on specific components. However, fault injection and other “perturbation” techniques rely on the ability to insert controlled perturbations that are effective in discovering dependencies. Furthermore, typical fault injection techniques are limited to a testing phase because injecting faults is generally unacceptable at run-time.
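The regression step of such a perturbation approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Bagchi et al.: it fits an ordinary least-squares line relating a hypothetical fault intensity injected into one component to a measured response time, and reads the slope and the coefficient of determination as indicators of dependency strength. The function name, the component names and all measurement values are invented for the example.

```python
def least_squares(xs, ys):
    """Fit y = intercept + slope * x and return (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # r_squared: fraction of the variation in y explained by the fitted line.
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r_squared


# Made-up test-phase measurements (assumption): fault intensity injected
# into one component, and the end-to-end response time observed in ms.
fault_level = [0, 1, 2, 3, 4]
response_when_perturbing = {
    "database": [100, 150, 200, 250, 300],  # response time tracks the fault
    "cache": [100, 101, 99, 100, 100],      # response time barely moves
}

if __name__ == "__main__":
    for component, response_ms in response_when_perturbing.items():
        slope, _, r2 = least_squares(fault_level, response_ms)
        print(f"{component}: slope={slope:.2f} ms per level, R^2={r2:.2f}")
```

With these invented numbers the response time rises steadily as the database is perturbed and stays essentially flat as the cache is perturbed, which would suggest a strong dependency on the former and a weak one on the latter. The sketch also makes the limitation visible: the result is only as informative as the perturbations that could be injected.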
C. Ensel, “Automated generation of dependency models for service management”, Workshop of the Open View University Association (OVUA), 1999, suggests that generation of service dependency models may be automated using a Neural Network and information collected at run-time. Information such as, for example, CPU usage of an application is taken from lower layers such as the operating system, middleware or transport system. According to Ensel, a time series of objects' activities may then be fed into a Neural Network to judge whether the objects appear to be related. Ensel states that a complex training process is required, but does not describe the training process or how the Neural Network would determine dependencies.
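Because Ensel does not describe the Neural Network or its training, no attempt is made here to reproduce it. Purely to illustrate what judging relatedness from activity time series might look like, the sketch below substitutes a much simpler technique, the Pearson correlation coefficient with a fixed threshold. The function names, the threshold of 0.8 and the CPU usage samples are all assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no co-variation signal
    return cov / math.sqrt(var_x * var_y)


def appear_related(xs, ys, threshold=0.8):
    """Crude stand-in for a learned classifier: threshold on |correlation|."""
    return abs(pearson(xs, ys)) >= threshold


# Made-up CPU usage samples (percent) taken at the same instants.
app_cpu = [5, 40, 42, 6, 38, 4]
db_cpu = [3, 30, 33, 4, 29, 2]
backup_cpu = [20, 21, 19, 20, 22, 20]

if __name__ == "__main__":
    print("app/db related:", appear_related(app_cpu, db_cpu))
    print("app/backup related:", appear_related(app_cpu, backup_cpu))
```

Correlated activity does not by itself establish that one object depends on the other, and a fixed threshold is a far weaker decision rule than a trained model, so this is an illustration of the input data only.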