The identification and tracking of dependencies between the components of distributed systems is becoming increasingly important for integrated fault management. Applications, services and their components rely on a variety of supporting services that might be outsourced to a service provider. Moreover, emerging web-based (world wide web-based) business architectures allow the composition of web-based e-business (electronic business) applications at runtime.
It is to be understood that the term “runtime” generally refers to the time period when a piece of software is being executed and active in a computer system's memory, as opposed to being dormant and merely sitting in storage on a computer's hard drive. Thus, being able to compose e-business applications at runtime means having the capability to do so without the need to bring down and restart the system/application and without the need to recompile the application. Traditionally, the lifecycle of a computer program is: write program code -> compile (translate into machine code) -> run. Thus, with the above capability, one can assemble several pieces of software to form a new application “on-the-fly,” i.e., without the need to bring down/compile/restart the application.
Consequently, failures occurring in one service can affect other services being offered to a customer, i.e., services have dependencies on other services. Dependencies exist between the components of different services on a single system and also between the client and server components of a service across multiple systems and domains. Herein, services that depend on other services are referred to as dependents, while services on which other services depend are referred to as antecedents.
It is important to note that a service often plays both roles (e.g., a name service is required by many applications and services but depends, itself, on the proper functioning of other services, such as the operating system and the network protocols and infrastructure). Furthermore, dependency relationships are transitive, i.e., the dependent of a given component requires, in addition to the component itself, the component's antecedent(s).
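By way of illustration only, the transitivity of dependency relationships may be sketched as follows. The sketch assumes a hypothetical in-memory map from each dependent to its direct antecedents; the service names and the function `all_antecedents` are illustrative assumptions and do not correspond to any existing system or tool.

```python
from collections import deque

# Hypothetical dependency map: each dependent maps to its direct antecedents.
# The name service is both an antecedent (of the web application) and a
# dependent (of the operating system and the network infrastructure).
DIRECT_ANTECEDENTS = {
    "web_application": ["name_service"],
    "name_service": ["operating_system", "network_infrastructure"],
    "operating_system": [],
    "network_infrastructure": [],
}


def all_antecedents(service, direct=DIRECT_ANTECEDENTS):
    """Return every service that `service` requires, directly or transitively."""
    seen = set()
    queue = deque(direct.get(service, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against cyclic entries
        seen.add(current)
        queue.extend(direct.get(current, []))
    return seen
```

In this sketch, the antecedents of the web application include not only the name service but also the operating system and the network infrastructure on which the name service itself depends.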
Dependencies exist between various components of a distributed system, such as end-user services, system services, applications and their logical and physical components. However, service dependencies are not made explicit in today's systems, thus making the task of problem determination, isolation and resolution particularly difficult.
Existing art in the areas of software development (such as U.S. Pat. No. 4,751,635 and U.S. Pat. No. 5,960,196), maintenance (such as U.S. Pat. No. 5,493,682) and software packaging (such as U.S. Pat. No. 5,835,777) deals with individual software elements and modules that form the atomic parts of a program package, and requires the availability of program source code in order to build software and bundle it into software products. Source code is available to the software developer and not to the service user. The invention primarily focuses on software products that are already packaged.
The Institute of Electrical and Electronics Engineers Standard 1387.2 (entitled “Portable Operating System Interface (POSIX) system administration, part 2: Software Administration,” IEEE, 1995) addresses software distribution/deployment/installation. The IEEE standard defines a mechanism for ensuring that new software components (which are going to be installed) do not conflict with an already existing software installation. The IEEE standard identifies three kinds of relationships (prerequisite, exrequisite and corequisite) that facilitate such compatibility checks. This is done individually for every system on which new software needs to be installed. With the IEEE standard, the software inventories present on other systems are not taken into account. Furthermore, the IEEE standard does not deal with instantiated applications and services and therefore does not represent any means of determining the dependencies between components at runtime.
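By way of illustration only, a compatibility check of this kind may be sketched as follows. The sketch rests on a simplified reading of the three relationships, which is an assumption rather than a restatement of the standard: a prerequisite must already be installed, a corequisite must be installed or selected for installation at the same time, and an exrequisite must not be present. The `Package` class, the `check_install` function and all package names are hypothetical and are not part of IEEE 1387.2 or of any installation tool.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    """Hypothetical package record carrying the three relationship kinds."""
    name: str
    prerequisites: set = field(default_factory=set)
    corequisites: set = field(default_factory=set)
    exrequisites: set = field(default_factory=set)


def check_install(package, installed, selected=frozenset()):
    """Check `package` against the inventory of ONE system.

    `installed` is the set of package names already on that system and
    `selected` is the set of names chosen for the same installation run.
    Returns a list of problem descriptions; an empty list means compatible.
    """
    problems = []
    for name in sorted(package.prerequisites - installed):
        problems.append(f"missing prerequisite: {name}")
    for name in sorted(package.corequisites - installed - selected):
        problems.append(f"missing corequisite: {name}")
    for name in sorted(package.exrequisites & (installed | selected)):
        problems.append(f"conflicting exrequisite present: {name}")
    return problems
```

Note that the check consults only the inventory of the single system passed to it, and only at installation time, which mirrors the limitations discussed above.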
Open Group (Systems Management: Distributed Software Administration, CAE Specification C701, The Open Group, January 1998) extends IEEE 1387.2 by defining several commands (swinstall, swlist, swmodify, etc.) that are invoked by software installation tools on a specific system. Open Group also defines a software definition file format to make sure that the information required by the aforementioned commands is available from the system on which the commands are invoked. The shortcomings of IEEE 1387.2 (i.e., confined to a single isolated system, no means for determining software dependencies at runtime) also apply to the Open Group specification.
Current operating system inventory implementations (such as the IBM AIX Object Data Manager (ODM), the Linux Red Hat Package Manager (RPM) or the Microsoft Windows Registry) either follow the Open Group specification and the IEEE 1387.2 standard or describe the software inventory in a proprietary format. Thus, the aforementioned limitations also apply to such operating system inventory implementations.
Techniques for electronic software distribution of whole program packages (such as U.S. Pat. No. 6,009,525 and U.S. Pat. No. 5,721,824) or updates/corrections/fixes/patches (such as U.S. Pat. No. 5,999,740, U.S. Pat. No. 5,805,891, and U.S. Pat. No. 5,953,533) are, by definition, restricted to the distribution/deployment/installation of (one or many at a time) physical software packages and do not take the runtime stages of applications into account. In addition, they deal with one system at a time and do not take the cross-system aspects of applications and services into account.
Techniques for determining conflicts in existing software/hardware configurations (such as U.S. Pat. No. 5,867,714) are also confined to a single system and do not take runtime aspects into account.
While existing work (such as U.S. Pat. No. 5,917,831), often within the scope of event correlation (see, e.g., Gruschke et al., “Integrated Event Management: Event Correlation Using Dependency Graphs,” DSOM '98, 1998, and Kätker et al., “Fault Isolation and Event Correlation for Integrated Fault Management,” IM '97, 1997), has focused on identifying and describing service dependencies in a proprietary format, it has remained unclear how dependency information can actually be exchanged between different entities of the fault management process. Since it is unlikely that the different parties involved in the fault management process of outsourced applications use the same toolset for tracking dependencies, it is of fundamental importance to define an open format for specifying and exchanging dependency information.
Also, due to the heterogeneity associated with components of the distributed system with which the fault management process is involved, determining the root cause of a system failure (e.g., service outage) is extremely difficult, given the limitations of existing techniques.
To sum up, a few techniques relating to the determination of relationships between software products have been described and implemented in the existing art. These existing techniques suffer from one or more of the following shortcomings:
(a) they address only the installation and deployment phases of a software product; i.e., they do not attempt to capture the design and runtime aspects;
(b) they do not deal with end-to-end applications and services that span multiple systems; i.e., they address the characteristics of software residing on a single, isolated system;
(c) software inventory information is described in a proprietary format that makes it extremely difficult to share this information among various heterogeneous systems; and
(d) they do not effectively identify the root cause of a service outage.