The present invention is generally directed to diagnosing software problems in data processing systems. More particularly, the present invention is directed to a system and method for consistent and extendable management of diagnostic probes. Even more particularly, the present invention is directed to a system of independently constructable diagnostic probes. And even more particularly, the present invention is directed to the construction and utilization of diagnostic software probes which are capable of diagnosing problems within a plurality of system software levels. Moreover, the present invention is particularly useful in clustered data processing systems which generally possess more complex software in a distributed hierarchical arrangement.
In the context of the present invention, a diagnostic probe is a relatively small, stand-alone program that provides direct diagnostic functionality for a specific software or hardware component in a data processing system. Each probe is capable of codifying a specific part of an expert's debugging knowledge.
In cluster systems management software, many components and daemons run on many machines (nodes), and these components are designed so that, in normal operation, they communicate properly and use correct data in order for the cluster to run properly. However, it is difficult to guarantee that all of these various components can automatically recover from communication and data integrity problems. Thus, there may be times when some portion of the cluster stops functioning properly. Making the problem worse, it is usually very difficult for the customer to diagnose the root cause of these problems because of the complexity of the components and the various interactions which are designed into the system to ensure that the components work together efficiently, consistently and harmoniously. This complexity is compounded by the fact that software is often configured in a hierarchy of levels and dependencies. A problem at a low level may manifest itself at a higher level, but diagnosis at the higher level may not provide any clues as to the nature of the dysfunction.
The diagnostic probe manager system of the present invention assists customers in diagnosing software problems in the cluster. The invention includes a probe manager and a plurality of probes. Each probe preferably checks only one system component to verify that it is functioning properly and that it has appropriate data. In addition, each probe returns an indication of the other probes on which it depends. This indication usually identifies the probes of other, possibly related components that should be working properly in order for this probe's component to work. The probe manager queries all of the registered probes for their specific dependencies. The probe manager uses this information to build a dependency graph so that it can run the probes in order from the lowest software layer to the highest layer. This increases the chances of finding the root cause of the problem, instead of merely finding downstream effects. When a probe finds a problem, it displays the problem (and usually a corrective action) to the user, and the default action of the probe manager is to stop. It is noted that the operation of the diagnostic probes herein does not necessarily have to take place because of, or be driven by, the occurrence of a problem or fault. The probe manager is capable of initiating probe activity on its own, based on a number of criteria including scheduled maintenance intervals. Furthermore, the probe manager is aware of the fact that certain portions of the data processing system and its related software are more important than others. Accordingly, probes are supplied that examine many of the critical aspects of the operating system as well as many components of cluster systems management software, particularly those that are known to have greater significance in maintaining system operations.
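By way of illustration only, the dependency-ordered execution described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not the actual implementation: the names Probe, run_probes, depends_on and stop_on_first_problem, as well as the three example probes, are hypothetical, and a standard topological sort is assumed as one possible way of ordering the dependency graph from the lowest layer to the highest.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, List, Optional


@dataclass
class Probe:
    """Hypothetical probe record: one component, one check, declared dependencies."""
    name: str
    # Returns None when the component is healthy, otherwise a problem description
    # (ideally including a corrective action).
    check: Callable[[], Optional[str]]
    depends_on: List[str] = field(default_factory=list)


def run_probes(probes, stop_on_first_problem=True):
    """Run probes lowest layer first; by default stop at the first problem found."""
    by_name = {p.name: p for p in probes}
    for p in probes:
        for dep in p.depends_on:
            if dep not in by_name:
                raise ValueError(
                    f"probe {p.name!r} depends on unregistered probe {dep!r}")

    # Map each probe to the set of probes that must run before it.
    graph = {p.name: set(p.depends_on) for p in probes}
    # static_order() yields dependencies before dependents and raises
    # graphlib.CycleError if the declared dependencies are circular.
    order = list(TopologicalSorter(graph).static_order())

    findings = []
    for name in order:
        problem = by_name[name].check()
        if problem is not None:
            findings.append((name, problem))
            if stop_on_first_problem:
                break
    return order, findings


# Hypothetical three-layer stack in which the middle layer is broken.
probes = [
    Probe("cluster_services",
          lambda: "cluster services cannot reach the database",
          ["database"]),
    Probe("database",
          lambda: "database daemon is not running; restart it",
          ["network"]),
    Probe("network", lambda: None),
]
order, findings = run_probes(probes)
print(order)
print(findings)
```

In this sketch only the database problem is reported, because the run stops before the higher-level cluster_services probe is reached; the downstream symptom is thereby kept from obscuring the root cause.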
The real value of this diagnostic probe manager subsystem is that the software vendor (in this case International Business Machines, Inc., the assignee of the present invention) is better able to codify its expertise in diagnosing the software, thus contributing to an accumulation of knowledge relevant to how all the components fit together, what things typically go wrong, and the order in which things should be examined. It is like having the smartest developer of the software come to your site and sit down and start looking at the pieces of the software in the most logical order, checking for all the things the software developer has seen go wrong until the problem is found.
Most currently available software diagnostic tools are either structured as a single program or they include a set of hard-coded tools that try to diagnose system problems. These diagnostic tools typically diagnose the operating system of only a single personal computer or workstation. In a data processing system which includes a plurality of independent nodes operating and intercommunicating in a clustered environment, the situation is much more complex. Not only can things go wrong at the operating system level, but the whole cluster software stack can have problems, and multiple machines are involved. As used herein the phrase “software stack” refers to a collection of programs which run below the level of application programs and which exist in a hierarchical arrangement of operational and data dependencies. To diagnose a system as complex as this, a flexible, extensible, easy-to-develop solution is most desirable.
The architecture of the probe system allows each probe to be developed individually, by separate people. In the typical situation, an expert on a particular component develops the probe for that component. Several utilities are provided for implementing probes so that the probe developer can concentrate on just the things that can go wrong with that component. Because the probes on which a given probe depends are executed first, the scope of what can go wrong with a component is limited to things specific to that component. In contrast, a monolithic diagnostic program responsible for checking the whole software stack quickly becomes so complicated that component experts usually can't develop it. Instead, it usually requires developers who are dedicated to working on the diagnostic tool. However, the probe architecture of the present invention allows development of probes in a decentralized fashion.
Another issue with diagnostic tools is coverage. A diagnostic tool is most useful if it catches a high percentage of users' problems. In the present invention the dependency processing feature and the separation of probes allow additional probes to be added over time; in this manner, the coverage is increased and newly added software components are provided with diagnostic coverage. Additionally, checks for newly discovered problems are easily added to the diagnostic probes.
Another important feature of the architecture of the present probe subsystem is that it can be extended by customers. The probe utilities and the Application Program Interface (API) between the probe manager and the probes permit customers to add their own probes. This allows customers to diagnose applications that they run on top of the clustering software and also allows them to check for errors that they have encountered that supplied probes don't yet catch.
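By way of illustration only, customer extension of the probe set may be sketched as follows. The specification does not detail the API at this point, so everything here is an assumption made for illustration: the ProbeRegistry class, its register, names, dependencies and run methods, and the example probes cluster_services and billing_app are all hypothetical. The sketch shows only that a customer-written probe can be registered in the same way as a supplied probe and can declare a dependency on one.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A check returns None when healthy, otherwise a problem description.
CheckFn = Callable[[], Optional[str]]


class ProbeRegistry:
    """Hypothetical registry through which supplied and customer probes are added."""

    def __init__(self) -> None:
        self._probes: Dict[str, Tuple[CheckFn, List[str]]] = {}

    def register(self, name: str, depends_on=()):
        """Decorator that records a check function and its declared dependencies."""
        def decorator(fn: CheckFn) -> CheckFn:
            if name in self._probes:
                raise ValueError(f"probe {name!r} is already registered")
            self._probes[name] = (fn, list(depends_on))
            return fn
        return decorator

    def names(self) -> List[str]:
        return list(self._probes)

    def dependencies(self, name: str) -> List[str]:
        # What a probe manager would query before building its dependency graph.
        return list(self._probes[name][1])

    def run(self, name: str) -> Optional[str]:
        return self._probes[name][0]()


registry = ProbeRegistry()


# Stand-in for a vendor-supplied probe.
@registry.register("cluster_services")
def check_cluster_services() -> Optional[str]:
    return None  # healthy in this example


# Customer-written probe for an application running on top of the cluster software.
@registry.register("billing_app", depends_on=["cluster_services"])
def check_billing_app() -> Optional[str]:
    return "billing_app: queue directory is missing; recreate it and restart"
```

Because the customer probe declares its dependency on the supplied probe, a probe manager that orders probes by dependency would examine the cluster software before the customer application.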