1. Field Of The Invention
The present invention relates to a method and system for providing accurate connectivity and diagnostic information for computer systems, and more particularly, to a system and method for identifying failing components along paths within a computer network by using a minimum amount of information and a simple analysis.
2. Description Of The Prior Art
There is a long commercial history of LAN (Local Area Network) and TCP/IP path verification using periodic path tests or on-demand path tests. The simplest such test is known as a “ping”. Ping refers to any of several low-level network functions that return a testable result in response to a request made to an end node (a component, such as a processor or Ethernet board, that has a T NET address). Probe based testing is conceptually simple to do. More importantly, a ping uses the network components in the same way they are normally used. If a ping is successful, there is considerable confidence that the path is working. Finally, because probe based testing is active, redundant paths that happen to be inactive are tested along with active paths, allowing latent faults to be detected soon after they occur rather than when the path is needed. This early detection allows early repair, thereby significantly reducing the risk of a system failure.
One disadvantage of probe based testing is that the only way to detect errors is to generate probes and observe the results. The probe based testing must be run often enough to ensure that failures are identified quickly. Often, there are many events available to indicate a potential problem. These events, plus a periodic event and a manual request, all trigger the same set of test probes.
A second disadvantage of probe based testing is that complete coverage of a large network can logically require N*N tests, where N is the number of end nodes that need to be tested. For example, N can be as large as 64. Without some form of path reduction, probe testing would then require 64*64 or 4096 tests. The time required for this much testing is very large.
During these tests, identification of a fault is generally easy: a failing ping normally shows up as a timeout. However, a single path traverses many components; therefore, a failing path does not by itself point to any particular repair action.
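The growth in test count can be illustrated with a short calculation. The sketch below is only an illustration of the N*N figure quoted above; the function name `full_mesh_test_count` is hypothetical and does not come from the described system.

```python
def full_mesh_test_count(n_end_nodes: int) -> int:
    """Number of probes needed if every end node tests every end node (N*N).

    Assumption: no path reduction of any kind is applied.
    """
    return n_end_nodes * n_end_nodes


# Test counts for a few network sizes, up to the 64-node case from the text.
for n in (8, 16, 32, 64):
    print(n, full_mesh_test_count(n))
```

Doubling the number of end nodes quadruples the number of probes, which is why some form of path reduction is needed for larger networks.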
If any one of the elements fails, many path tests will fail. The larger the system or network, the larger the number of tests that fail. Thus, the challenge is using the failing path information to identify the failing component. This is complicated by the need to identify multiple faults that occur at one time. Path analysis based on the assumption of a single failure is generally risky because there is often more than one fault within the system or network.
One technique for failure isolation is to run diagnostics on each component in each failing path to identify the failure. However, this technique may be very lengthy and complex.
Another technique for failure isolation is to implement a set of rules within an expert system. However, this is generally a significant development task and one that must be maintained frequently. Furthermore, many systems and networks have little or no verification coverage of components that often include significant network functionality. In some instances, fault coverage of components has included building error detection logic into each of the components and then generating an event for any errors detected. However, implementing such an approach would require generating new error information or reading internal status registers to determine if an error had occurred. While such an approach may be effective, it is very expensive in time and components and is also not helpful for monitoring unused redundant paths.
The present invention provides an efficient method and system for providing accurate connectivity and diagnostic information for network components.
In accordance with one aspect of the present invention, a method of identifying failing components within a computer system or network includes identifying signal paths of components within the system as a set of signal paths. The method further includes eliminating never-used signal paths from the set and eliminating common sub paths from the set. The remaining sub paths are grouped into outbound paths and return paths, and a test signal is sent over each outbound path and a corresponding return path.
Pass/fail information is expressed in terms of components that have been traversed by the test signals and likely failing components are identified in a priority order.
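One way to picture the step of expressing pass/fail information in terms of traversed components is a per-component tally. The following is a minimal sketch under stated assumptions, not the claimed implementation: the function `tally_by_component`, the table layout, and the component names (`nodeA`, `cable1`, `router1`, and so on) are all hypothetical.

```python
from collections import defaultdict


def tally_by_component(test_results):
    """Convert per-path results into per-component counts.

    test_results: iterable of (components_traversed, passed) pairs, where
    components_traversed lists every component the test signal crossed.
    Returns {component: {"tests": total tests, "failed": failed tests}}.
    """
    table = defaultdict(lambda: {"tests": 0, "failed": 0})
    for components, passed in test_results:
        # set() so a component crossed twice by one test is counted once.
        for comp in set(components):
            table[comp]["tests"] += 1
            if not passed:
                table[comp]["failed"] += 1
    return dict(table)


# Hypothetical results for three path tests through one router.
results = [
    (["nodeA", "cable1", "router1", "cable2", "nodeB"], True),
    (["nodeA", "cable1", "router1", "cable3", "nodeC"], False),
    (["nodeB", "cable2", "router1", "cable3", "nodeC"], False),
]
component_table = tally_by_component(results)
print(component_table["cable3"])   # crossed only by failing tests
print(component_table["router1"])  # crossed by passing and failing tests
```

In this made-up example, `cable3` and `nodeC` appear only in failing tests, while `router1` also appears in a passing test, which is the kind of distinction the priority ordering relies on.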
In accordance with another aspect of the present invention, the method orders the likely failing components by ordering a component table by the percentage of tests that failed. This provides for very effective failing component identification.
In accordance with a further aspect of the present invention, the method orders the likely failing components by sub-ordering the component table by the total number of tests.
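The two ordering steps described above can be sketched as a single sort with a compound key. This is an illustrative sketch only. The text does not state the direction of the sub-ordering, so the code assumes that components with more total tests rank higher among equal failure percentages; that direction, the function name `rank_components`, and the sample table are assumptions.

```python
def rank_components(table):
    """Order components from most to least likely to be failing.

    table: {component: {"tests": int, "failed": int}}
    Primary key: fraction of tests that failed, highest first.
    Secondary key: total number of tests, highest first (an assumption;
    the source does not give the direction). Name breaks remaining ties.
    """
    def sort_key(item):
        name, counts = item
        tests = counts["tests"]
        failed_fraction = counts["failed"] / tests if tests else 0.0
        return (-failed_fraction, -tests, name)

    return [name for name, _ in sorted(table.items(), key=sort_key)]


# Hypothetical component table (counts of total and failed tests).
table = {
    "nodeA": {"tests": 2, "failed": 1},
    "cable1": {"tests": 2, "failed": 1},
    "router1": {"tests": 3, "failed": 2},
    "cable3": {"tests": 2, "failed": 2},
    "nodeC": {"tests": 2, "failed": 2},
}
ranking = rank_components(table)
print(ranking)
```

With these made-up counts, the components that failed every test (`cable3`, `nodeC`) come first, followed by `router1` at two failures in three tests, then the components at one failure in two.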
In accordance with yet another aspect of the present invention, a further sub-ordering of the likely failing components uses component-specific knowledge to thereby produce effective fault isolation. In particular, cables within the system with less than 100% failure rates for all the tests are eliminated.
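The cable rule can be sketched as a filter over an already ordered suspect list. This is a minimal sketch under assumptions: the function `drop_partially_failing_cables`, the `kinds` mapping that labels each component, and the sample data are hypothetical. One reading of the rule, not stated in the text, is that a faulty cable would be expected to fail every test that crosses it.

```python
def drop_partially_failing_cables(ranked, table, kinds):
    """Remove cables whose failure rate is below 100% from the suspects.

    ranked: component names in priority order.
    table:  {component: {"tests": int, "failed": int}}
    kinds:  {component: kind string}, e.g. "cable" (hypothetical labels).
    Non-cable components are kept regardless of their failure rate.
    """
    kept = []
    for name in ranked:
        counts = table[name]
        is_cable = kinds.get(name) == "cable"
        if is_cable and counts["failed"] < counts["tests"]:
            continue  # cable passed at least one test: eliminate it
        kept.append(name)
    return kept


# Hypothetical data: one cable failed every test, two did not.
table = {
    "cable3": {"tests": 2, "failed": 2},
    "nodeC": {"tests": 2, "failed": 2},
    "router1": {"tests": 3, "failed": 2},
    "cable1": {"tests": 2, "failed": 1},
    "cable2": {"tests": 2, "failed": 1},
}
kinds = {"cable1": "cable", "cable2": "cable", "cable3": "cable",
         "router1": "router", "nodeC": "end_node"}
ranked = ["cable3", "nodeC", "router1", "cable1", "cable2"]
suspects = drop_partially_failing_cables(ranked, table, kinds)
print(suspects)
```

Here `cable1` and `cable2` each passed one test and are dropped, while `router1` stays on the list even though it also passed a test, because the rule is applied only to cables.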
In accordance with a further aspect of the present invention, the method includes performing additional testing of the components that have been identified as having a high priority as a likely failing component.
In accordance with a further aspect of the present invention, diagnostics are performed on the few parts that are further identified as likely failing components.
In accordance with another aspect of the present invention, additional test signals are sent over identified paths to resolve ambiguities with respect to components that are identified as likely failing components.