The present invention relates to monitoring, detecting, and isolating failures in a system, and in particular to tools, such as a model of the system, applied for analyzing the system.
Quick and easy determination of failure causes (i.e. failure diagnosis) is a key requirement for providing services like time-to-fix contracts to information technology (IT) departments.
In current solutions, a specially trained engineer is typically requested to go on-site in case a (customer's) system breaks down. There, s/he may then use software tools to search for the root cause of the system crash. This software is typically a collection of test programs testing parts of the system under test (SUT). The engineer selects a couple of tests based on experience from previous cases or may choose to run a complete test suite. This requires a reboot of the SUT, thus reducing system up-time from the customer's perspective. In addition, this approach requires the SUT to be functional to a certain degree, so that a minimal operating system, like DOS, is bootable. Otherwise, the engineer is left to his or her experience.
This conventional approach has several drawbacks. First, it is a manual process. Simple test suites could be defined, but detailed testing is only done on sub-systems that the engineer suspects may cause the problem. Secondly, the SUT has to be rebooted to run the tests. In cases where the system has regained a productive state, this lowers system uptime. Thirdly, conventional test suites check for a list of potential failure causes. This implies that failure causes unknown to the test suite will never be detected. Expert systems show a way out of these problems.
Expert systems have been used for diagnosing computer failures, as described e.g. by J. A. Kavicky and G. D. Kraft in "An expert system for diagnosing and maintaining the AT&T 3B4000 computer: an architectural description", ACM, 1989. Analysis of data from on-bus diagnosis hardware is described in Fitzgerald, G. L., "Enhance computer fault isolation with a history memory," IEEE, 1980. Fault-tolerant computers have for many years been built with redundant processing and memory elements, data pathways, and built-in monitoring capabilities for determining when to switch off a failing unit and switch to a good, redundant unit (cf. e.g. U.S. Pat. No. 5,099,485).
Prior diagnostic systems for determining likely failed components in an SUT include model-based diagnostic systems. A model-based diagnostic system may be defined as a diagnostic system that renders conclusions about the state of the SUT using actual SUT responses from applied tests and an appropriate model of correct or incorrect SUT behavior as inputs to the diagnostic system. Such a diagnostic system is usually based upon computer generated models of the SUT and its components and the diagnostic process.
It is usually desirable to employ a model-based diagnostic system that is based upon a more manageable model of SUT characteristics. Such a model-based diagnostic system usually minimizes the amount of modeling information for an SUT that must be generated by a user before the system can be applied to the SUT. Such modeling usually speeds the process of adapting the diagnostic system to differing SUTs and increases confidence in the determinations rendered by the diagnostic system.
Model-based diagnostic systems are known e.g. from W. Hamscher, L. Console, J. de Kleer, in 'Readings in system model-based diagnosis', Morgan Kaufmann, 1992. A test-based system model is used by the Hewlett-Packard HP Fault Detective (HPFD) and described in HP Fault Detective User's Guide, Hewlett-Packard Co., 1996.
U.S. Pat. No. 5,808,919 (Preist et al.) discloses a model-based diagnostics system, based on functional tests, in which the modeling burden is greatly reduced. The model disclosed in Preist et al. employs a list of functional tests, a list of components exercised by each functional test along with the degree to which each component is exercised by each functional test, and the historical or estimated a priori failure rate for individual components. Such model data may be rapidly and easily determined or estimated by test engineers, test programmers or others familiar with, but not necessarily expert on, the device under test. Typically, test engineers may develop the models in a few days to a few weeks depending on the complexity of the device under test.
U.S. Pat. No. 5,922,079 (Booth et al.) discloses an automated analysis and troubleshooting system that identifies potential problems with the test suite (ability of the model to detect and discriminate among potential faults), and also identifies probable modeling errors based on incorrect diagnoses.
EP-A-887733 (Kanevsky et al.) discloses a model-based diagnostic system that provides automated tools that enable a selection of one or more next tests to apply to a device under test from among the tests not yet applied based upon a manageable model of the device under test.
In the above three model-based diagnostic systems, a diagnostic engine combines the system-model-based and probabilistic approaches to diagnostics. It takes the results of a suite of tests and computes, based on the system model of the SUT, the components most likely to have failed.
The diagnostic engine can be used with applications where a failing device is to be debugged using a predetermined set of test and measurement equipment to perform tests from a predesigned set of tests. A test represents a procedure performed on the SUT. A test has a number of possible outcomes.
The tests can be defined to have only two outcomes: pass or fail. For the purpose of this invention, devices or components shall be regarded as either "good" or "bad" and tests shall either "pass" or "fail". In an example, a test for repairing a computer may involve checking to see if a power supply voltage is between 4.9 and 5.1 volts. If it is, the test passes. If it is not, the test fails. The set of all tests available for debugging a particular SUT shall be called that SUT's test suite.
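Purely by way of illustration, the two-outcome power supply test of the above example could be sketched as follows. The function name `power_supply_test` and its parameters are assumptions made for this sketch and are not part of any of the cited systems.

```python
def power_supply_test(measured_volts, low=4.9, high=5.1):
    """Two-outcome test: 'pass' if the voltage lies within [low, high], else 'fail'."""
    return "pass" if low <= measured_volts <= high else "fail"


# Hypothetical readings taken from two different power supplies.
print(power_supply_test(5.02))
print(power_supply_test(4.70))
```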
Using test results received from actual tests executed on the SUT and the system model determined for the SUT, the diagnostic engine computes a list of fault candidates for the components of the SUT. Starting, e.g., from a priori failure probabilities of the components, these probabilities are then weighted with the model information according to whether each test passes or fails. At least one test has to fail, otherwise the SUT is assumed to be good.
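The weighting described above may be sketched, in a strongly simplified form, as follows. This is not the diagnostic engine of any of the cited documents. The sketch assumes that exactly one component is bad, that tests are independent, that a test never fails on account of a good component, and that the coverage equals the probability that a test fails given that the component is bad. All names (`rank_fault_candidates`, `priors`, `coverage`, `results`) are hypothetical.

```python
def rank_fault_candidates(priors, coverage, results):
    """Rank components by normalized (prior x likelihood of the observed results).

    priors:   component -> a priori failure probability
    coverage: test -> {component: coverage in [0, 1]}
    results:  test -> "pass" or "fail"
    """
    # At least one test has to fail, otherwise the SUT is assumed to be good.
    if "fail" not in results.values():
        return []

    scores = {}
    for component, prior in priors.items():
        likelihood = 1.0
        for test, outcome in results.items():
            cov = coverage.get(test, {}).get(component, 0.0)
            # A bad component makes a test fail with probability `cov`.
            likelihood *= cov if outcome == "fail" else (1.0 - cov)
        scores[component] = prior * likelihood

    total = sum(scores.values())
    if total == 0.0:
        return []  # no single component explains the observed results
    ranked = [(c, s / total) for c, s in scores.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Illustrative numbers only.
priors = {"memory": 0.01, "memory_bus": 0.01, "controller": 0.01}
coverage = {
    "memory_access": {"memory": 0.3, "memory_bus": 0.9, "controller": 0.5},
    "controller_selftest": {"controller": 0.8},
}
results = {"memory_access": "fail", "controller_selftest": "pass"}
for component, probability in rank_fault_candidates(priors, coverage, results):
    print(f"{component}: {probability:.3f}")
```

In this sketch a passing test lowers the weight of every component it covers, while a failing test raises the weight of the components it covers most heavily.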
In all the known model-based diagnostic systems, the provision of the system model in particular has proved difficult, specifically in rather complex systems. One reason is that for each system an individual system model has to be 'created', which generally cannot be reused even for only slightly different systems. Furthermore, the modeling process turns out to be rather costly, on one hand since the modeling is a manual process, and, on the other hand, since this manual process requires highly educated and therefore expensive personnel.
It is therefore an object of the present invention to facilitate the provision of system models to be applied in model-based diagnostic systems. The object is solved by the independent claims. Preferred embodiments are shown by the dependent claims.
The invention provides an improved tool for determining a system model describing a relation between applicable tests and components of a system under test (SUT). The system model can then be applied in conjunction with actual test results for determining at least one fault candidate representing a specific component of the SUT likely to have caused a fault or failure of the SUT. Each fault candidate is preferably provided with a certain probability that the component has caused the failure. The fault candidates can preferably be represented in a probability-ranked list. Thus, the invention can be applied in diagnosis tools that allow detecting any kind of system failure (hardware, configuration, and software) based on the system model and a series of tests.
According to the invention, a logical inference engine determines the system model for the system under test (SUT). The system model can be one of the test-based system models as described in the introductory part. It is to be understood that the invention is not limited to a specific type of system model required for a specific diagnosis application, but can be applied for a wide variety of different system models.
Typical features of the system model can be, either together or individually:
A system definition assigning an identification/reference (such as a name and/or an address) to each component of the system, preferably in a multiple level (preferably two levels) part-whole hierarchy comprising names of the components (typically field-replaceable units FRUs or block level units) and of their subcomponents. It is to be understood that even though the system diagnosis can be accomplished/executed on the lowest level of the hierarchy of system components (in order to achieve highest accuracy), the result of the diagnosis is preferably provided and output only on the highest level of the hierarchy of system components (in order to provide a more concise result with highest understandability).
A failure probability for each component, gained e.g. from experience or quality tests provided e.g. from the manufacturer of the components. The failure probability represents the probability that a specific component will fail. The failure probabilities are applied for assessing the relationship between test results and components. The failure probabilities might represent quantitative values, such as x% failure probability, and/or typical qualitative failure probabilities of the components, e.g. in a way describing that component A fails twice as often as component B. A priori failure probabilities can be provided as starting values and might be refined/modified over time, e.g. dependent on newly acquired information about actual failure probabilities of the components or in accordance with foreknowledge of how failure probabilities vary over time. The a priori failure probabilities might be assumed to be the maximum values (e.g. the a priori failure probability is set to be 1 or 100%) for new or unknown components, and will then be refined during the course of the testing procedure.
A test definition assigning an identification/reference (such as a name) to each applicable test. Further information about the content, modalities, characteristics, and execution of tests might also be comprised by the test definitions.
A coverage that each test has on each component, i.e., the proportion of the functionality of the component that is exercised by the test, or more formally, the conditional probability that the test will fail given that the component is bad. The coverage describes the amount of functionality of a component required to pass a specific test. In other words, when the specific test has been executed without failure, whereby a number of components have been involved in the test, the coverage denotes how much (e.g. which percentage) of each component involved must be functional since the test has passed. In an example, an access on a memory involves the memory, a memory bus and a controller. To pass an access test of the memory would require e.g. 30% functionality of the memory (since the memory will be tested only partly), 90% functionality of the memory bus (since most signals of the memory bus are required to transmit the respective data), and 50% functionality of the controller (since only a part of the controller is involved for accessing the respective part of the memory).
Shared functions of tests, which are a way of system-modeling tests that are dependent because they test the functionality of some components in (substantially or exactly) the same way. For example: two tests that access a certain component through a common cable have shared coverage on the cable. In other words, the shared function denotes that a specific component is tested in plural tests in (substantially) the same way. The shared function may relate to individual components (e.g. the shared function of the memory is x% in all tests generally involving the memory) or to certain types of tests (e.g. the shared function of the memory is y% in all tests involving a write access on the memory). In the above example applied e.g. for two different memories A and B, both memory tests apply 70% functionality of the memory bus in an identical manner, so that the shared coverage on the memory bus is 70%.
In most system models, the system and test definitions together with the coverages represent the most important information, which has to be there right from the beginning of the tests. The failure probabilities can be defined or assumed almost arbitrarily and can be refined successively. The shared functions represent optional information for most system models, however, useful for providing more meaningful diagnosis results.
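By way of a non-limiting sketch, the features listed above could be held in a simple data structure such as the following. The class `SystemModel`, its field names, the FRU assignments and the failure probabilities are assumptions made for illustration only. The coverage and shared coverage values are those of the memory access example given above.

```python
from dataclasses import dataclass, field


@dataclass
class SystemModel:
    # System definition: component -> FRU it belongs to (two-level hierarchy).
    components: dict = field(default_factory=dict)
    # A priori failure probability per component.
    failure_prob: dict = field(default_factory=dict)
    # Test definition: names of the applicable tests.
    tests: list = field(default_factory=list)
    # Coverage: test -> {component: proportion of functionality exercised}.
    coverage: dict = field(default_factory=dict)
    # Shared functions (optional): component -> (tests, shared coverage).
    shared: dict = field(default_factory=dict)

    def components_covered_by(self, test):
        """Names of all components on which the given test has coverage."""
        return sorted(self.coverage.get(test, {}))


def example_model():
    """Memory access example for two memories A and B (FRU names hypothetical)."""
    tests = ["access_memory_A", "access_memory_B"]
    return SystemModel(
        components={
            "memory_A": "memory_module_A",
            "memory_B": "memory_module_B",
            "memory_bus": "system_board",
            "controller": "system_board",
        },
        failure_prob={"memory_A": 0.01, "memory_B": 0.01,
                      "memory_bus": 0.01, "controller": 0.01},
        tests=tests,
        coverage={
            "access_memory_A": {"memory_A": 0.3, "memory_bus": 0.9, "controller": 0.5},
            "access_memory_B": {"memory_B": 0.3, "memory_bus": 0.9, "controller": 0.5},
        },
        shared={"memory_bus": (tuple(tests), 0.7)},
    )


model = example_model()
print(model.components_covered_by("access_memory_B"))
```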
For determining the system model for a specific SUT, the logical inference engine applies static information and/or configuration information. Static information describes general information about the SUT, and/or possible components of the SUT, and/or tests applicable for the SUT, and might further describe relations therebetween. Configuration information, in contrast thereto, describes the actual set of components in the SUT.
Static information can be gathered e.g. from one or more of the following sources (which might also be referred to as databases, lists or the like):
Static SUT knowledge comprising general information about the general system type of the SUT, independent of the specific setup of the actual SUT. The static SUT knowledge might comprise a set of rules applicable to the supported hardware platforms and architectures (e.g. Intel-based systems). These rules may describe e.g. data paths between component classes (e.g. system memory and PCI cards or CPU), and might define relationships between functional units and FRUs, e.g. the functional components placed on a specific FRU. The static SUT knowledge might comprise generic definitions of components, e.g. the generic behavior of PCI or SCSI devices.
Static component knowledge comprising general information about possible components applicable in the SUT, independent of the specific component setup of the actual SUT. The static component knowledge preferably comprises failure probabilities, and might also (instead of or in addition to the static SUT knowledge) comprise generic definitions of components.
Static test knowledge comprising general information about possible tests executable on or applicable in the SUT, independent of the specific tests actually executed/applied for the actual SUT. Each test available is listed in this database. The static test knowledge preferably comprises the coverages and/or shared functions associated to each component.
Configuration information describes the actual set of components in the SUT. The configuration information is preferably directly acquired from the SUT to be diagnosed e.g. by an adequate acquisition unit (i.e. a device capable of reading the desired information) or defaulted, e.g. with information derived from the static SUT knowledge. Actual components present in the SUT can be identified e.g. by applying scanning algorithms and/or by using the SUT knowledge (e.g. since some components 'simply have to be there'). Different algorithms can be implemented for scanning busses in the SUT, like PCI bus, SCSI bus, USB etc.
The acquisition of the configuration information can be performed in an iterating and/or refining process, e.g. applying a first scanning for components preferably starting with sub-systems closest to the unit(s) acquiring the configuration information, e.g. the PCI bus as a common top-level bus system. During this first scanning, bridges to sub-systems lower in the SUT hierarchy can be detected eventually, thus leading to a second component scanning, e.g. of the SCSI bus. The second scanning can be done by the same acquisition units or new adequate devices found by previous scanning iterations. This procedure is preferably repeated until no more new sub-systems are found in the SUT.
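The iterating scan described above may be sketched, under simplifying assumptions, as a breadth-first traversal starting at the top-level bus. The function `scan_bus` stands in for a bus-specific scanning algorithm and is hypothetical, as are the dictionary keys `id`, `address` and `bridge_to`. A real acquisition unit would query the hardware instead of a table.

```python
from collections import deque


def scan_configuration(root_bus, scan_bus):
    """Scan busses iteratively until no new sub-systems are found.

    scan_bus(bus) is assumed to return a list of dicts with the keys
    'id' and 'address', plus 'bridge_to' if the device bridges to a
    sub-system lower in the SUT hierarchy.
    """
    found = []
    seen = {root_bus}
    pending = deque([root_bus])
    while pending:
        bus = pending.popleft()
        for device in scan_bus(bus):
            found.append((bus, device["id"], device["address"]))
            sub_bus = device.get("bridge_to")
            if sub_bus is not None and sub_bus not in seen:
                seen.add(sub_bus)          # a bridge leads to a further scan
                pending.append(sub_bus)
    return found


# Hypothetical topology standing in for real hardware.
topology = {
    "pci": [
        {"id": "scsi_controller", "address": "0:1", "bridge_to": "scsi"},
        {"id": "network_card", "address": "0:2"},
    ],
    "scsi": [
        {"id": "hard_disk", "address": "A"},
        {"id": "hard_disk", "address": "B"},
    ],
}
for entry in scan_configuration("pci", lambda bus: topology.get(bus, [])):
    print(entry)
```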
The scan results are preferably correlated with the static component knowledge, either once the scanning has been finished or as an ongoing process during the scanning. The configuration information might comprise a list of FRUs that users can add, like memory and add-in cards, or data read from system internal structures.
In accordance with the system definition (cf. the above said), the configuration information comprises a unique identification for each component for clearly identifying and distinguishing between each of the components of the configuration information. The identification can be accomplished by assigning a device identifier and an address to each component of the SUT. The device identifier describes the (generic) component class, thus allowing to differentiate e.g. between a hard disk (as one generic component class) and a CDROM drive (as another generic component class) located on a SCSI bus. The address then allows distinguishing between two hard disks (of the same generic component class), whereby one has the address A and the other has address B. The address preferably uses or equals the system address of the component in the SUT.
The configuration information can be stored (at any time during or after the modeling process) as reference data. When the configuration information is reacquired during the course of monitoring the SUT, any difference from this reference data can then also be interpreted as a diagnostic result. E.g., a missing component (i.e. a component defined by the reference data as being present in the SUT, which has been removed from the SUT after setting up the reference data) can be interpreted as a total failure of this component. In addition, many systems react to defective components by configuring certain related components slightly differently, e.g. by disabling bad sectors of a hard disk or disabling data transfer optimizations in controllers, to bypass the problem. This further shows that the configuration information used for building the system model of the SUT can be applied twice: for modeling purposes first and then for comparison with information about the components gathered after storing the reference data.
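A minimal sketch of such a comparison against the reference data is given below. It assumes that each component is keyed by a (device identifier, address) pair and carries a dictionary of settings. The function name `diff_configuration` and the setting `transfer_optimization` are hypothetical.

```python
def diff_configuration(reference, current):
    """Compare reacquired configuration information with the reference data.

    Both arguments map (device_id, address) -> dict of settings.
    """
    ref_keys, cur_keys = set(reference), set(current)
    return {
        # Present in the reference data but no longer found in the SUT.
        "missing": sorted(ref_keys - cur_keys),
        # Found in the SUT but unknown to the reference data.
        "added": sorted(cur_keys - ref_keys),
        # Still present, but configured differently than before.
        "changed": sorted(k for k in ref_keys & cur_keys
                          if reference[k] != current[k]),
    }


reference = {
    ("hard_disk", "A"): {"transfer_optimization": True},
    ("hard_disk", "B"): {"transfer_optimization": True},
    ("cdrom", "C"): {},
}
current = {
    ("hard_disk", "A"): {"transfer_optimization": False},
    ("cdrom", "C"): {},
}
print(diff_configuration(reference, current))
```

Here the missing hard disk B could be interpreted as a total failure, while the changed setting on hard disk A hints at a problem that the system has bypassed.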
The reference data can be applied as a 'known-good' reference representing a state wherein the SUT has been proved to be functional. It is clear that the reference data itself can also be subject to modification(s) during the course of monitoring the SUT.
For determining the system model of a specific SUT, the configuration information is combined with the static information for extracting, from the static information, the information that is required (only) for the specific SUT. In a diagnosing operation mode, i.e. when the SUT and the diagnosing system have already been set up, the configuration information is taken preferably from the reference data. This results in the same model being used for several diagnoses. However, the model-building procedure also works on recently gathered configuration information. Dependent on the degree to which the configuration information has already been gathered from the SUT at the time of the model-building procedure, the model-building procedure can be accomplished by one or more iterating steps. This procedure has to be chosen if no reference data is available.
In case that, at the time of the model-building procedure, the configuration information has already been fully or at least sufficiently gathered from the SUT (or reference data is available), a logical inference engine provides for the components of the configuration information a list of component instances defined by the device identifier and the system address. The term 'list' as used herein does not necessarily mean a list in form of a table or the like, but can apply to any kind of data association as provided e.g. in data processing units.
The component instances of the list of component instances (from the configuration information) are then combined with the static information in order to find further components in the SUT which might not have been identified in the scanning process for determining the configuration information. Such further components can be, for example, components that have to be present (as defined e.g. by the static SUT and/or component knowledge) in a functional SUT but cannot be or have not yet been detected, such as batteries, connectors or cabling.
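As an illustration only, the following sketch extends a scanned list of component instances with components that, according to the static knowledge, have to be present. The rule format (`implied_by_class`, `always_present`) and the example component names are assumptions of this sketch. Implied components here simply reuse the address of the component that implies them, and system-wide components get the address `None`.

```python
def add_implied_components(instances, implied_by_class, always_present):
    """Add undetectable components to a list of (device_id, address) instances.

    implied_by_class: device_id -> device_ids that must accompany it
    always_present:   device_ids that every functional SUT must contain
    """
    result = list(instances)
    known = set(instances)

    def add(instance):
        if instance not in known:   # never add the same instance twice
            known.add(instance)
            result.append(instance)

    for device_id, address in instances:
        for implied_id in implied_by_class.get(device_id, []):
            add((implied_id, address))
    for implied_id in always_present:
        add((implied_id, None))
    return result


scanned = [("hard_disk", "A"), ("hard_disk", "B")]
rules = {"hard_disk": ["power_connector"]}
print(add_implied_components(scanned, rules, ["battery"]))
```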
Using the static information, the list of component instances is then associated with:
one or more tests (as defined by the static test knowledge) covering at least one component of the list of component instances,
the coverages (as defined by the static test knowledge) assigned for each component of the list with respect to a specific test, and
the failure probabilities (as preferably defined by the static component knowledge) assigned for each component of the list.
The selected tests can then be associated with the shared function information, depending on whether shared functions are applied in the system model or not.
The list of component instances associated with the above further information represents a matrix of the system model, whereby, again, the term 'matrix' as used herein does not necessarily mean a matrix in form of a more or less complex table or the like, but can apply to any kind of data association as provided e.g. in data processing units.
The one or more selected tests can be applied as a test suite for diagnosing the SUT.
It is to be understood that the configuration information does not represent a fixed or static source of information but can also be subject to ongoing modifications e.g. under the influence of further executed tests. Accordingly, the determined model need not represent a static model, which after having been determined will remain 'as is'. The determined system model can also be subject to ongoing modifications under the influence of newly gathered information e.g. from executing the selected tests. For the sake of simplicity, however, the configuration information shall be regarded as finalized at a certain moment, e.g. once a test suite with a sufficient number of tests has been selected. Accordingly, the system model may also be regarded as finalized at the same time.
In a preferred embodiment, the above-described model building process is executed in a series of steps. In a first step, the list of component instances, defined by their device identifier and their system address, is formed. In a second step, a matrix of coverages is formed, wherein one axis of this matrix is defined by the list of component instances. This process has to take care to define a coverage in that way that one coverage is defined for each instantiated component. For the two hard disks A and B this would mean that two coverages would be associated with hard disk tests X and Y, where test X tests hard disk A while test Y tests hard disk B. This process may also take into account that testing a component Z from a data acquisition unit U implies a certain coverage on all components implementing the data path from U to Z or Z to U, like busses, connectors, cables, or bridging devices. All these components, again, might imply coverage on secondary components enabling these primary components to operate, like batteries or power supplies. In a third step, a list of tests is formed, which is simply the other axis of the coverage matrix, i.e. all tests that are found to have coverage on any component in the SUT are members of this list.
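The three steps may be sketched as follows for the example of the two hard disks A and B. The template format of the static test knowledge (`target`, `target_coverage`, `path_coverage`), the coverage values and the naming of instantiated tests are assumptions of this sketch. Coverage on secondary components such as power supplies is omitted for brevity.

```python
def build_coverage_matrix(instances, test_templates, data_path):
    """Form the coverage matrix and the list of tests for a given SUT.

    instances:      list of (device_id, address), formed in the first step
    test_templates: test name -> {'target', 'target_coverage', 'path_coverage'}
    data_path:      instance -> instances on the path to the acquisition unit
    """
    matrix = {}
    for name, template in test_templates.items():
        for instance in instances:
            device_id, address = instance
            if device_id != template["target"]:
                continue
            # Second step: one coverage row per instantiated component.
            row = {instance: template["target_coverage"]}
            for hop in data_path.get(instance, []):
                row[hop] = template["path_coverage"]
            matrix[f"{name}@{address}"] = row
    # Third step: every test with coverage on any component is listed.
    tests = sorted(matrix)
    return tests, matrix


instances = [("scsi_bus", "0"), ("hard_disk", "A"), ("hard_disk", "B")]
templates = {"hard_disk_test": {"target": "hard_disk",
                                "target_coverage": 0.6, "path_coverage": 0.7}}
paths = {("hard_disk", "A"): [("scsi_bus", "0")],
         ("hard_disk", "B"): [("scsi_bus", "0")]}
tests, matrix = build_coverage_matrix(instances, templates, paths)
print(tests)
print(matrix["hard_disk_test@A"])
```

The two instantiated tests correspond to tests X and Y of the example: each covers its own hard disk and the shared bus, but not the other hard disk.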
In case that, at the time of the model-building procedure, the configuration information has not been fully or at least sufficiently gathered from the SUT, the process for determining the system model as outlined above can be repeated iteratively, until the configuration information has been fully or at least sufficiently gathered.
The thus determined system model can then be applied in any one of the model-based diagnostic systems as outlined in the introductory part of this specification. It is, however, to be understood that the system model determining process has to be adapted to each respective model-based diagnostic system, since each of those diagnostic systems might require the system model in a specific data configuration and form. In addition, each diagnostic system might require different features of the system model, such as additionally requiring shared functions.
The system-modeling scheme according to the invention brings several advantages. The system model can be determined automatically without requiring manual aid, since all relevant information is already provided by the static and configuration knowledge. Although a certain (in most cases mainly manual) effort will have to be made in advance for providing the static information, this can be done on a very general basis for a variety of different systems in common. Thus, the invention is in particular useful for diagnosing a plurality of similar systems, each system, however, requiring a (slightly) different system model.
The system model determining process according to the invention further represents an open process that allows a more or less continuous refinement and/or updating of the system model. Thus, only little (configuration) information is required at the very beginning of a diagnosis process.
The invention further allows enhancing the system model even with respect to components that cannot be detected, or can be detected only with difficulty, by actually scanning the SUT. Moreover, the use of reference data even allows finding problems in unknown or little-known components, as the mere difference from the reference data indicates problems somewhere in the SUT.
The tests of the test suite to be executed on the SUT can be more specifically selected for the SUT, since the tests are chosen dependent on the actual configuration. When designing tests, only little care has to be taken on general applicability. Strict parameter ranges can be specified instead of going for the least common denominator. E.g., a hard disk controller supports two different modes, like performance mode and secure data mode, but for one particular system it is always operated in the one and same mode. In this case, a test is written preferably for this controller in this particular mode. The other mode is neglected for this system.
Therefore, the definition of tests becomes simpler than in conventional approaches, since complex tests are cut into small, easily manageable units.
Preferably, a first test would detect the mode of the component whereas a second test would be defined especially for this mode. The previously described model building process would then take care of selecting the correct second test. The cost of having more tests to write pays off because of an orthogonal set of test features (not overlapping, unambiguous) and a probable reduction of total test execution time.
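A minimal sketch of this two-stage selection is given below. The names `select_mode_test`, `detect_mode` and the test names are hypothetical. In the described scheme the selection would be made by the model building process rather than by a stand-alone function.

```python
def select_mode_test(detect_mode, tests_by_mode):
    """Run the mode-detecting first test, then pick the test written for that mode."""
    mode = detect_mode()
    if mode not in tests_by_mode:
        raise KeyError(f"no test defined for mode {mode!r}")
    return mode, tests_by_mode[mode]


# Hypothetical hard disk controller that is operated in secure data mode.
tests_by_mode = {
    "performance": "controller_performance_mode_test",
    "secure_data": "controller_secure_data_mode_test",
}
print(select_mode_test(lambda: "secure_data", tests_by_mode))
```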
It is clear that the invention can be partly or entirely embodied by one or more suitable software programs, which can be stored on or otherwise provided by any kind of data carrier, and which might be executed in or by any suitable data processing unit.