Over recent years the complexity of computer systems (and, in particular, computer networks) has increased considerably, such systems being characterized by the interaction of multiple system entities in providing a variety of different services. One consequence of this is the considerable strain placed on system management resources tasked with keeping such systems up and running.
Certain basic fault diagnosis tools have been developed to address these computer system management issues. For example, low-level fault diagnosis equipment such as protocol analyzers has evolved generally in line with the increasing sophistication of the technologies used for inter-linking system entities. Such equipment has often tended to serve only as an aid to a maintenance engineer or system administrator, telling him/her what may be wrong at the particular point under test. Similarly, higher-level network management systems have been designed to provide an overview of overall system performance by collecting data from all parts of a network. However, such tools have largely been of limited use, leaving much of the problem of understanding what is going on to the network supervisor or system administrator.
Existing tools used to diagnose computer system faults suffer from a number of limitations. First, current tools operate at the kernel level of the computer system, requiring that the computer system be taken offline in order to discern faults in it. Additionally, virtually all fault diagnosis and fault correction must involve a human element. In other words, the system administrator must become involved in all fault diagnosis and correction. This is inefficient and extremely time-consuming. Also, conventional diagnostic tools do not have the capability to collect enough data to determine the nature of a system fault. Commonly, some error information is acquired (with the system offline) and used to provide some rudimentary suggestions concerning a suitable diagnosis tool. At this point the system administrator must review the error information and select an appropriate diagnosis tool. Almost universally, this diagnosis tool will request further system error information, which must then be collected. This new error information is then provided to the diagnosis tool so that it can attempt fault diagnosis. Once a diagnosis is made, the system administrator takes action to correct the fault. The system is then restarted.