Computer solutions are increasingly implemented in the form of distributed computer processors connected by a communications network. An example of such a network is a series of interconnected workstations such as the IBM Personal System/2 (IBM PS/2) computer or the IBM RISC System/6000 workstation (Personal System/2, PS/2, and RISC System/6000 are trademarks of the IBM Corporation). These workstations may be connected by a local area network (LAN) operating as a token ring, Ethernet, or other form of network.
In managing and maintaining a set of commonly shared workstations in a local area network used for distributed computing, problems such as system outages, reduced availability, or degraded performance often occur. Frequently, the origin of such problems is an inappropriate configuration change to the hardware or software on a workstation in the LAN. Although such problems can be analyzed and recovered from by manually correcting the workstation configuration, investigating each problem and determining its cause is often a time-consuming process, since it is typically not easy to identify the workstation where the inappropriate configuration change was made. Additionally, even after the affected workstation is identified, it often remains unknown when the inappropriate change was made, or by whom. Thus, although the workstation configuration could be corrected to resolve the immediate problem, the same user might return later to the same workstation or to another workstation and make the same mistake again, thereby repeatedly disabling or impairing the distributed computing system.
Limiting access to workstations is typically not a viable option in the work environment. In some cases, users have perfectly valid reasons for needing to change the hardware or software configuration on a workstation, and know how to perform the procedures correctly. In other cases, however, a workstation configuration could be updated incorrectly for one of a variety of reasons, including: the user did not know how to correctly update the workstation; the user understood how to update the workstation, but made an inadvertent mistake such as a typographical error; the user "borrowed" a piece of critical equipment for use in another workstation; the configuration was corrupted by defective hardware or software; or the user engaged in deliberate mischief, whether frivolous or malicious.
A wide array of diagnostic tools exists for inventorying and monitoring hardware and software. However, most of these diagnostic tools must be run manually and locally on each individual workstation. Further, report files are generated and typically saved on each individual workstation, rather than in a consolidated database.
On a system level, solutions typically use "passive" monitoring techniques in which a server workstation listens for error signals sent by other stations in the network. Such a technique relies on the other stations to accurately report errors as they occur. Several disadvantages to this approach are apparent. First, because each station is separately programmed to report error conditions, changes to the monitoring are difficult to administer: each individual station is affected whenever new types of error monitoring need to be added to the diagnostic system. In addition, conditions which are not necessarily errors, but which may indicate a potential hazard, may pass undetected, because such systems typically report failures only as they occur instead of periodically running selected diagnostic routines. Finally, it may be possible for a user to tamper with an individual workstation and thereby prevent the workstation from reporting an error to the server workstation, while simultaneously proceeding with other deliberate mischief, all the while going undetected.
A few of the existing diagnostic tools can be activated remotely, and can save information to a consolidated database. However, these solutions typically suffer from a number of drawbacks, including: dependence upon a centralized LAN server, which can make the tool unusable in case of failure of a critical workstation or communications link; inability to run the diagnostic tool automatically (i.e., unattended) at a specific time interval; inability to save information on previously reported configuration data; lack of an early warning system to draw attention to potential system problems; lack of tuning parameters or rule databases to adjust the behavior of the diagnostic tools, such as which conditions to report or to ignore; vulnerability to attempts by a malicious user to deceive the tool into reporting no problem, while tampering with a remote workstation; and excessive "false positive" reports where the tool does not tolerate momentary outages at a remote workstation.
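By way of illustration only, the "false positive" drawback noted above can be made concrete with a minimal sketch in Python. The sketch assumes a monitor that polls remote workstations and reports a workstation only after a configurable number of consecutive missed polls, so that a momentary outage is tolerated. The class name OutageFilter, the method name record, and the threshold parameter are hypothetical and are not drawn from any existing tool.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class OutageFilter:
    """Suppress reports until a station misses `threshold` polls in a row."""

    threshold: int = 3  # hypothetical tuning parameter, not from the source
    misses: Dict[str, int] = field(default_factory=dict)

    def record(self, station: str, reachable: bool) -> bool:
        """Record one poll result; return True if the station should be reported."""
        if reachable:
            # Any successful poll clears the run of misses for that station.
            self.misses[station] = 0
            return False
        self.misses[station] = self.misses.get(station, 0) + 1
        return self.misses[station] >= self.threshold


if __name__ == "__main__":
    monitor = OutageFilter(threshold=3)
    polls = (False, True, False, False, False)
    print([monitor.record("ws1", ok) for ok in polls])
    # prints [False, False, False, False, True]
```

In this sketch, the single missed poll followed by a successful one is ignored, and a report is raised only on the third consecutive miss. A larger threshold trades slower detection for fewer spurious reports.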
The peer-to-peer system and method for remote inventorying and monitoring presented herein address the deficiencies of the above-discussed existing art in the distributed processing environment.