This invention generally relates to a remote maintenance monitoring system for monitoring hardware devices, and more particularly concerns a retrofit monitoring system for use in performing integrated diagnostic maintenance of a large-scale computer system having a plurality of distributed hardware devices. Still more particularly, the invention concerns real-time critical point monitoring with non-intrusive sensor implants, combined with expert system diagnostics for automated maintenance, responsive to both intermittent and hard failures.
Monitoring and maintenance of hardware devices present technical problems, especially where no built-in testing features are present in the device to be monitored. The tasks presented are particularly problematic where there are a plurality of devices comprising a distributed large-scale integrated system to be monitored. As the scale of the distributed system expands, with each device thereof having multiple components therein subject to failure, the likelihood generally increases that certain intermittent (i.e., transient) failures of various components and/or devices are either never detected, or are inadequately identified for time-effective correction thereof. Hard failures are no less troublesome, particularly if they result in system degradation during critical system operations.
The need for diagnostic maintenance of large distributed systems, particularly as a retrofit feature to devices not having built-in testing features, is both a widespread and a multi-faceted problem. The NASA Kennedy Space Center in Florida has several systems which exemplify the technical problems presented with monitoring operations and maintenance of large distributed systems, such as multi-unit computer systems. Such large-scale distributed computer system maintenance needs are problems that face most ground support and space-based systems at the Center.
One such system at Kennedy Space Center is known as the Launch Processing System (LPS). The Launch Processing System is an integrated network of computers, data links, displays, controls, hardware interface devices, and computer software required to control and monitor flight systems, ground support equipment, and facilities used in direct support of Shuttle vehicle test activities. The LPS has three major subsystems: the Checkout, Control and Monitoring Subsystem (CCMS), the Central Data Subsystem, and the Record and Playback Subsystem. The purpose of the CCMS is to provide a method for testing, checking out, safing, and operating the vehicle during Shuttle ground operations. The CCMS includes nine different hardware sets with over 200 Modcomp II/45 minicomputers.
Maintenance of a large distributed computer system (like CCMS, having over 200 computers) is a complicated task involving highly manpower-intensive diagnostic methodology. Conventional front panel and scope trouble-shooting is one significant limitation on time-effective maintenance of such large-scale distributed systems. Additionally, the particular Modcomp computers of the Kennedy Space Center CCMS lack any built-in self-testing capabilities. Additional factors for any large-scale system are increased maintenance needs due simply to the aging of the various hardware components, and potential losses in diagnostic expertise (i.e., attrition among skilled maintenance technicians and engineers). All of the foregoing factors have the potential for adversely impacting any manpower-intensive maintenance program.
Traditional maintenance methods relying on limited front panel indications, "roll up" diagnostics, and scope trouble-shooting (all of which generally require engineers and technicians to be experts on the particular systems being monitored) are inherently limited. Such limitations are particularly highlighted as the scale of the maintenance problem increases, and as time constraints and the need for system operational competence increase. Significantly, studies have shown with respect to the Kennedy Space Center's CCMS that, based on operational Modcomp computer history, a significant number of intermittent or transient failures are never found (i.e., specifically isolated) where such traditional monitoring and maintenance methods are utilized.
Another inherent limitation to maintenance of a large system (such as CCMS) is that on-line hardware monitoring of the operational large-scale and distributed computer system is extremely limited, since the devices are not subject to being repaired (i.e., cannot be operationally interfered with) during critical portions of their operations. Traditional diagnostic methods, often based on ineffective testability of the device as originally designed, frequently result in ambiguous testing results which are difficult to interpret. Intermittent and transient failure problems present particular trouble-shooting difficulties, since the originally designed test points (if any) are normally insufficient for unique fault isolation.
Due to the size of the CCMS, and many other systems having similar maintenance problems, the cost of retrofitting a closed-bus architecture with a built-in self-test capability is normally prohibitive. Quite apart from that cost consideration, all of the problems discussed above generally result in greater-than-anticipated (or desired) operational cost and downtime for the distributed system. Additionally, the foregoing traditional manpower-intensive diagnostic techniques, as applied to large distributed systems, provide virtually no information which would allow anticipation of approaching system failures.