The present invention relates generally to a fault monitoring system and a method of controlling the same for a data processing system or systems. More particularly, the present invention is concerned with a control system which is advantageously suited for rapidly carrying out an initial diagnosis and issuing recovery commands from a remotely located place or station when a fault occurs in the data processing system (hereinafter also referred to as computer system).
As the range of applications of data processing systems or electronic computer systems increases, the system structure tends to be of larger and larger scale with correspondingly increased complexity. Under the circumstances, great importance is placed on improvement of reliability, enhanced fault tolerance (fault withstanding capability), rapid restoration of the system after occurrence of a fault, and the like.
In the data processing systems developed in recent years, there exists a general tendency that the main body of a data processing system or computer system is additionally equipped with a maintenance and control-dedicated apparatus for performing maintenance and diagnosis of the computer system. This type of control apparatus is known as a service processor (abbreviated as SVP), a typical one of which is disclosed in U.S. Pat. No. 4,204,249. Besides, in JP-A-58-56158, there is disclosed a control system in which a plurality of user computer systems are subjected to maintenance and diagnosis performed by a computer system installed at a remotely located maintenance center. Additionally, there is disclosed in JP-A-61-148542 a control system capable of manipulating a display of the SVP from a remotely located place.
According to the technique disclosed in U.S. Pat. No. 4,204,249, concentrated management of a plurality of processors is made possible by imparting a power on/off control and a microprogram loading control to the SVP. It is noted in particular that, by wiring dedicated signal lines directly from the control apparatus to a group of processors, a smaller number of wirings is required when compared with the parallel wiring system known heretofore.
According to the technique disclosed in JP-A-58-56158, a computer system installed at a maintenance center constantly monitors a plurality of users' computer systems in sequence for the purpose of detecting in advance the occurrence of faults or obstacles. Further, there is disclosed in JP-A-61-148542 a system in which a display of a service processor or SVP can be manipulated from a remote place by using, at the maintenance site, a display control program of the same structure as that of the display control program in the SVP and substantially the same processing procedure. To this end, a data buffer is provided in the SVP and the content of this data buffer is transferred to the maintenance site.
As full-time (24-hour) operation service of the data processing system is increasingly adopted and the fields of application thereof are widened, there are required not only development of techniques for improving and enhancing the reliability and fault tolerance of the data processing system but also control means which can ensure rapid recovery of the system after occurrence of a fault. The rapid recovery may be accomplished when a maintenance engineer resides all the time at the site of the user's computer system. However, as the 24-hour operation service spreads and unattended operation comes into general use, it becomes necessary that the system maintenance engineers stand ready at the maintenance center for performing fault monitoring and maintenance for a plurality of users' computer systems. To this end, the function for detecting occurrence of a fault in the user's computer system from a remote place has to be provided together with control means for the rapid recovery.
When the prior art techniques are viewed in the light of the above, the technique disclosed in U.S. Pat. No. 4,204,249 permits arbitrary adjustment of the electric power supply unit and facilitates alteration of wirings as the computer system structure becomes complicated. More specifically, the power-on/off and voltage regulation can be performed by the SVP. However, this patent teaches neither a fault monitoring method nor a maintenance method performed from a remote place. Further, the system disclosed in this patent is subject to the constraint that a group of processors constituting a data processing system has to be supervised by the SVP installed at the system site.
According to the technique disclosed in JP-A-58-56158, a computer system installed at a maintenance center is adapted to perform communication with the SVPs of users' computer systems installed at different sites for monitoring them cyclically with a view to enhancing the availability of the maintenance center computer system, while attempting to automate diagnosis by cataloging the monitoring procedure. However, JP-A-58-56158 does not concretely disclose any fault detecting means, practical items of logging information, the standards or reference information for the fault decision, and the like.
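Purely by way of illustration, cyclic monitoring of this general kind might be sketched as follows. This is not the procedure of JP-A-58-56158, whose details are, as noted, not disclosed. The site names, the `query_status` callback and the status strings "OK" and "UNREACHABLE" are all hypothetical assumptions introduced for the sketch.

```python
from typing import Callable, Dict, List


def poll_sites_once(
    sites: List[str],
    query_status: Callable[[str], str],
) -> Dict[str, str]:
    """Visit every site once, in order, and return the sites that did not report "OK".

    query_status is a hypothetical callback standing in for whatever
    communication the maintenance center has with each site's SVP.
    """
    abnormal: Dict[str, str] = {}
    for site in sites:
        try:
            status = query_status(site)
        except ConnectionError:
            # A site that cannot be reached is itself treated as abnormal.
            status = "UNREACHABLE"
        if status != "OK":
            abnormal[site] = status
    return abnormal
```

In such a sketch, repeating `poll_sites_once` at a fixed interval would yield the cyclic behavior; what status a site reports, and on what basis a fault is decided, are left open here.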
On the other hand, according to the technique disclosed in JP-A-61-148542, the SVP of the computer system at the user's site is provided with a data buffer for the purpose of generating on a remote display unit the same display as that generated by the SVP of the computer system installed at the site, so that the content of the data buffer can be displayed on both the SVP display of the user's computer system and the remotely located display device, whereby the logical structure of the processing programs is simplified. By virtue of such an arrangement, it is possible to manipulate the SVP of the user's computer system with the aid of the display device installed at a remote maintenance center. At this juncture, it is noted that the SVP is intrinsically designed for backing up the maintenance operation and is capable of detecting a fault so far as it occurs in hardware. However, detection of erroneous operation of software, i.e. the operating system or OS, is in general difficult or impossible. Usually, when an OS is running, the fault detection is mainly carried out by monitoring OS-oriented console messages and the like. However, in JP-A-61-148542, no consideration is given to the timing for changing over the SVP displays at the user site and the remote location, the console message detecting means, the method of reporting occurrence of a fault, the items of fault information to be collected from the SVP display information, and the collecting method.
In order to realize, at a maintenance center or the like place, the detection of a fault occurring in the user's computer system and rapid recovery of the system after occurrence of the fault, there remains as a problem to be solved how to realize a mechanism which is capable of instantly collecting, at a remote place, the history of the behavior of the computer systems of concern. In general, the history of the behavior of the OS can be acquired by tracing the messages which have been outputted on the OS console. Usually, the hard copy device for outputting the console messages is located near the user's computer system. However, when that computer system is operated in a so-called unattended mode, the power supply for the hard copy device is turned off in most cases in order to avoid such undesirable situations as exhaustion of copy paper and jamming thereof.
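To make the problem concrete, one conceivable way of retaining a console message history without a hard copy device is a bounded in-memory buffer holding the most recent messages. The following is a minimal sketch under that assumption only; the class name `ConsoleHistory`, its methods and the capacity parameter are hypothetical and are not taken from any of the cited references.

```python
from collections import deque
from typing import Deque, List


class ConsoleHistory:
    """Keeps only the most recent console messages (hypothetical helper)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen discards the oldest entry when it is full.
        self._messages: Deque[str] = deque(maxlen=capacity)

    def record(self, message: str) -> None:
        """Store one console message, evicting the oldest if necessary."""
        self._messages.append(message)

    def tail(self, count: int) -> List[str]:
        """Return up to `count` of the newest messages, oldest first."""
        if count <= 0:
            return []
        return list(self._messages)[-count:]
```

A remote station could then request `tail(n)` to review the messages preceding a fault; how the request is transported to the user's site is outside the scope of this sketch.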
Further, for disposing of the fault, particular areas in a main storage have to be referred to from a remote place. As such areas, there may be mentioned an area where OS managing information is stored and an area used by hardware.
Needless to say, when a user's computer system is operated in the unattended mode, neither an operator nor a maintenance engineer is present at the system. Accordingly, there exists a demand for provision of control means for detecting the occurrence of faults. Further, when occurrence of a fault or obstacle is recognized at a remote station, it is necessary to perform initial analysis during the period until a maintenance engineer arrives at the site of the computer system suffering from the fault. Provision of the control means mentioned above can naturally contribute to rapid recovery of the computer system after the occurrence of a fault.