The present invention relates to monitoring system health during initialization of a computer system.
It is known to provide a watchdog function for monitoring system health of a running system. A watchdog function typically operates by having the operating system output signals at regular intervals to a hardware timer that counts between determined values (e.g., from a given value down to an underflow value, or from a given value up to an overflow value). Each signal resets the counter to the given value so that the overflow or underflow value is not reached. This is sometimes described as “patting a watchdog timer”. In the event that the operating system hangs, or some other system failure prevents the signals from being output, the timer reaches the overflow or underflow value as appropriate, and this is used to indicate that a fault has occurred. In a normal operating state, the frequency of the output signals and the length of the period represented by the difference between the given and overflow/underflow count values can be chosen in a predictable manner such that the counter does not overflow or underflow during normal operation.
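By way of illustration only, the conventional watchdog behaviour described above can be sketched in software. The following is a minimal simulation, not a description of any particular hardware: the names `WatchdogTimer`, `pat`, `tick` and `run`, and the use of a down-counter that flags a fault on underflow below zero, are assumptions made for the example.

```python
from typing import Optional


class WatchdogTimer:
    """Simulated down-counting watchdog timer (hypothetical, illustrative only)."""

    def __init__(self, reload_value: int) -> None:
        if reload_value <= 0:
            raise ValueError("reload_value must be positive")
        self.reload_value = reload_value  # the "given value"
        self.count = reload_value
        self.expired = False  # set once the counter underflows

    def pat(self) -> None:
        """Reset the counter to the given value, as the operating system would."""
        if not self.expired:
            self.count = self.reload_value

    def tick(self) -> None:
        """Advance the timer by one clock step; flag a fault on underflow."""
        if self.expired:
            return
        self.count -= 1
        if self.count < 0:
            self.expired = True


def run(reload_value: int, pat_interval: int,
        hang_at: Optional[int], total_ticks: int) -> Optional[int]:
    """Return the tick at which a fault is flagged, or None if none occurs.

    The simulated operating system pats the timer every `pat_interval`
    ticks until tick `hang_at`, after which it outputs no further signals.
    """
    wd = WatchdogTimer(reload_value)
    for t in range(1, total_ticks + 1):
        os_alive = hang_at is None or t < hang_at
        if os_alive and t % pat_interval == 0:
            wd.pat()
        wd.tick()
        if wd.expired:
            return t
    return None
```

In this sketch, a healthy system that pats every 3 ticks against a given value of 5 never underflows, whereas a hang at tick 10 leads to a fault being flagged a few ticks after the last pat. Choosing a pat interval longer than the counter period (for example, an interval of 5 against a given value of 2) produces a fault even though the system is healthy, which is why the two values must be chosen together.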
In a highly available system that includes a plurality of less highly available sub-systems, it may be necessary to restart the sub-systems from time to time in response to transient faults in order to maintain overall operation. Although a conventional watchdog approach can readily be used to monitor the health of the sub-systems while the operating system of those sub-systems is operational, such an approach to health monitoring cannot be used during state transitions, for example at initialization, or in abnormal states where the operating system is not operational.
The present invention seeks to address the monitoring of the health of such sub-systems during such state transitions and abnormal states.