To provide consistent, high-performance client support, businesses typically rely on high-availability systems. High-availability systems are designed with some level of redundancy in order to provide fault tolerance for single points of failure. For example, two components that are typically replicated in high-availability designs are the processor and the power supply. Thus, in the event that one of the processors or power supplies fails, its redundant counterpart can be used to support the processing goals of the subsystem.
One feature which is often requested in high-availability systems is the ability to reset one of the processors should the processor become unstable. In typical high-availability designs, the processor reset function is provided as one component of a layered supervisory function. The layered supervisory function includes, in many models, a high-level Supervisor that monitors the processor's operating status and sanity by regularly polling the processor for status information. The Supervisor is generally implemented in a combination of hardware and software. During operation, the Supervisor communicates with a lower-level supervisory processor (SP) that is generally located on the same physical module as the processor being monitored. If the processor should get into an unrecoverable state, it needs to be reset. The SP acts in response to commands from the Supervisor to reset the processor. To reset the processor, the SP issues commands to the power supply associated with the failed processor to cycle the power to the failed processor. During the power cycle, the power supply is disconnected from the failed processor for a predetermined period of time and then is reconnected. When power is reconnected to the failed processor, the processor undergoes its predefined initialization procedures, which are intended to return the processor to an operable state.
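The layered reset sequence described above can be illustrated with a minimal software sketch. The class names, method names and the in-memory log below are hypothetical and are chosen only for illustration; an actual Supervisor and SP would generally be implemented in a combination of hardware and software, and the power-off interval would be whatever predetermined period the design specifies.

```python
import time
from typing import Callable, List


class Processor:
    """Hypothetical monitored processor; `healthy` stands in for its status reply."""

    def __init__(self) -> None:
        self.powered = True
        self.healthy = True

    def poll_status(self) -> bool:
        # An unpowered or hung processor does not report a sane status.
        return self.powered and self.healthy

    def initialize(self) -> None:
        # Stand-in for the predefined initialization procedures run at power-up.
        self.healthy = True


class PowerSupply:
    """Hypothetical power supply associated with a single processor."""

    def __init__(self, processor: Processor, log: List[str]) -> None:
        self.processor = processor
        self.log = log

    def disconnect(self) -> None:
        self.processor.powered = False
        self.log.append("power off")

    def reconnect(self) -> None:
        self.processor.powered = True
        self.log.append("power on")
        # Restoring power triggers the processor's initialization.
        self.processor.initialize()


class SupervisoryProcessor:
    """Lower-level SP: cycles power when the Supervisor commands a reset."""

    def __init__(self, supply: PowerSupply, off_seconds: float,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.supply = supply
        self.off_seconds = off_seconds  # assumed predetermined power-off period
        self.sleep = sleep

    def reset_processor(self) -> None:
        self.supply.disconnect()
        self.sleep(self.off_seconds)
        self.supply.reconnect()


class Supervisor:
    """High-level Supervisor: polls the processor and commands the SP."""

    def __init__(self, processor: Processor, sp: SupervisoryProcessor) -> None:
        self.processor = processor
        self.sp = sp

    def poll_once(self) -> bool:
        """Poll once; return True if a reset was commanded."""
        if self.processor.poll_status():
            return False
        self.sp.reset_processor()
        return True
```

In this sketch the Supervisor never touches the power supply directly; it only issues a reset command to the SP, which mirrors the layering described above. The sleep function is passed in as a parameter so that the power-off period can be simulated without an actual delay.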
One problem with using the above-described layered supervisory function is that, in order to ensure high availability, at least one redundant copy of the Supervisor, and advantageously the SP, should be provided. Without the additional copy of the Supervisor and SP, a fault in either could result in the inability to properly observe and reset the associated processor. However, providing an additional copy of the Supervisor and SP pair introduces additional hardware, cost and complexity to the system design. Thus, the inherent problem of redundancy is encountered, where cost is incurred without any added system performance. It would be desirable to determine a low-cost method for providing system reset in a high-availability system.