1. Field of the Invention
The present invention relates in general to computers, and, more particularly, to an error management system and method in a storage subsystem.
2. Description of the Prior Art
Storage subsystems having a plurality of subcomponents are increasingly common in the art. One such storage subsystem consists of two controller cards which are connected through a backplane. Each of the two controller cards houses two main processors and a processor bridge, as well as various other integrated components, and acts as a separate entity with redundancy capabilities. One main processor (the server processor) runs Linux and hosts a proprietary “Shark” user mode program and kernel module, while the other processor (the host adapter) runs proprietary host adapter software and no third-party operating system. The two processors typically communicate only through peripheral component interconnect (PCI) data transfers and through control words written to the mailbox 0 register in the processor bridge.
When the server processor hangs, or a user mode process dies or is killed unexpectedly, there is typically no time to notify the host adapter code to drop light to (disconnect from) the host(s). Consequently, the host adapter continues to accept new requests from the host and keeps sending those requests asynchronously to the server processor for processing.
One solution to the lack of notification time, long known in the art, is referred to as the “suicide panic,” in which the host adapter notices that it has not received mail (structured data sent to a specific location across the PCI bus) from the server for some number of seconds and then decides on its own to drop light to the host(s). However, this implementation has several drawbacks. The current design does not confirm in any way that the server processor is hung or that the user mode process has exited. It merely waits some amount of time and then disconnects, as described.
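The timeout-only behavior described above can be sketched as follows. This is a minimal illustrative model, not the actual host adapter code: the class name `HostAdapterWatchdog`, its fields, and the 5-second value used in the usage note are all hypothetical, and time is passed in explicitly as seconds so that the logic is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class HostAdapterWatchdog:
    """Hypothetical model of a timeout-only "suicide panic" check."""

    timeout_s: float          # silence allowed before disconnecting (assumed parameter)
    last_mail_s: float = 0.0  # time at which mail last arrived from the server
    link_up: bool = True      # True while the adapter is connected to the host(s)

    def on_mail(self, now_s: float) -> None:
        """Record that mail arrived from the server processor."""
        self.last_mail_s = now_s

    def poll(self, now_s: float) -> bool:
        """Drop light if the server has been silent for timeout_s or longer.

        Elapsed time is the only input: nothing here confirms that the
        server processor is hung or that the user mode process has exited.
        Returns the current link state.
        """
        if self.link_up and now_s - self.last_mail_s >= self.timeout_s:
            self.link_up = False  # "drop light" to the host(s)
        return self.link_up
```

For example, with `wd = HostAdapterWatchdog(timeout_s=5.0)` and `wd.on_mail(1.0)`, a call to `wd.poll(3.0)` leaves the link up, while `wd.poll(6.0)` disconnects, regardless of whether the server is actually hung or merely quiet.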
In addition, there are several scenarios in certain topologies in which it is normal for the server to not send mail to the host adapter for a long period of time (e.g., failover/failback). Since the server processor and host adapter share a memory controller, resetting the adapter means that the entire system must also be rebooted. One cannot afford to lower the time limit to a value that might fall within the range of a normal recovery action, since triggering a suicide panic on a host adapter takes down the entire system. Thus, the current timeout value for certain scenarios is set to approximately 800 seconds. This value is unacceptable because it is considerably longer than some hosts will allow their input/output (I/O) requests to remain outstanding.
Thus, a need exists for an error notification implementation which significantly reduces the timeout value to less than 15 seconds (the default value for some hosts) to avoid a loss of access on the hung paths.