The present invention is directed to a method for isolating a fault in a peer-to-peer communication link between two processors.
Numerous computer systems have been assembled with two redundant processors. In particular, data storage systems are known in which a pair of storage processors are connected to an array of disk drives. For example, such a system is disclosed in U.S. Pat. No. 5,922,077 (Espy et al.). The full disclosure of said patent is hereby incorporated by reference herein. Espy et al. describes a dual data storage controller system in which the controllers are connected to one another by a peer-to-peer communication link. Each data storage controller is connected to a fibre channel loop in connection with each of the disk drives in the disk array. Fail-over switches provide each data storage controller with a means for connecting to either one of the fibre channel loops.
The peer-to-peer link between processors is typically used for configuration management. Prior systems lacking a dedicated peer-to-peer communication link may have used back end buses for transportation of configuration management information. If there is a failure in this back end link, the loss of all back end buses is implied, in which case the processor will not be able to access the disk drives. As such, there is no benefit in continuing to operate the storage processor and it might as well be shut down. In such a configuration, if the back end is working and the storage processor cannot contact its peer, it is safe to assume that the peer is dead.
In the case where the peer-to-peer communication takes place over a single dedicated configuration management channel between the two processors, if a processor cannot communicate with its peer, it cannot be sure whether the peer is dead or the communication channel has failed. Therefore, it does not know if certain operations can be safely performed. In order to address this problem, alternative peer-to-peer communication links were developed. In particular, by the use of mailboxes on disk drives, processors may be able to communicate through the disk drives to coordinate a safe completion of operations that require coordination between the two processors. For example, when a write cache is mirrored between the two processors, such operation needs to terminate when communication over the peer-to-peer link is interrupted. Upon termination of the mirroring, a write cache needs to be dumped to complete the write operations. The alternative communication link through the mailboxes on disk drives permits coordination between the processors so that only one of them dumps the cache and the cache dump can proceed without interference from the other processor. It is only necessary that one of the caches be dumped since they have been mirrored. While termination of the cache mirroring operation is able to be carried out through the mailboxes of the disk drives, there remains the problem of identifying the cause of the failure of the peer-to-peer communication link.
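The mailbox coordination described above can be illustrated with a minimal sketch. The text does not state how the two processors decide which one dumps the cache, so everything here is an assumption made for illustration: the in-memory `DiskMailbox` class standing in for mailbox regions on a disk drive, the `LINK_DOWN` message, the processor names, and the tie-break rule that the processor with the smaller identifier performs the dump.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class DiskMailbox:
    """Hypothetical in-memory stand-in for per-processor mailboxes on a disk."""
    slots: Dict[str, str] = field(default_factory=dict)

    def post(self, sender: str, message: str) -> None:
        # Each processor writes only to its own slot.
        self.slots[sender] = message

    def read(self, sender: str) -> Optional[str]:
        return self.slots.get(sender)


def announce_link_down(me: str, mailbox: DiskMailbox) -> None:
    """Phase 1: record that this processor has lost the peer-to-peer link."""
    mailbox.post(me, "LINK_DOWN")


def should_dump_cache(me: str, peer: str, mailbox: DiskMailbox) -> bool:
    """Phase 2: decide whether `me` dumps the mirrored write cache.

    Assumed to run only after a grace period long enough for a live peer
    to have completed phase 1. Assumed rule (not from the source): if the
    peer never announced, it is presumed dead and `me` dumps; otherwise
    the processor with the smaller identifier dumps.
    """
    if mailbox.read(peer) is None:
        return True
    return me < peer
```

With both processors alive and both announcements posted, exactly one of `should_dump_cache("SP-A", "SP-B", mailbox)` and `should_dump_cache("SP-B", "SP-A", mailbox)` is true, which matches the stated requirement that only one of the mirrored caches be dumped.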
In accordance with embodiments of the invention, dual processors are programmed to perform a fault isolation method so that the processor causing the fault in the peer-to-peer communication link can be replaced while the other processor remains operational during the fault detection.
An embodiment of the method of the invention involves a number of steps beginning with detecting an inability to communicate over the peer-to-peer communication link. The method suspends operation of a preselected one of the storage processors, arbitrarily indicating a fault with that storage processor. A user replaces the indicated storage processor with a new storage processor. The other storage processor detects replacement of the first storage processor. The new storage processor powers up. Before fully booting up, the storage processors test the peer-to-peer communication link. If the problem with the peer-to-peer communication is solved, operation returns to normal. If, after the first storage processor has been replaced, the peer-to-peer communication link still fails to provide communication between the two processors, the second storage processor, recognizing itself as a survivor, instructs the first storage processor to boot up and then suspends its own operation.
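The sequence of steps above can be traced with a short simulation. This is a sketch under stated assumptions, not the claimed implementation: the function name `isolate_fault`, the labels `"first"` and `"second"`, and the premise that the fault lies in exactly one of the two processors (so the link works once that processor has been replaced) are all introduced here for illustration.

```python
from typing import List, Tuple


def isolate_fault(faulty: str) -> Tuple[str, List[str]]:
    """Trace the replacement sequence; return (processor left flagged, log).

    `faulty` names the processor that actually causes the link fault. The
    processors themselves do not know this; it is used only to simulate
    the outcome of the link test.
    """
    if faulty not in ("first", "second"):
        raise ValueError("faulty must be 'first' or 'second'")

    log: List[str] = []
    replaced = set()

    def link_works() -> bool:
        # Assumption: the link recovers once the faulty unit is replaced.
        return faulty in replaced

    # Link failure detected: the preselected (first) processor is suspended
    # and arbitrarily indicated as faulty.
    log.append("link failure detected; first suspended and flagged")

    # The user replaces the flagged unit; the second processor detects this.
    replaced.add("first")
    log.append("first replaced; second detects replacement")

    # The new processor powers up and the link is tested before full boot.
    if link_works():
        log.append("link test passed; normal operation resumes")
        return "first", log

    # The link still fails, so the survivor (second) tells the new first
    # processor to boot up and then suspends itself.
    log.append("link test failed; second tells first to boot, then suspends")
    return "second", log
```

Calling `isolate_fault("second")` ends with the second processor suspended after a single unnecessary replacement of the first, while `isolate_fault("first")` returns to normal operation after the first replacement.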
An embodiment of the program code for powering up a storage processor includes code for attempting to communicate over the peer-to-peer communication link. Code is further included to permit communication with its peer storage processor by leaving messages in at least one of the disk drives. The code suspends the storage processor in a first chassis position when there is a failure to communicate over the peer-to-peer communication link and both storage processors are running in normal operation mode. Code is also included so that a survivor of a previous peer-to-peer communication link failure, upon detecting replacement of the peer storage processor and an inability to communicate over the peer-to-peer communication link, instructs its peer, by leaving a message in at least one of the disk drives, to boot up, and then suspends itself. In the survivor mode, if replacement of the peer storage processor has not been detected and a peer-to-peer communication link failure is detected, the survivor instructs its peer, by leaving a message in at least one of the disk drives, to suspend operation.
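The three cases handled by the program code can be written as a small decision function. The sketch below is illustrative only: the names `Mode`, `Action`, and `on_link_failure` are hypothetical, and the `BECOME_SURVIVOR` outcome for the processor that is not in the first chassis position is inferred from the description of the survivor rather than stated outright.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"      # both storage processors running
    SURVIVOR = "survivor"  # this processor outlived an earlier link failure


class Action(Enum):
    SUSPEND_SELF = "suspend self"
    BECOME_SURVIVOR = "keep running as survivor"
    PEER_BOOT_THEN_SUSPEND_SELF = "mailbox: tell peer to boot, then suspend self"
    PEER_SUSPEND = "mailbox: tell peer to suspend"


def on_link_failure(mode: Mode, in_first_position: bool,
                    peer_replaced: bool) -> Action:
    """Choose a response to a peer-to-peer link failure."""
    if mode is Mode.NORMAL:
        # Normal operation: the processor in the first chassis position
        # is the preselected one and suspends itself.
        if in_first_position:
            return Action.SUSPEND_SELF
        # Inferred: the other processor continues as the survivor.
        return Action.BECOME_SURVIVOR
    if peer_replaced:
        # Survivor, peer already swapped, link still down: the survivor
        # hands over to the new peer via a disk mailbox message and
        # suspends itself.
        return Action.PEER_BOOT_THEN_SUSPEND_SELF
    # Survivor, peer not replaced: the peer is told to suspend.
    return Action.PEER_SUSPEND
```

Keeping the decision in a pure function separates the policy from the mailbox and hardware access, so each of the cases can be checked in isolation.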
In accordance with the data storage system of an embodiment of the invention, a first and second storage processor are connected via a peer-to-peer communication link. A first communication channel connects the first storage processor to a plurality of disk drives. A second communication channel connects the second storage processor to the disk drives. A program code is maintained in each processor for enacting a survivor mode in which the processor can detect replacement of the peer storage processor. Each of the storage processors may be provided with a failure indicator which could be a light, a buzzer, or the like so that a user knows which processor to replace.