1. Field of the Invention
This invention relates in general to computer systems, and more particularly to a method, apparatus and program storage device for providing automatic recovery from premature reboot of a system during a concurrent upgrade.
2. Description of Related Art
Storage controllers are used in storage systems to control arrays of hard disk drives, including storing data in a distributed manner across multiple disk drives and storing redundancy information (such as parity information) along with the data. To prevent data loss in the event of a disk drive failure, storage controllers may be configured to provide a range of different types of data redundancy including, for example, RAID 1, RAID 5 and RAID 0+1. Host computers typically do not see devices that correspond directly to the individual disk drives; rather, storage controllers create logical devices. If a disk drive fails, the storage controller uses the redundancy information to recover the information stored in the failed disk drive.
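By way of illustration only, parity-based recovery of the kind mentioned above can be sketched with byte-wise XOR, as is commonly used for single-parity schemes such as RAID 5. The function names and the in-memory block layout below are assumptions made for this sketch and do not describe any particular storage controller.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of ints per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def compute_parity(data_blocks):
    """Parity block for one stripe: the XOR of all of its data blocks."""
    return xor_blocks(list(data_blocks))


def rebuild_missing_block(surviving_blocks, parity):
    """Recover the single missing data block of a stripe.

    XOR-ing the parity with every surviving data block cancels those
    blocks out and leaves the contents of the missing one.
    """
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    stripe = [b"abcd", b"efgh", b"ijkl"]  # one block per (hypothetical) drive
    parity = compute_parity(stripe)
    # Simulate losing the second drive and rebuilding its block.
    recovered = rebuild_missing_block([stripe[0], stripe[2]], parity)
    print(recovered == stripe[1])
```

A scheme of this kind tolerates the loss of one block per stripe; mirroring arrangements such as RAID 1 instead keep a full copy of the data.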
In addition, a storage controller may be configured with a plurality of storage clusters, each of which provides for selective connection between a host computer and storage devices and each preferably being on a separate power boundary. Each cluster might include a multipath storage director with first and second storage paths, a cache memory and a non-volatile storage (“NVS”) memory.
In many storage products, two or more controllers are used to provide redundancy. This redundancy can prevent interruption of service in the event of a software or hardware failure on one of the controllers. In addition, the redundancy can be leveraged when code (software or firmware) updates are provided. One type of code update process is called concurrent code-load. Concurrent code-load processes generally require the computer system to be fully operational before a code-load upgrade is begun.
Errors and other unforeseen circumstances can cause the code-load upgrade process to fail partway through due to a premature reboot of the system. A premature reboot can leave the system in a degraded state in which either only one controller is active (running on either the old code or the new code), or one controller is running the new code while the other controller is left running the old code, which may also result in unanticipated errors. In the former case, the overall system is exposed to a single point of failure and significant performance degradation.
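The degraded states described above can be made concrete with a minimal sketch that classifies a two-controller system from each controller's status. The type names, field names and the additional OFFLINE state are hypothetical and are introduced here only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SystemState(Enum):
    FULLY_OPERATIONAL = "both controllers active on the same code level"
    SINGLE_CONTROLLER = "only one controller active (single point of failure)"
    MIXED_CODE_LEVELS = "both controllers active on different code levels"
    OFFLINE = "no controller active"  # assumption: not discussed in the text


@dataclass(frozen=True)
class Controller:
    name: str
    active: bool
    code_level: str  # e.g. "old" or "new"; hypothetical labels


def classify(first: Controller, second: Controller) -> SystemState:
    """Classify the overall state of a two-controller system."""
    active = [c for c in (first, second) if c.active]
    if not active:
        return SystemState.OFFLINE
    if len(active) == 1:
        return SystemState.SINGLE_CONTROLLER
    if first.code_level != second.code_level:
        return SystemState.MIXED_CODE_LEVELS
    return SystemState.FULLY_OPERATIONAL


if __name__ == "__main__":
    # A premature reboot after only one controller was updated.
    state = classify(Controller("A", True, "new"), Controller("B", True, "old"))
    print(state.name)
```

Under these assumptions, only FULLY_OPERATIONAL would satisfy the precondition for beginning or retrying a concurrent code-load.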
Recovering from premature reboot failures can be a lengthy and expensive process. Manually restoring a system to a fully operational state so that a code-load upgrade can be retried often requires a trained system administrator with knowledge of the internal code-load actions. In addition, where there are multiple clusters, it is typically difficult to achieve a fully operational state if the premature reboot occurred after an update of only one of the clusters. Because of this, the user may be required to switch back to the original code level to reach a fully operational state. In addition, many users do not have sufficient knowledge of internal code-load actions to fix a code-load failure and must contact field service personnel.
It can be seen that there is a need for an improved method of recovering from premature reboot of a system during a concurrent code-load upgrade.