1. Field of the Invention
This invention relates in general to computer systems, and more particularly to a method, apparatus and program storage device for performing fault tolerant code upgrade on a fault tolerant system by determining when functional code reaches a desired state before resuming an upgrade.
2. Description of Related Art
A storage system uses a storage controller to control a plurality of magnetic disks so that both the information to be stored and its redundant information are stored across the magnetic disks in a distributed manner. For example, many controllers offer a wide variety of RAID levels such as RAID 1, RAID 5, RAID 0+1 and many other algorithms to ensure data availability in the event of the failure of an individual disk drive. In this case, the hosts do not see devices that correspond directly to the individual spindles; rather, the controller presents to the hosts a virtual view of highly available storage devices, called logical devices. Accordingly, when one of the magnetic disks fails, the storage controller can recover the information in the failed magnetic disk according to the redundant information. Then, normal operation can be resumed.
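As an illustration of recovery from redundant information, the following minimal sketch assumes a single XOR parity block protecting one stripe, in the manner of RAID 5. The function name `xor_blocks` and the sample block contents are hypothetical and are not taken from any particular controller.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks on three disks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate the loss of the second disk and rebuild its block
# from the surviving data blocks and the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block, which is why a single disk failure in such a stripe is recoverable.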
In addition, a storage controller may be configured with a plurality of storage clusters, each of which provides for selective connection between a host computer and a direct access storage device and each preferably being on a separate power boundary. Each cluster might include a multipath storage director with first and second storage paths, a cache memory and a non-volatile storage (“NVS”) memory.
In most of today's storage products, two or more controllers are typically used to provide redundancy. This redundancy is necessary to prevent interruption of service in case of a software or hardware failure on one of the controllers. In addition, this redundancy is useful when applying new software updates.
However, during the upgrade there is always the possibility that the functional code might “misbehave” and initiate an unexpected role transition due to the underlying fault tolerant system. For example, the functional code can initiate a failover and/or failback such that a fully operational system transitions into a single operational node without any regard to the current code-load process. A failover occurs when one controller relinquishes its duties to the other controller while maintenance is performed on itself. A failback occurs when maintenance is completed and the controller is ready to regain control of its duties. The system may resume dual node operation upon failback. Having these two independent and sometimes conflicting threads of operation will cause the current code-load to fail.
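The conflict between role transitions and a code-load can be pictured with the following minimal sketch. All names (`Mode`, `Cluster`, `run_code_load`) are hypothetical, and the sketch assumes, for illustration only, that every code-load step requires dual node operation.

```python
from enum import Enum


class Mode(Enum):
    DUAL = "dual node"
    SINGLE = "single node"


class Cluster:
    """Hypothetical two-controller system with nodes "A" and "B"."""

    def __init__(self):
        self.mode = Mode.DUAL
        self.active = {"A", "B"}

    def failover(self, node):
        # The node relinquishes its duties to its partner.
        if self.mode is not Mode.DUAL:
            raise RuntimeError("failover requires dual node operation")
        self.active.discard(node)
        self.mode = Mode.SINGLE

    def failback(self, node):
        # The node regains its duties; dual node operation resumes.
        if self.mode is not Mode.SINGLE or node in self.active:
            raise RuntimeError("failback requires a failed-over node")
        self.active.add(node)
        self.mode = Mode.DUAL


def run_code_load(cluster, steps, interrupt_at=None):
    """Run `steps` code-load steps; `interrupt_at` simulates the functional
    code initiating a failover of node "B" just before that step."""
    for step in range(steps):
        if step == interrupt_at:
            cluster.failover("B")
        if cluster.mode is not Mode.DUAL:  # assumption: each step needs both nodes
            return f"failed at step {step}"
    return "completed"
```

In this sketch an undisturbed load returns "completed", whereas a failover initiated before the second step leaves the cluster in single node operation and the load reports a failure at that step, even though a later failback would restore dual node operation.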
A code-load process that can withstand such unavoidable incidents, and carry the code-load through to successful completion despite them, will result in a higher concurrent-code-load success ratio and fewer support cases and expenses.
One possible solution is to have the functional-code communicate its state transition to the code-load process. However, such a mechanism would be rather complex as well as error prone. To avoid such complexity, the code-load may be simply re-initiated at a later time. Nevertheless, merely waiting to retry the load later does not guarantee success. The system may discover another error and perform another error recovery, which could lead to another code upgrade failure and consequently another delay waiting until a later time to retry the code-load.
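The weakness of simply retrying later can be seen in the following minimal sketch. It assumes a simulated timeline in which error recovery occupies known time windows and a code-load attempt fails whenever it overlaps one of them; the function `blind_retry` and all of its parameters are hypothetical.

```python
def blind_retry(recovery_windows, load_duration, retry_delay, max_attempts):
    """Retry a code-load at fixed intervals without observing system state.

    Returns a list of (start_time, succeeded) pairs, one per attempt.
    An attempt fails if any recovery window (w_start, w_end) overlaps
    the interval [start, start + load_duration).
    """
    attempts = []
    start = 0
    for _ in range(max_attempts):
        end = start + load_duration
        collided = any(
            w_start < end and start < w_end
            for w_start, w_end in recovery_windows
        )
        attempts.append((start, not collided))
        if not collided:
            break
        start += retry_delay  # wait, then try again blindly
    return attempts


# One recovery event: the second attempt happens to succeed.
lucky = blind_retry([(2, 4)], load_duration=5, retry_delay=10, max_attempts=3)

# Repeated recovery events: every attempt collides with a new recovery.
unlucky = blind_retry(
    [(2, 4), (11, 13), (21, 22)], load_duration=5, retry_delay=10, max_attempts=3
)
```

Because the retry schedule ignores what the system is actually doing, success depends on whether a later attempt happens to avoid the next recovery, which matches the observation that waiting alone does not guarantee success.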
It can be seen that there is a need for a method, apparatus and program storage device for performing fault tolerant code upgrade on a fault tolerant system by determining when functional code reaches a desired state before resuming an upgrade.