1. Field of the Invention
The invention relates generally to storage systems and more specifically relates to methods and structure for providing near-live reprogramming of firmware for storage controllers operating in a virtual machine (i.e., hypervisor) environment.
2. Discussion of Related Art
Storage systems have evolved in many respects over recent decades. Modern storage systems provide not only redundancy (e.g., RAID storage management techniques) but also frequently incorporate higher level storage services. For example, many present-day storage systems provide virtualization services within the storage system (above and beyond RAID storage management) to further hide the mapping from logical storage volumes to physical storage devices and locations. Still further, many present-day storage systems provide services for automated backups, de-duplication, replication (e.g., “shadow copies”), etc.
As storage systems have evolved to provide more services, firmware (programmed instructions for execution by a processor and/or customized, reprogrammable logic circuits) has grown in complexity. It is common that the firmware in such sophisticated storage systems may require reprogramming from time to time. Such reprogrammed firmware may provide bug fixes and/or feature upgrades as compared to a current version of the firmware. In some circumstances, the firmware may be reprogrammed to return to an older version of firmware due to bugs or problems in a newer version. Typically, as presently practiced, such reprogramming of firmware requires that the storage system be taken “offline” for a period of time to perform the firmware reprogramming. While offline, host systems may be incapable of accessing the data on the storage devices of the storage system and incapable of adding new data to be stored in the storage system.
In some high reliability storage system environments it may be unacceptable to permit the storage system to be offline. Most such high reliability storage systems provide for redundant storage controllers to help assure continuous access to the stored data. The redundant controllers provide for a backup controller to assume control of processing host system requests in case of failure of the presently active controller. To further enhance reliability, the host systems may also provide for redundant communication paths to each of the multiple redundant storage controllers of the storage system. A backup communication path may be used if a primary communication path appears to have failed. In storage systems with such redundant architectures, the host systems typically incorporate some form of “multi-path” driver software so that each host can direct I/O requests over an appropriate path and re-direct an I/O request to another path in case of failure of a communication link or of a controller.
In a redundant controller environment, a firmware reprogramming process may be performed by cooperative processing between the controllers such that a first controller informs the other controller that the first controller will be offline for a period of time and that the other controller should assume responsibility for processing host requests while the first controller's firmware is reprogrammed. After the first controller's firmware is successfully reprogrammed, the first controller may inform the other controller that the other controller should now perform its own firmware reprogramming while the first controller assumes responsibility for processing I/O requests. Eventually both controllers will be back online and the system will continue normal processing using the reprogrammed firmware. Such known reprogramming processes eliminate the offline status of the storage system but at the cost of reduced performance and/or reliability while one of the redundant controllers is “offline” during its firmware reprogramming process.
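The cooperative hand-off described above can be illustrated with a minimal sketch. It is not taken from any actual controller firmware; the `Controller` class, the `reprogram` and `rolling_update` functions, and the volume names are all hypothetical and serve only to show the ordering of the steps.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    # Hypothetical model of one redundant storage controller.
    name: str
    firmware: str
    online: bool = True
    volumes: set = field(default_factory=set)


def reprogram(ctrl, peer, new_fw, log):
    """Hand ctrl's volumes to peer, reprogram ctrl, then take the volumes back."""
    owned = set(ctrl.volumes)
    peer.volumes |= owned          # peer assumes responsibility for host requests
    ctrl.volumes.clear()
    ctrl.online = False            # ctrl is "offline" while being reprogrammed
    log.append((ctrl.name, "offline", sorted(peer.volumes)))
    ctrl.firmware = new_fw         # stands in for the actual reprogramming and reboot
    ctrl.online = True
    peer.volumes -= owned          # ctrl reacquires control of its own volumes
    ctrl.volumes |= owned
    log.append((ctrl.name, "online", sorted(ctrl.volumes)))


def rolling_update(first, other, new_fw):
    """Reprogram the first controller, then the other, one at a time."""
    log = []
    reprogram(first, other, new_fw, log)
    reprogram(other, first, new_fw, log)
    return log
```

At every step exactly one controller is offline, so the storage system as a whole stays reachable, but a single controller carries all volumes during each "offline" window, which is the reduced performance and reliability noted above.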
The above summarized prior approach relies on the host systems' multi-path driver software to make appropriate switches between the redundant communication paths and redundant controllers at appropriate times during the firmware replacement process. “Multi-path” driver software typically resides in host systems to keep the selection of a particular communication path to a particular storage controller of the storage system transparent to the host system applications. The multi-path driver receives I/O requests from applications and routes each request to a selected storage controller of the redundant controllers via a selected one of the redundant communication paths. This reliance on the multi-path driver to perform appropriate processing at the appropriate time during the firmware reprogramming can give rise to problems. Ideally, a first controller will be able to go through the steps of transferring responsibility for its volumes, reprogramming its firmware, and reacquiring control of its volumes before the host systems' multi-path drivers time out. If this is the case, the host systems retry I/O requests to the first controller for only a relatively brief period of time before the first controller resumes processing requests and responding, now using the new firmware version. However, the time required for reboot of the first controller (following reprogramming of its firmware) may be significant in some storage systems because the controller needs to do a full discovery of the back-end storage device network (e.g., SAS discovery, etc.) including expansion trays, etc. This reboot processing of the controller can take so long that it triggers failover processing by the multi-path driver of the host system to utilize an alternate path between the host and the storage system.
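The dependence on timing can be made concrete with a small sketch. It assumes a simplified driver that retries on its preferred path at a fixed interval and fails over once a path timeout elapses; the function name `path_outcome` and its parameters are hypothetical, and real multi-path drivers use more elaborate retry policies.

```python
def path_outcome(outage_s, retry_interval_s, path_timeout_s):
    """Model retries on the preferred path while the controller is unavailable.

    outage_s         -- seconds until the controller responds again (assumed)
    retry_interval_s -- seconds between retries on the preferred path (assumed)
    path_timeout_s   -- seconds after which the driver gives up on the path (assumed)

    Returns ("resumed", t) if a retry at time t succeeds before the timeout,
    otherwise ("failover", path_timeout_s).
    """
    t = 0.0
    while t < path_timeout_s:
        if t >= outage_s:
            return ("resumed", t)      # controller answered; no failover needed
        t += retry_interval_s
    return ("failover", path_timeout_s)  # driver switches to the alternate path
```

With these assumed numbers, `path_outcome(20, 5, 60)` resumes on the preferred path, while `path_outcome(90, 5, 60)`, standing in for a long reboot with full back-end discovery, ends in failover.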
In a worst case scenario of this kind where the multi-path driver times out its preferred path while the first controller is in the process of reprogramming its firmware, the first controller and the multi-path driver may tear down all the data structures they had set up for the storage system via this apparently failed path. The host multi-path driver will then attempt to access the logical volumes on the storage system using the alternate path (e.g., the other controller) and will try to re-create the data structures for that alternate path. In some cases, it may take the multi-path driver a long time to tear down the data structures relating to the first controller. The time required could be long enough that when the driver finally attempts to access the logical volumes using the other controller, that other controller has already begun transferring the first controller's logical volumes back to control of the first controller. The multi-path driver may then time out again, this time because of incorrect use of the alternate path to the other controller. In such a case the whole storage system may be considered failed by the host system although in fact it is merely reprogramming its firmware. In the worst cases, where these timing conditions repeat, the storage system may enter a “deadly embrace” from which an administrative user must take explicit, manual recovery action to get the overall system back up and running.
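The race in this scenario reduces to a comparison of three durations, sketched below under simplifying assumptions: the first controller goes offline at time zero, the other controller starts returning volumes as soon as reprogramming completes, and the driver attempts the alternate path only after its teardown finishes. The function name `alternate_path_attempt` and all parameters are hypothetical.

```python
def alternate_path_attempt(path_timeout_s, teardown_s, reprogram_s):
    """Return (time of first I/O on the alternate path, owner of the volumes then).

    path_timeout_s -- when the driver gives up on its preferred path (assumed)
    teardown_s     -- time the driver spends tearing down data structures (assumed)
    reprogram_s    -- when the first controller is back and volumes move back (assumed)
    """
    first_alternate_io = path_timeout_s + teardown_s
    # Before reprogram_s the other controller still holds the volumes, so the
    # alternate path works; afterwards they are being returned to the first one.
    owner = "other" if first_alternate_io < reprogram_s else "first"
    return first_alternate_io, owner
```

For example, with an assumed 60 second path timeout and a 120 second reprogramming window, a 30 second teardown reaches the other controller in time, whereas a 90 second teardown arrives after the volumes have started moving back, which is the condition under which the driver may time out a second time.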
Thus, it is an ongoing challenge to efficiently reprogram firmware in a storage system where redundant paths and controllers may be used in conjunction with multi-path driver software on attached host systems.