In current data storage facilities, such as storage servers, storage systems (or storage subsystems) and their components require firmware updates. This process is commonly referred to as code load or code update. During that process, the firmware of multiple components may need to be updated. The code load process is usually performed when the components are in good operational condition. Therefore, before the code load, a set of pre-checks is run to ensure that the components are in a good operational state.
Since multiple components are involved, each component has its own pre-check. For example, if the code load process determines that it will update the storage controller, the disk enclosure, and the disks, then the code load process runs the pre-check for each of these components before updating that component. However, if the pre-check fails for one component, such as the storage controller, then the entire code load process is suspended, and any components remaining to be updated are not updated. Here, “component” does not necessarily mean a single module; it may mean a type of module in the storage system (e.g., all the disk enclosures in the storage system).
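The per-component pre-check flow described above can be sketched as follows. This is a minimal illustration only, not an actual code load implementation: the names `ComponentType`, `precheck`, and `run_code_load` are hypothetical, and setting the `updated` flag stands in for a real firmware update.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ComponentType:
    """A type of module in the storage system (hypothetical model)."""
    name: str
    # Hypothetical callback: returns True when the component is in a good operational state.
    precheck: Callable[[], bool]
    updated: bool = False


def run_code_load(components: List[ComponentType]) -> str:
    """Run each component's pre-check immediately before updating that component.

    Returns "completed" if every component was updated, or "suspended" as soon
    as one pre-check fails. After a suspension, the remaining components are
    left untouched.
    """
    for component in components:
        if not component.precheck():
            # One failed pre-check suspends the entire code load.
            return "suspended"
        # Placeholder for the real firmware update of this component.
        component.updated = True
    return "completed"
```

With a plan of storage controller, disk enclosure, and disks, a failing storage controller pre-check suspends the code load before anything is updated, while a failing disk pre-check suspends it after the first two components have already been updated.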
The traditional pre-check is conservative. When the pre-check finds a problematic module, the entire code load task is suspended to prevent further damage to the storage system. This technique is widely used in the field because: (1) it is a widely accepted field support guideline that engineers should repair the problematic module first and then perform the code load on the storage system; and (2) sometimes the high-level code load process cannot skip some of the modules and still update the rest of the modules (otherwise, the code load becomes non-concurrent, which means host access is interrupted). That is, the code load process can either “update all the modules of a certain type” or “not update any modules of a certain type”.
In some cases, a module may have a redundant module, and the redundant module may also need to be updated. If the code load is not suspended when one of these modules is problematic, the redundant module may be reset during its update, so host access to the storage system is interrupted, which is a serious event in the field.
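The redundancy concern can be illustrated with a small sketch. It assumes, purely for illustration, that host access is maintained as long as at least one healthy module in a redundancy group stays online; the function name and the health map are hypothetical.

```python
from typing import Dict


def host_access_survives_reset(group_health: Dict[str, bool], module_to_reset: str) -> bool:
    """Return True if a healthy module other than the one being reset remains.

    group_health maps each module name in a redundancy group to True (healthy)
    or False (problematic). This is an assumed model, not a real system API.
    """
    return any(
        healthy
        for name, healthy in group_health.items()
        if name != module_to_reset
    )
```

Under this assumption, resetting module B while its partner A is problematic leaves no healthy module to serve the host, which is the interruption the conservative pre-check is meant to prevent.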
Some customers complain that the suspend rate of code load is too high. When the code load is suspended due to hardware problems, the engineers need to order a new module, replace the old module, and then restart the code load. If this happens during a service window, the engineer can do nothing until the new module arrives. Usually, this means customers have to arrange another service window for the code load.