A typical data storage system stores and retrieves data for one or more external host devices (or simply hosts). Such a data storage system typically includes processing circuitry and a set of disk drives. In general, the processing circuitry performs load and store operations on the set of disk drives on behalf of the hosts, e.g., block I/O operations using SCSI communications, ESCON communications, Fibre Channel signals, etc.
On occasion, the data storage system may require servicing by a technician. To this end, the technician typically goes to the location where the data storage system resides, and performs a service procedure on the data storage system. For example, the system may require a hardware or software upgrade in order to integrate a design improvement or to fix a design defect. As another example, a circuit board of the processing circuitry or a disk drive may fail and require replacement.
To assist the technician in performing such service procedures, some data storage system manufacturers provide scripts that automate the servicing process. That is, in response to a few electronically entered commands (e.g., instructions typed into a data storage system console device by the technician), the scripts perform a more-detailed and more-complex series of operations. As a result, without extensive knowledge of low-level aspects of the data storage system, the technician can perform a variety of service operations on the data storage system such as upgrading hardware or software, or replacing a defective data storage part by simply providing a few commands (e.g., typing information at a keyboard) and performing some physical work (swapping a failed component with a new component).
For example, suppose that a disk drive of a data storage system fails. A technician can travel to the data storage system and, at the console device of the data storage system, run a conventional script that guides the technician through a disk drive replacement procedure in an automated manner. For one conventional type of data storage system, the script first requires the technician to identify a spare disk drive for use in recovering data on the failed disk drive. After the technician identifies the spare disk drive, the script performs a data recovery procedure to recover the data. Such a recovery procedure may simply involve copying data from a mirror disk drive to the spare disk drive or, alternatively, involve more extensive data recovery operations (e.g., performing a series of logical XOR operations to recover data from related data and parity information). After the data is restored onto the spare disk drive, the script directs the technician to physically remove the failed disk drive and replace it with a new disk drive. After the technician physically replaces the failed disk drive with the new disk drive, the script checks the new disk drive to make sure it has an appropriate size (e.g., that the new disk drive is at least as large as the failed disk drive). Next, the script copies the recovered data from the spare disk drive to the new disk drive. Once the data resides on the new disk drive, the script gives back the spare disk drive so that it can be used for other purposes, and the disk drive replacement process is complete.
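The XOR-based recovery mentioned above can be illustrated with a minimal sketch. The stripe layout, the block contents and the `xor_blocks` helper are assumptions made for illustration only; they are not taken from any particular data storage system.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe: three data blocks plus one parity block.
data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]
parity = xor_blocks(data)

# Suppose the drive holding data[1] fails. XOR-ing the surviving data
# blocks with the parity block reproduces the lost block.
recovered = xor_blocks([data[0], data[2], parity])
```

The recovery works because XOR-ing a value with itself cancels it out, so every surviving block drops out of the parity and only the missing block remains.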
The technician can perform other types of service procedures using other conventional scripts that automate those service procedures in a manner similar to that described above for replacing a disk drive. Other examples of conventional script-driven service procedures include those for upgrading hardware or replacing failed hardware (e.g., circuit boards, etc.) and those for upgrading software (e.g., operating systems, device drivers, application level programs, etc.).
Unfortunately, there are deficiencies in using the above-described conventional scripts that automate servicing processes. For example, such scripts typically expect a service procedure to complete successfully or, if stopped before completion, to be restarted from the beginning. However, many conventional service procedures can fail in the middle, leaving the data storage system in an intermediate state. When in such a state, the service procedure may not work properly if restarted because the service procedure may need certain parameters of the data storage system to be at certain values which have since changed to values that will cause the service procedure to operate improperly.
For example, suppose that a technician travels to a customer site to replace a bad disk drive of a data storage system. Upon arrival suppose that the technician boots the console device of the data storage system and invokes a disk drive replacement script which is designed to enable the technician to (i) allocate an available spare disk drive and recover data onto the spare disk drive (e.g., copy data from a disk drive that mirrors the failed disk drive), (ii) replace the failed disk drive with a new disk drive, (iii) subsequently transfer the recovered data from the spare disk drive to the new disk drive, and (iv) finally return the spare disk drive to its initially available condition.
The technician may arrive at the customer site and successfully recover the data of the failed disk drive onto a spare disk drive. The technician may then replace the failed disk drive with a new disk drive. If the new disk drive works properly, the technician can then transfer the recovered data to the new disk drive and then return the spare disk drive to complete the service procedure.
However, suppose that the new disk drive is itself defective, i.e., another failed disk drive. Further suppose that the technician does not possess another new disk drive to swap in place of the faulty new disk drive. In this situation, the technician typically leaves the data storage system with the replacement procedure running, and travels back to the office to retrieve another new disk drive. In the meantime, the data storage system may reboot the console device since some data storage systems are programmed to reset a component (e.g., the console device) if there has not been any activity from that component after a predetermined period of time (e.g., 30 minutes).
When the technician returns with the new disk drive, the technician finds that the console device has been rebooted and that the script for replacing a failed disk drive terminated in the middle. If the technician restarts the script, the script would operate improperly. In particular, the script would start at the beginning and require the technician to allocate a spare disk drive. Unfortunately, the technician cannot allocate the initially used spare disk drive since it is already allocated. Furthermore, if a second spare disk drive is available and the technician allocates the second spare disk drive, the data storage system would then have two allocated spare disk drives.
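The restart problem described above can be sketched as follows. The `SparePool` class, the `replace_drive_from_start` function and the spare names are hypothetical stand-ins, not part of any actual servicing script.

```python
class SparePool:
    """Hypothetical bookkeeping for spare disk drives."""

    def __init__(self, names):
        self.available = set(names)
        self.allocated = set()

    def allocate(self, name):
        if name not in self.available:
            raise RuntimeError(f"spare {name} is not available")
        self.available.remove(name)
        self.allocated.add(name)

def replace_drive_from_start(pool, spare):
    """Stand-in for a conventional script that always begins by allocating a spare."""
    pool.allocate(spare)
    # ... recover data, swap the drive, copy back, release the spare ...

pool = SparePool({"spare0", "spare1"})
replace_drive_from_start(pool, "spare0")      # first run, interrupted after this step

restart_error = None
try:
    replace_drive_from_start(pool, "spare0")  # restart with the same spare fails
except RuntimeError as err:
    restart_error = str(err)

replace_drive_from_start(pool, "spare1")      # restart with a second spare "succeeds"
```

After the second restart, both spares are marked allocated even though only one holds the recovered data, which is the inconsistent condition the passage describes.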
At this point, a typical next step for the technician is to call the home office by telephone, and obtain technical assistance from a specialist such as someone with intimate knowledge of the disk drive replacement process. The specialist would provide detailed instructions that enable the technician to complete the disk drive replacement process by hand (i.e., without further using the script). In particular, the specialist would explain to the technician how to manually replace the second faulty disk drive with the new disk drive. The specialist would then explain how to transfer the recovered data from the spare disk drive to the new disk drive. Finally, the specialist would explain how to return the spare disk drive to an available state in order to manually complete the disk drive replacement procedure.
In some situations, the specialist may not be trained well enough to properly guide the technician through a servicing procedure. In such a situation, the technician may need to talk directly with an engineer. In these situations, the engineer is taken away from attending to other important tasks such as designing new products.
Additionally, the specialist or engineer guiding the technician through completion of the procedure may forget particular steps. For example, the specialist or engineer may forget to ask the technician to verify that the new disk drive is at least the same size as the failed disk drive. If such a verification has not taken place, the technician may have inadvertently replaced the failed disk drive with a new disk drive that is too small. Accordingly, and perhaps after the technician has left the customer location and deemed the data storage system to be properly fixed, an application running on the data storage system may fill up the new disk drive expecting there to be more disk space than what is actually there. Such occurrences would require the technician to return to the customer location to diagnose and fix the problem, thus increasing the servicing cost. They could also cost the manufacturer goodwill and result in a reputation for lower quality due to the amount of trouble encountered by the customer.
In contrast to the above-described conventional scripts that attempt to automate the technician's servicing processes but which must either successfully complete, be restarted, or require a technician to manually complete if interrupted in the middle, an embodiment of the present invention is directed to techniques for accessing a data storage system (e.g., upgrading hardware or software, replacing a defective component, etc.) using a maintenance procedure that, if aborted prior to completion and after the data storage system transitions to a particular state, can restore the data storage system back to an earlier state and complete the maintenance procedure. Accordingly, a technician running the maintenance procedure does not need to either manually finish accessing the data storage system or begin the maintenance procedure from scratch. Rather, the technician can simply complete the maintenance procedure from the earlier state thus avoiding the need for taking special action (e.g., telephone assistance, manual completion, etc.).
One arrangement of the invention is directed to a data storage system that includes a data storage assembly which is capable of storing and retrieving data; and a service processor, coupled to the data storage assembly, that accesses the data storage assembly. The service processor has a memory, and a controller coupled to the memory. The controller is configured to perform part of a maintenance procedure on the data storage system such that a state of the data storage system transitions from a first state to a second state, and store, in a memory, a data structure identifying the second state. The controller is further configured to, after the maintenance procedure is aborted prior to completion of the maintenance procedure and after a transition of the state of the data storage system from the second state to a third state, (i) restore the data storage system to the second state based on the data structure stored in the memory, and (ii) complete the maintenance procedure.
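The checkpoint-and-restore behavior described above might be sketched as follows. This is a minimal illustration under stated assumptions: the system state is modeled as a plain dictionary, the `memory` dictionary stands in for the service processor's memory, and the routine names and the `checkpoint` layout are hypothetical rather than taken from the arrangement itself.

```python
import copy

# Hypothetical routines; each mutates a dict standing in for the system state.
def allocate_spare(system):
    system["spare"] = "allocated"

def recover_to_spare(system):
    system["spare_data"] = "recovered"

def copy_to_new_drive(system):
    system["new_drive_data"] = system["spare_data"]

def release_spare(system):
    system["spare"] = "available"
    system["spare_data"] = None

ROUTINES = [allocate_spare, recover_to_spare, copy_to_new_drive, release_spare]

system = {"spare": "available", "spare_data": None, "new_drive_data": None}  # first state
memory = {}

# Perform part of the procedure (first state to second state), then record it.
for routine in ROUTINES[:2]:
    routine(system)
memory["checkpoint"] = {"next_routine": 2, "state": copy.deepcopy(system)}

# The procedure aborts and the system drifts to a third state (hypothetical change).
system["spare"] = "unknown"

# On the next run: restore the second state, then complete the procedure.
checkpoint = memory.pop("checkpoint")
system.clear()
system.update(checkpoint["state"])
for routine in ROUTINES[checkpoint["next_routine"]:]:
    routine(system)
```

In this sketch the saved data structure carries both the state to restore and the index of the next routine, so completion does not repeat the spare allocation that tripped up the conventional script.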
In one arrangement, the controller of the service processor is further configured to, prior to performing the part of the maintenance procedure, search for the data structure in the memory to determine whether the maintenance procedure previously aborted. Accordingly, the controller can determine whether the maintenance procedure terminated without completing based on whether it finds the data structure in the memory.
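One way to picture this search is a lookup for a saved record before any routine runs. The sketch below assumes, purely for illustration, that the memory is a JSON file; the file name, the record fields and both helper functions are hypothetical.

```python
import json
import os
import tempfile

def find_checkpoint(path):
    """Return the saved data structure, or None if no earlier run left one."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)

def save_checkpoint(path, record):
    """Persist the data structure so a later run can find it."""
    with open(path, "w") as handle:
        json.dump(record, handle)

path = os.path.join(tempfile.mkdtemp(), "replace_drive.json")

first_look = find_checkpoint(path)   # nothing saved: treat as a clean start
save_checkpoint(path, {"procedure": "replace_drive",
                       "next_routine": "copy_to_new_drive"})
second_look = find_checkpoint(path)  # record found: the earlier run did not complete
```

A procedure that deletes the record on successful completion makes the presence of the record a reliable signal of an earlier abort.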
Preferably, the maintenance procedure includes multiple routines. In one arrangement, the controller, when completing the maintenance procedure, is configured to (i) receive, from a user, an individual run command that identifies a routine of the maintenance procedure, and (ii) in response to the individual run command, individually run the routine of the maintenance procedure identified by the individual run command. In one arrangement, the maintenance procedure is configured to (i) receive, from a user, a skip command that identifies a routine of the maintenance procedure, and (ii) in response to the skip command, bypass the routine of the maintenance procedure identified by the skip command. In one arrangement, the maintenance procedure is configured to (i) receive, from a user, a continue command that identifies a routine of the maintenance procedure, and (ii) in response to the continue command, perform at least one of the routines of the maintenance procedure such that the last routine performed is that which is identified by the continue command. In one arrangement, the maintenance procedure is configured to (i) receive an undo command from a user, and (ii) in response to the undo command, return the data storage system to the first state based on the data structure saved in the memory. In one arrangement, the maintenance procedure is configured to save in the memory for each routine of the maintenance procedure: (i) a respective identifier that identifies that routine, (ii) a respective set of runtime variables utilized by that routine when executed, and (iii) a respective set of control variables that identifies how that routine operates relative to other routines of the maintenance procedure.
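The per-routine record and the run, skip, continue and undo commands could be sketched as below. All class, method and field names are assumptions chosen for illustration, and the system state is again modeled as a dictionary.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class RoutineRecord:
    identifier: str                                   # identifies the routine
    runtime_vars: dict = field(default_factory=dict)  # values used while running
    control_vars: dict = field(default_factory=dict)  # ordering and status

class Procedure:
    def __init__(self, system, routines):
        self.system = system
        self.first_state = copy.deepcopy(system)      # kept so undo can return to it
        self.routines = dict(routines)                # identifier -> callable, in order
        self.records = {
            name: RoutineRecord(name, control_vars={"order": i, "status": "pending"})
            for i, name in enumerate(self.routines)
        }

    def run(self, name):
        """Individually run the identified routine."""
        self.routines[name](self.system, self.records[name].runtime_vars)
        self.records[name].control_vars["status"] = "done"

    def skip(self, name):
        """Bypass the identified routine."""
        self.records[name].control_vars["status"] = "skipped"

    def continue_to(self, name):
        """Run pending routines in order; the identified routine is the last one."""
        for candidate in self.routines:
            if self.records[candidate].control_vars["status"] == "pending":
                self.run(candidate)
            if candidate == name:
                break

    def undo(self):
        """Return the system to the first state and reset every record."""
        self.system.clear()
        self.system.update(copy.deepcopy(self.first_state))
        for record in self.records.values():
            record.runtime_vars.clear()
            record.control_vars["status"] = "pending"

def allocate_spare(system, runtime):
    runtime["spare"] = "spare0"
    system["spare"] = "allocated"

def recover_to_spare(system, runtime):
    system["spare_data"] = "recovered"

def copy_to_new_drive(system, runtime):
    system["new_drive_data"] = "recovered"

proc = Procedure({"spare": "available"}, [
    ("allocate_spare", allocate_spare),
    ("recover_to_spare", recover_to_spare),
    ("copy_to_new_drive", copy_to_new_drive),
])
proc.run("allocate_spare")
proc.skip("recover_to_spare")        # e.g., already finished before an abort
proc.continue_to("copy_to_new_drive")
```

Keeping status in the control variables lets the continue command pass over routines that were already run or skipped without a separate bookkeeping structure.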
In one arrangement, the service processor further includes an input/output device coupled to the controller. The controller is further configured to display, in a graphical user interface on the input/output device, a hierarchical representation of portions of the data structure to enable a user to navigate among the hierarchical representation in order to access (i) the respective identifier, (ii) the respective set of runtime variables and (iii) the respective set of control variables for each routine of the maintenance procedure.
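As a text-only stand-in for such a graphical user interface, the hierarchy could be rendered as an indented outline. The `render_tree` function and the record layout below are hypothetical and serve only to show the three levels a user would navigate.

```python
def render_tree(records):
    """Render each routine record as an indented outline."""
    lines = []
    for record in records:
        lines.append(record["identifier"])
        for section in ("runtime_vars", "control_vars"):
            lines.append(f"  {section}")
            for key, value in sorted(record[section].items()):
                lines.append(f"    {key} = {value}")
    return "\n".join(lines)

records = [{
    "identifier": "allocate_spare",
    "runtime_vars": {"spare": "spare0"},
    "control_vars": {"order": 0, "status": "done"},
}]
outline = render_tree(records)
```

Each routine identifier forms a top-level node, with its runtime variables and control variables as child nodes beneath it.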
The features of the invention, as described above, may be employed in data storage systems, devices and methods and other computer-related components such as those manufactured by EMC Corporation of Hopkinton, Mass.