1. Field of the Invention
This invention relates to enterprise storage controllers, and more particularly to apparatus and methods for collecting data and recovering from failures in enterprise storage controllers.
2. Background of the Invention
In enterprise storage controllers, such as the IBM DS8000™ storage controllers, a “warmstart” may be used as both a debug mechanism and a recovery method. The warmstart may initiate a state-save operation to collect state-save data. This state-save data may contain data structures that are deemed necessary to analyze the problem or condition that necessitated the warmstart. In addition, the state-save data may contain component-specific microcode path traces and general traces. Component-specific traces may be saved in separate buffers. General traces, which record all the activity of each defined processor, may also be saved in separate buffers.
The warmstart itself may allow the storage controller to reinitialize certain data structures, rebuild linked lists, and restore the storage controller to a known state. During the warmstart, there are periods of time during which I/O may be blocked and must therefore be redriven. Accordingly, the duration of the warmstart should be as short as possible to minimize the impact on the storage system. This is particularly important in applications such as time-sensitive banking and reservation systems.
A warmstart may be automated for both debug and recovery when the storage controller detects certain conditions, such as microcode detected errors (MDEs) and microcode logic errors (MLEs). In certain cases, a warmstart may be initiated by host software or applications. For example, a host may send a state-save command to the storage controller when the host detects certain conditions that would warrant collecting data at the storage controller. Typically, after an application fails, the host will send a state-save command to the volume that was used by the application. In addition, a user can manually force a warmstart on the storage controller command line at any time.
When a warmstart occurs, the state-save data that is collected at a customer site may be transmitted to a customer support center. The state-save data may be very large, possibly hundreds of megabytes. As a result, it can take several hours for the state-save data to be successfully transmitted to the customer support center. Unfortunately, the time spent in transit may delay support and problem analysis. For a customer-critical problem, the problem may recur at the storage controller while the state-save data is in transit and before the customer support center has a chance to understand the problem or provide an action plan. For a customer, this situation can be very aggravating. The impact of the problem can also increase each time the problem recurs.
Once the state-save data arrives at a customer support center, the data may be analyzed to determine the cause of the problem. Many times, only a small amount of the state-save data is needed to determine what caused the problem. Thus, much of the data may be unneeded, and much of the delay associated with the unneeded data may be unnecessary. Furthermore, much of the time, effort, and resources required to collect and transmit the state-save data may be wasted.
Many of the problems that occur in the storage controller are typically related to a host/device relationship. This is because a storage controller may be device-centric, meaning that a host (e.g., an open system or mainframe server) may access a specific device (e.g., a disk drive, a tape drive, etc.) to perform I/O operations (i.e., reads and writes) to that device. Consequently, only data directly related to the problem (e.g., data related to a host/device relationship, data related to a specific piece of hardware with a problem, data related to events surrounding an MDE or MLE, or the like) may be needed to analyze the problem. For example, data may be needed with respect to the state of device data structures, the data path used for the host-to-device communications, or the like. The data needed may include data structures for several components, where most or all of the data structures are related to a specific device/host relationship.
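By way of illustration only, the observation that only data tied to a specific host/device relationship may be needed can be sketched as a simple filter over collected records. The sketch below is hypothetical: the record layout, the field names (component, host_id, device_id, size_bytes), the function name select_relevant, and the sample sizes are all assumptions made for this example and do not describe the data structures of any actual storage controller.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StateSaveRecord:
    """Hypothetical unit of state-save data (layout assumed for illustration)."""
    component: str             # component that produced the data, e.g. "cache"
    host_id: Optional[str]     # host the data relates to; None if not host-specific
    device_id: Optional[str]   # device the data relates to; None if not device-specific
    size_bytes: int            # size of the record


def select_relevant(records: List[StateSaveRecord],
                    host_id: str,
                    device_id: str) -> List[StateSaveRecord]:
    """Keep only the records tied to the given host/device relationship."""
    return [r for r in records
            if r.host_id == host_id and r.device_id == device_id]


# Illustrative sample data; the sizes are invented, not measurements.
records = [
    StateSaveRecord("cache", "hostA", "dev1", 4_000_000),
    StateSaveRecord("cache", "hostB", "dev2", 6_000_000),
    StateSaveRecord("host_adapter", "hostA", "dev1", 1_000_000),
    StateSaveRecord("general_trace", None, None, 200_000_000),
]

relevant = select_relevant(records, "hostA", "dev1")
total_bytes = sum(r.size_bytes for r in records)
relevant_bytes = sum(r.size_bytes for r in relevant)
print(f"{relevant_bytes} of {total_bytes} bytes relate to hostA/dev1")
```

In this invented data set, the records from several components that relate to the failing host/device pair make up a small fraction of the total, which mirrors the point made above that much of a full state save may be unneeded for analyzing an isolated problem.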
Other issues may also arise when a warmstart collects state-save data. For example, when a state save occurs, even though the problem is related to a specific host and/or device, all devices connected to the storage controller may become unavailable to all hosts attached to the storage controller. This means I/O activity is suspended for all hosts and all devices during the warmstart. This is a heavy penalty to pay for an isolated problem between a specific host and device.
It has also been observed that a warmstart recovery may fail for reasons unrelated to the original problem. For instance, since all I/O activity may be suspended during the warmstart, all data queued in cache may need to be destaged to the storage devices (e.g., disk drives). If the queue of data in the cache is very large, this may cause the warmstart to take longer than normal. In the worst-case scenario, this delay may cause another warmstart. In such cases, the storage controller may be unable to recover on its own.
In view of the foregoing, what are needed are apparatus and methods to reduce the number and/or frequency of warmstarts that occur in enterprise storage controllers. Further needed are apparatus and methods to more closely tailor data collection and recovery processes to specific failures and host/device relationships, thereby allowing I/O to continue between hosts and storage devices unrelated to the problem. Yet further needed are apparatus and methods to reduce the amount of data collected in association with a failure condition, thereby reducing the time and resources needed to analyze the data or transmit the data to a customer support center.