1. Field of the Invention
This invention relates to apparatus and methods for recovering from errors in multi-core storage-system components.
2. Background of the Invention
In many storage systems, particularly storage systems that are used for critical databases, file sharing on networks, business applications, or the like, ensuring high availability is critical. For this reason, redundancy is built into many storage systems to ensure that if one or more hardware components fail, other hardware components are available to pick up the workload of the failed components and provide continuous availability. For example, the IBM DS8000® enterprise storage system includes multiple servers to ensure that if one server fails, the other server remains functional to enable I/O to continue between hosts and storage devices. This built-in redundancy helps to reduce the impact that component failures have on organizational operations.
In such storage systems, performance is also critical due to the high workloads. In order to increase performance, multiple cores may be built into various components of the storage systems. For example, the IBM DS8000® enterprise storage system may utilize host adapters and device adapters to communicate with host devices (e.g., open system and/or mainframe servers) and storage devices (e.g., disk drives and solid state drives) respectively. The processors of these host adapters and device adapters may include multiple cores to improve performance (e.g., increase throughput).
Unfortunately, such multi-core storage-system components may experience errors from time to time. Such errors may manifest as either hardware or software errors. When such errors occur, the solution is typically to reset or restart the entire multi-core component, such as by performing a warmstart. Such a reset or restart temporarily removes the entire multi-core component (e.g., adapter) from service, thereby undesirably interrupting operations. If such errors persist, the interruptions to operations may become more frequent and severe.
In view of the foregoing, what are needed are apparatus and methods to more efficiently recover from errors in multi-core storage-system components. Ideally, such apparatus and methods will minimize the availability and performance impacts on the multi-core components.