1. Field of the Invention
The present invention relates to a technology for collecting operation information when trouble occurs in a disk array device.
2. Description of the Related Art
In recent years, in an information-oriented society where the volume of data to be handled increases day by day with the development of information infrastructure, there is a demand for information systems with high reliability and high availability. To realize such an information system, disk array devices, which allow constant access to and backup of a large volume of data, are rapidly coming into wide use. A disk array device has large-capacity storage constituted of a plurality of hard disks (magnetic disk devices), and it reads/writes data from/to the hard disks in response to requests from a host computer and the like.
With this rapidly increasing use of disk array devices, their performance has improved remarkably, a larger number of components are mounted on each device, and these components are associated with one another in complicated ways. Accordingly, when trouble occurs in the disk array device, it takes an enormous amount of time and labor to locate its cause and to determine the range of its influence.
Regarding trouble countermeasures for a storage system such as a disk array device, the following technologies, for example, have been proposed. In one technology, a control part controlling disks is duplexed, and when one of the control parts detects an abnormality, the control part having the abnormality performs processing for determining its cause while the other control part takes over processing, thereby avoiding interruption of regular processing (see, for example, Japanese Patent Application Laid-open No. 2002-7077). In another technology, when trouble occurs, log data that is obtained immediately before the acquisition of the latest dump data is transferred from dump data to a reserved volume, thereby restoring the contents of a reserved disk (see, for example, Japanese Patent Application Laid-open No. Hei 11-102262).
FIG. 5 is a block diagram showing a hardware configuration example of a disk array device 10.
As shown in FIG. 5, the disk array device 10 includes, as its components, CMs (Centralized Modules) 40, 42, 44, DAs (Device Adapters) 50, 52, and hard disks 20, 22. All the components have a multiplexed structure, so that even if one of the components has trouble, the storage system is capable of continuing its operation. The disk array device 10 shown in FIG. 5 is only one example, and the number of the CMs, DAs, and hard disks mounted on the disk array device is arbitrary.
The CMs 40, 42, 44 are modules that manage and control the whole storage system (disk array device 10). The CMs 40, 42, 44 form a redundant configuration, and the plural CMs 40, 42, 44 are capable of operating in parallel. Further, the CMs 40, 42, 44 are capable of exchanging with one another the messages necessary for host input/output (I/O) control and device maintenance control.
The DAs 50, 52 perform interface control between the CMs 40, 42, 44 and the hard disks 20, 22, and they carry out the actual reading/writing of data from/to the hard disks 20, 22. The DAs 50, 52 form a redundant configuration, and the plural DAs 50, 52 are capable of operating in parallel. Further, the DAs 50, 52 are connected to the CMs 40, 42, 44 via a bus 76, so that messages necessary for host I/O control, device maintenance control, and the like can be communicated between the CMs and the DAs.
The hard disks 20, 22 store data such as host I/O data. The plural hard disks 20, 22 form a RAID (Redundant Array of Inexpensive Disks) to maintain redundancy.
The CMs 40, 42, 44 read/write data from/to the hard disks 20, 22 by issuing a message request to the DAs 50, 52 through a communication driver.
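For illustration only, the message-request path described above may be sketched as follows. The sketch is a minimal model under stated assumptions and does not represent the actual firmware of the disk array device 10: the names Request, DeviceAdapter, and CommunicationDriver, the block-addressed layout, and the in-memory dictionaries standing in for the hard disks are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical message request sent from a CM to a DA."""
    op: str            # "read" or "write"
    disk: int          # illustrative disk number, e.g. 20 or 22
    block: int         # illustrative block address
    data: bytes = b""  # payload for a write request


class DeviceAdapter:
    """Stands in for a DA; performs the actual disk read/write."""

    def __init__(self, disks):
        # Each hard disk is modeled as a dict of block -> data.
        self.disks = {disk: {} for disk in disks}

    def handle(self, req):
        blocks = self.disks[req.disk]
        if req.op == "write":
            blocks[req.block] = req.data
            return b""
        if req.op == "read":
            return blocks.get(req.block, b"")
        raise ValueError("unknown operation: " + req.op)


class CommunicationDriver:
    """Stands in for the communication driver between a CM and a DA."""

    def __init__(self, adapter):
        self.adapter = adapter

    def send(self, req):
        # The CM never touches the disks directly; it only sends a request.
        return self.adapter.handle(req)
```

In this sketch a CM would call send() with a write request and later with a read request for the same disk and block, and the DA alone accesses the modeled disks.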
A host computer 30 is connected to the disk array device 10 via multiplexed host interfaces 90, 91, 92 and it performs data access, data backup, and the like to the disk array device 10.
Note that in the description below, the CM 42 is assumed to be a master CM managing all the CMs mounted on the disk array device 10, and the CMs 40, 44 are assumed to be slave CMs managed by the master CM 42. Further, for descriptive convenience, the operations and so on performed when trouble occurs will be described on the assumption that an abnormality occurs in the slave CM 40.
FIG. 6A and FIG. 6B are sequence diagrams showing processing operations when an abnormality occurs in the CM 40 in the disk array device 10 shown in FIG. 5. Note that in the sequence diagrams shown in FIG. 6A and FIG. 6B, the hatched rectangles indicate that the corresponding functional parts are performing some operation (the same applies to the sequence diagrams shown later).
As shown in FIG. 6A, it is assumed that an unexpected operation, such as division by 0 (zero) or unauthorized address access, is executed in a program executed by the CM 40, and the CM 40 detects its own abnormality (S101). At this time, the CM 40 stores CM operation information data for later trouble analysis in a nonvolatile memory (NVRAM) or the like (not shown) mounted on the CM 40 itself (S102).
In order to prevent the abnormality from influencing host I/O control, device maintenance control, and so on of the storage system, the CM 40, after storing the CM operation information data in the nonvolatile memory, terminates all controls and does not perform any control until it is separated from the storage system (the other CMs 42, 44).
Further, as shown in FIG. 6B, when the master CM 42, which has no abnormality, senses the abnormality of the CM 40 through a patrol operation intended for mutual monitoring among the CMs 40, 42, 44 (S151), the master CM 42 notifies the other slave CM 44, which has no abnormality, of the abnormality of the CM 40 (S152). Based on this notification, the CM 44 senses the abnormality of the CM 40, immediately terminates the controls that it has been executing according to messages received from the CM 40 before sensing the abnormality, and discards all messages sent from the CM 40 thereafter (S153).
The master CM 42 also notifies the DAs 50, 52 of the abnormality of the CM 40 (S154). Based on this notification, the DAs 50, 52, similarly to the CM 44, sense the abnormality of the CM 40, immediately terminate the controls that they have been executing according to messages received from the CM 40 before sensing the abnormality, and discard all messages sent from the CM 40 thereafter (S155).
Thereafter, the master CM 42 immediately separates the CM 40 having the abnormality from the storage system in order to prevent the CM 40 from influencing the storage system, for example, by writing broken host I/O data to the hard disks 20, 22 (S156).
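For illustration only, the behavior of the CM 40 in steps S101 and S102, followed by the termination of all controls, may be sketched as follows. This is a minimal model under stated assumptions: the class names, the record fields, and the list standing in for the nonvolatile memory are hypothetical, and IllegalAddressError is an invented exception used to represent unauthorized address access.

```python
import json


class IllegalAddressError(Exception):
    """Hypothetical stand-in for an unauthorized address access."""


class ControllerModule:
    """Hypothetical model of a CM that detects its own abnormality."""

    def __init__(self, cm_id):
        self.cm_id = cm_id
        self.nvram = []      # stands in for the nonvolatile memory
        self.halted = False  # True once all controls are terminated

    def run(self, operation):
        # A halted CM performs no control until it is separated.
        if self.halted:
            return None
        try:
            return operation()
        except (ZeroDivisionError, IllegalAddressError) as exc:
            self._on_abnormality(exc)  # corresponds to S101
            return None

    def _on_abnormality(self, exc):
        # Store operation information for later analysis (S102).
        record = {
            "cm": self.cm_id,
            "error": type(exc).__name__,
            "detail": str(exc),
        }
        self.nvram.append(json.dumps(record))
        # Only after the data is stored does the CM stop all controls.
        self.halted = True
```

In this sketch, the order of operations matters: the operation information is stored before the halted flag is set, which mirrors the description that the CM 40 terminates all controls after storing the data.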
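For illustration only, the sequence of steps S151 to S156 may be sketched as follows. This is a minimal model under stated assumptions and not the actual control program: the class names Unit and MasterCM, the is_healthy callback standing in for the patrol operation, and the dictionary of active controls are all hypothetical.

```python
class Unit:
    """Hypothetical model of a CM or DA that can be notified of a failed CM."""

    def __init__(self, name):
        self.name = name
        self.failed_peers = set()
        self.active_controls = {}  # control id -> name of originating CM

    def notify_abnormal(self, cm_name):
        # Corresponds to S153 (CM) or S155 (DA).
        self.failed_peers.add(cm_name)
        # Terminate controls started by messages from the failed CM.
        self.active_controls = {
            cid: origin
            for cid, origin in self.active_controls.items()
            if origin != cm_name
        }

    def receive(self, sender, control_id):
        # Messages from a CM known to be abnormal are discarded.
        if sender in self.failed_peers:
            return False
        self.active_controls[control_id] = sender
        return True


class MasterCM(Unit):
    """Hypothetical model of the master CM."""

    def __init__(self, name, slave_cms, adapters):
        super().__init__(name)
        self.slave_cms = dict(slave_cms)  # name -> Unit
        self.adapters = list(adapters)

    def patrol(self, is_healthy):
        """Check each slave CM; is_healthy(name) stands in for the patrol."""
        separated = []
        for name in list(self.slave_cms):
            if is_healthy(name):
                continue
            self.notify_abnormal(name)              # S151: abnormality sensed
            for other_name, other in self.slave_cms.items():
                if other_name != name:
                    other.notify_abnormal(name)     # S152
            for adapter in self.adapters:
                adapter.notify_abnormal(name)       # S154
            del self.slave_cms[name]                # S156: separate the CM
            separated.append(name)
        return separated
```

In this sketch, separation (S156) is performed only after every remaining CM and DA has been notified, so that no unit continues to act on messages from the abnormal CM.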