In enterprises and data centers, storage devices including plural disks are used to store large amounts of data. Recently, in addition to conventional transaction processing systems such as sales management, stock control, and financial management, new businesses such as electronic commerce systems using the Internet and systems for web services have been increasing, and the storage capacity required to store the resulting data has been increasing year by year. Therefore, storage devices are required to allow continuous addition of disk storage devices to accommodate increasing data capacity. Since halting a main transaction processing system of an enterprise causes an especially great loss, a storage device used in such a system is required to be reliable enough to minimize the influence of failures. As storage devices meeting these requirements, a variety of disk array devices such as those described in patent document 1 and patent document 2 are marketed by storage device vendors.
The configuration of a conventional storage device will be described with reference to FIG. 9.
The storage device 901 is connected to a computer 902 that makes data access to the storage device 901. The storage device 901 includes a disk storage device part 910, and disk controllers 9041 and 9042. The disk storage device part 910 comprises plural disk storage devices.
The disk controller 9041 and the like, which control data access between the computer 902 and the disk storage device part 910, include cache memories 9071 and 9072, and microprocessors (MP) 9061 and 9062. The cache memory 9071 and the like temporarily store data to speed up data access. The microprocessor 9061 and the like manage the cache memory 9071 and the like, control communication between the computer 902 and the disk controller 9041 and the like, control communication between the disk controller 9041 and the like and the disk storage device part 910, and manage the storage of data to the disk storage device part 910.
Between the computer 902 and the disk controllers 9041 and 9042, an interface card (IO) 903 within the computer 902 is connected with host adapter cards (HA) 9051 and 9052 of the disk controller 9041 and the like through optical fiber or metallic cables 9111 and 9112, and data is transmitted and received between them using protocols such as Fibre Channel and SCSI. Likewise, between the disk controllers 9041 and 9042 and the disk storage device part 910, data is transmitted and received using protocols such as Fibre Channel and SCSI through disk adapter cards (DA) 9081 and 9082, and optical fiber or metallic cables 9121 and 9122.
Next, a description is made of how the computer 902 accesses the storage device 901. When the computer 902 accesses (writes or reads) data stored in the storage device 901, the computer 902 selects one (9041, for example) of the two disk controllers 9041 and 9042 and accesses the disk storage device part 910 through the disk controller 9041. The accessed data is temporarily stored in the cache memory 9071. Since the computer 902 then accesses a semiconductor memory, which has a higher data access speed than the disk storage device part 910, data access speed improves.
Here, the data of the cache memory 9071 is copied to the cache memory 9072 of the other disk controller 9042 through an inter-memory channel 909 (mirroring). Since the disk controllers are thus duplicated, even if a failure occurs in one of the disk controllers, data access can be made to the other, providing high reliability for the storage devices.
[Patent document 1] U.S. Pat. No. 6,438,647
[Patent document 2] U.S. Pat. No. 6,330,642
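The cache mirroring between the two disk controllers described above can be illustrated by the following minimal sketch. It is not taken from the cited patent documents; the class DiskController, its methods, and the use of a dictionary to stand in for the disk storage device part 910 are all hypothetical names chosen for illustration, and write-back caching is assumed.

```python
class DiskController:
    """Hypothetical model of one disk controller (e.g. 9041) with a cache memory."""

    def __init__(self, name, disk):
        self.name = name
        self.disk = disk      # dict standing in for disk storage device part 910
        self.cache = {}       # block number -> data held in cache memory
        self.dirty = set()    # blocks cached but not yet written to disk
        self.peer = None      # the other controller, reached over channel 909

    def write(self, block, data):
        # The write completes in semiconductor memory, not on disk.
        self.cache[block] = data
        self.dirty.add(block)
        if self.peer is not None:
            # Mirroring: copy the same data into the peer's cache memory.
            self.peer.cache[block] = data
            self.peer.dirty.add(block)

    def read(self, block):
        # Serve from cache when possible; otherwise load from disk and cache it.
        if block not in self.cache:
            self.cache[block] = self.disk[block]
        return self.cache[block]

    def destage(self):
        # Write all not-yet-written cache data to disk.
        for block in sorted(self.dirty):
            self.disk[block] = self.cache[block]
        if self.peer is not None:
            self.peer.dirty -= self.dirty
        self.dirty.clear()


if __name__ == "__main__":
    disk = {}
    c1 = DiskController("9041", disk)
    c2 = DiskController("9042", disk)
    c1.peer, c2.peer = c2, c1
    c1.write(7, b"payload")
    # The data is readable through either controller before it reaches disk.
    print(c2.read(7), 7 in disk)
```

In this sketch a write is held in both cache memories before it reaches the disk, which is why either controller can continue to serve the data if the other fails.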
FIG. 10 is a sequence diagram showing a failure recovery procedure when a failure occurs in a disk controller 1 (9041, for example) in a conventional storage device (FIG. 9). When a failure occurs in the disk controller 1 (S1001), the other of the two disk controllers, a disk controller 2 (9042, for example), recognizes the failure by a failure monitoring method such as heartbeat (S1002). The disk controller 2 notifies the computer 902 of the occurrence of the failure in the disk controller 1 (S1004), and prohibits new access (S1005, S1006). At this point, the cache data on the cache memory of the disk controller 2 that has not yet been written to the disk storage device part 910 (not-yet-written cache data) is no longer duplicated. Therefore, if a failure occurs in the disk controller 2 as well, the not-yet-written data cannot be restored. This is referred to as data loss, which poses a serious problem in terms of reliability. As failure recovery processing for preventing the data loss, the disk controller 2 immediately writes the not-yet-written cache data to the disk storage device part 910 (S1007, S1008).
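The recovery steps described above can be sketched as follows. This is an illustrative model only, not the procedure of any actual product: the names SurvivingController, peer_has_failed, and HEARTBEAT_TIMEOUT, the timeout value, and the callback used to notify the computer are all assumptions. The step numbers in the comments refer to FIG. 10.

```python
HEARTBEAT_TIMEOUT = 3.0  # seconds; an assumed value for illustration


def peer_has_failed(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Heartbeat monitoring: the peer is presumed failed if it has been silent too long."""
    return now - last_heartbeat > timeout


class SurvivingController:
    """Hypothetical model of disk controller 2 after disk controller 1 fails."""

    def __init__(self, disk, cache, dirty):
        self.disk = disk            # dict standing in for disk storage device part 910
        self.cache = dict(cache)    # block number -> cached data
        self.dirty = set(dirty)     # not-yet-written cache data
        self.accepting_io = True

    def write(self, block, data):
        if not self.accepting_io:
            raise RuntimeError("access prohibited during failure recovery")
        self.cache[block] = data
        self.dirty.add(block)

    def recover(self, notify_host):
        notify_host("peer controller failed")   # S1004: notify the computer
        self.accepting_io = False                # S1005, S1006: prohibit new access
        for block in sorted(self.dirty):         # S1007, S1008: write dirty data to disk
            self.disk[block] = self.cache[block]
        self.dirty.clear()
        self.accepting_io = True                 # S1009: permit new access again


if __name__ == "__main__":
    disk = {}
    ctl = SurvivingController(disk, {1: "a", 2: "b"}, {1, 2})
    if peer_has_failed(last_heartbeat=0.0, now=5.0):   # S1002: recognize the failure
        ctl.recover(print)
    print(disk)
```

While recover() is running, write() raises an error; the duration of that loop over the not-yet-written data corresponds to the period in which the computer cannot access the storage device.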
However, writing to the disk storage device part 910 requires more time than writing to the cache memory 9071 and the like, which are semiconductor memories. Therefore, the access prohibition period FT for the storage device becomes long, exerting a serious influence on operations.
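As a rough illustration of why the period FT becomes long, the following sketch estimates it as the amount of not-yet-written cache data divided by the sustained write throughput to the disks. This simple model and the function name estimate_prohibition_seconds are assumptions made here for illustration, and the figures in the usage example are arbitrary, not measurements.

```python
def estimate_prohibition_seconds(dirty_bytes, disk_write_bytes_per_s):
    """Assumed model: FT is roughly (not-yet-written data) / (disk write throughput)."""
    if disk_write_bytes_per_s <= 0:
        raise ValueError("disk write throughput must be positive")
    return dirty_bytes / disk_write_bytes_per_s


if __name__ == "__main__":
    GIB = 1024 ** 3
    MIB = 1024 ** 2
    # Arbitrary example figures: 8 GiB of dirty cache, 200 MiB/s to disk.
    print(estimate_prohibition_seconds(8 * GIB, 200 * MIB))
```

Under this model, FT grows in proportion to the amount of not-yet-written cache data, so larger cache memories lengthen the period during which access is prohibited.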
After writing the not-yet-written cache data to the disk storage device part 910, the disk controller 2 permits new access by the computer 902 (S1009). The computer 902 starts data access to the disk controller 2 (S1011). The disk controller 1 in which the failure occurred performs failure recovery by being reset or replaced (S1012). The disk controller 1 notifies the disk controller 2 of the failure recovery (S1013).
The disk controller 2 recognizes this event (S1014), and a duplex system consisting of the two disk controllers 1 and 2 is configured to return to a normal state (S1015, S1016). As a result, the system recovers to the normal state, and data access speed is increased by temporarily storing data in the cache memory during data write access.
However, in the period ST from when new access is permitted (S1009) until the system returns to the normal state (S1015, S1016), data access is made only to the disk controller 2 because the disk controller 1 in which the failure occurred is not operating. Accordingly, for data write access, the data must be written to the disk each time access is made in order to prevent data loss. As a result, the data access speed in the period ST decreases greatly.
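The difference between write handling in the normal state and in the period ST can be sketched as follows. The class Controller, the flag peer_alive, and the counter disk_writes are hypothetical names introduced only for this illustration; the sketch assumes that a write is held only in cache memory while a mirrored copy exists, and goes straight to disk otherwise.

```python
class Controller:
    """Hypothetical model of disk controller 2 in normal and degraded operation."""

    def __init__(self, disk):
        self.disk = disk          # dict standing in for disk storage device part 910
        self.cache = {}           # block number -> cached data
        self.dirty = set()        # not-yet-written cache data
        self.peer_alive = True    # False during the period ST
        self.disk_writes = 0      # counts slow writes to disk

    def write(self, block, data):
        self.cache[block] = data
        if self.peer_alive:
            # Normal state: the write is complete once it is in cache memory,
            # because the mirrored copy on the peer protects it.
            self.dirty.add(block)
        else:
            # Period ST: no mirrored copy exists, so every write goes to disk
            # immediately to prevent data loss.
            self.disk[block] = data
            self.disk_writes += 1


if __name__ == "__main__":
    disk = {}
    ctl = Controller(disk)
    ctl.write(1, "a")             # normal state: no disk write yet
    ctl.peer_alive = False        # peer has failed; period ST begins
    ctl.write(2, "b")             # degraded state: written to disk at once
    print(ctl.disk_writes, disk)
```

Every write in the degraded branch pays the disk latency, which is the source of the reduced data access speed in the period ST.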
A change of device configuration may also be required for reasons other than the occurrence of a failure, such as maintenance work. However, halting a system for each such event causes a great loss, especially in the case of a main transaction processing system of an enterprise.