1. Field of the Invention
The present invention relates to a storage management process, a storage management apparatus, and a computer-readable medium storing a storage management program for managing data in a plurality of storage nodes in a distributed manner.
2. Description of the Related Art
In disk drives, a block or sector can fail not only at installation but also during operation. When a block in which data were initially written correctly becomes defective, the data written in the bad block cannot be read out in the worst case (i.e., a data loss occurs). (A data loss means that data stored in a storage system is lost. For example, in the case where identical data is stored in duplicate in two different storage nodes, no data loss occurs even when the data stored in one of the two storage nodes is lost, as long as the data stored in the other storage node remains.)
As mentioned above, a bad block is a serious risk factor in data preservation. A bad block can be detected only by actually accessing it. In some types of RAID (Redundant Array of Independent Disks) systems, a RAID controller periodically checks the media in the disk drives. When the RAID controller detects a bad block, the RAID controller reconstructs the data stored in the bad block on the basis of data stored in another disk drive, and writes the reconstructed data in the disk drive having the bad block. At this time, the data is written in a region which is different from the bad block and is called an alternative block. Thereafter, when a request for access to a region in the bad block occurs, the RAID controller accesses the alternative block instead of the bad block, so that it looks as if the RAID system has no bad block and the data are stored in the initial positions. The writing in the alternative block is disclosed in, for example, “Patrol Function Enhancing Data Reliability,” published in Japanese on the Internet by Fujitsu Limited at the URL “http//:storage-system.fujitsu.com/jp/products/iadiskarray/feature/a01/” and searched for by the applicant on Aug. 31, 2006.
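The patrol and alternative-block behavior described above can be illustrated with a minimal sketch. It assumes a simple two-disk mirror rather than any particular RAID level, and every name in it (`Disk`, `reassign`, `patrol`, the spare-block layout) is hypothetical and not taken from the cited Fujitsu document.

```python
class Disk:
    """Toy disk model (hypothetical): data blocks followed by spare blocks."""

    def __init__(self, num_blocks, num_spares):
        self.num_blocks = num_blocks
        self.blocks = [b""] * (num_blocks + num_spares)
        self.bad = set()    # physical block numbers that can no longer be read
        self.remap = {}     # logical block -> alternative (spare) block
        self.free_spares = list(range(num_blocks, num_blocks + num_spares))

    def _physical(self, lba):
        # Follow the remap table so callers never see the bad block.
        return self.remap.get(lba, lba)

    def read(self, lba):
        phys = self._physical(lba)
        if phys in self.bad:
            raise IOError(f"bad block {phys}")
        return self.blocks[phys]

    def write(self, lba, data):
        phys = self._physical(lba)
        if phys in self.bad:
            raise IOError(f"bad block {phys}")
        self.blocks[phys] = data

    def reassign(self, lba, data):
        # Store the recovered data in an alternative block and remember the mapping.
        spare = self.free_spares.pop(0)
        self.remap[lba] = spare
        self.blocks[spare] = data


def patrol(disk, mirror):
    """Read every block of `disk`; repair unreadable ones from `mirror`."""
    repaired = []
    for lba in range(disk.num_blocks):
        try:
            disk.read(lba)
        except IOError:
            disk.reassign(lba, mirror.read(lba))
            repaired.append(lba)
    return repaired


# Usage: block 2 of the primary disk goes bad after the data was written.
primary, mirror = Disk(4, 2), Disk(4, 2)
for i, payload in enumerate([b"a", b"b", b"c", b"d"]):
    primary.write(i, payload)
    mirror.write(i, payload)
primary.bad.add(2)
repaired = patrol(primary, mirror)
```

After the patrol, a read of logical block 2 on the primary disk is served from the alternative block, so the caller sees the data at its initial position.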
In addition, a malfunction in a disk device can cause data to be unintentionally written in a position different from the position in which the data should be written. In this case, as distinct from the case of the bad block, data reading operations can still be performed, but the data is nevertheless lost.
In a system which has been proposed as a countermeasure against the above data loss, data are redundantly and distributedly stored over multiple computers (nodes). When a failure occurs in a node in the above system, it is possible to restore data stored in the failed node on the basis of data stored in another node, for example, as disclosed in Japanese Unexamined Patent Publication No. 2000-076207, paragraph No. 0046.
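For illustration only, the following minimal sketch shows one way in which data stored redundantly over multiple nodes can be used to restore a failed node. The placement rule, the function names (`place`, `store`, `restore_node`), and the use of in-memory dictionaries as nodes are all assumptions made for this sketch and are not taken from the cited publication.

```python
def place(key, node_names, copies=2):
    """Hypothetical placement rule: pick `copies` consecutive nodes for a key."""
    start = sum(key.encode()) % len(node_names)
    return [node_names[(start + i) % len(node_names)] for i in range(copies)]


def store(nodes, key, value):
    """Write the value to every node chosen by the placement rule."""
    for name in place(key, sorted(nodes)):
        nodes[name][key] = value


def restore_node(nodes, failed):
    """Rebuild the contents of `failed` from the copies held by the other nodes."""
    nodes[failed] = {}
    for name, contents in nodes.items():
        if name == failed:
            continue
        for key, value in contents.items():
            if failed in place(key, sorted(nodes)):
                nodes[failed][key] = value
    return nodes[failed]


# Usage: three nodes, each item stored on two of them.
nodes = {"n1": {}, "n2": {}, "n3": {}}
for key in ["a", "b", "c", "d"]:
    store(nodes, key, key.upper())

snapshot = dict(nodes["n2"])   # what n2 held before the failure
nodes["n2"] = {}               # simulate loss of everything on n2
restored = restore_node(nodes, "n2")
```

In this sketch, every item that was held by the failed node exists in only one copy until `restore_node` completes.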
However, according to the technique disclosed in Japanese Unexamined Patent Publication No. 2000-076207, the operation for restoring a node is performed only after a failure occurs in the node. When a failure occurs in a node, data stored in another node is accessed during the operation for restoration, so that it is unnecessary to stop the service. However, the data redundancy is not regained until the failed node is restored, so that the system reliability is lowered during the operation for restoration. In addition, since it takes a long time to restore the entire node, the system must be used with the lowered reliability for a long time.
In the above circumstances, it is desired to detect a sign of a data loss in a system in which data are redundantly and distributedly stored over multiple nodes, and to remove the cause of the data loss before data is actually lost. Further, even when a data loss has already occurred, it is necessary to perform the operation for restoring the node on a per-data-access basis.