In recent years, with the development and spread of computers, various kinds of information have been digitized. Such digital data are stored in storage devices such as magnetic tapes and magnetic disks. Since the amount of data to be stored increases day by day and has reached an enormous level, a mass storage system is required. Moreover, in addition to reducing the cost of storage devices, reliability is required, as is the ability to easily retrieve data later. As a result, a storage system is desired that can automatically increase storage capacity and performance, that eliminates duplicated storage to reduce storage costs, and that has high redundancy.
Under such circumstances, content address storage systems have been developed in recent years, as shown in Patent Document 1. A content address storage system distributes and stores data into a plurality of storage devices, and specifies the storage position of the data by a unique content address determined depending on the content of the data.
To be specific, a content address storage system divides predetermined data into a plurality of fragments, adds a fragment of redundant data, and then stores these fragments into a plurality of storage devices, respectively. Later, by designating a content address, it is possible to retrieve the fragments stored in the storage position specified by the content address and to restore, from those fragments, the predetermined data as it was before division.
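The scheme above can be illustrated with a minimal Python sketch. The fragment count, the use of SHA-256 as the content address, and the single XOR parity fragment are all illustrative assumptions, not details taken from Patent Document 1; the toy in-memory `store` dictionary stands in for the plurality of storage devices.

```python
import hashlib

FRAGMENTS = 3  # illustrative number of data fragments


def split_with_parity(data: bytes) -> list[bytes]:
    """Divide data into equal fragments and append one XOR parity fragment."""
    # Pad so the data divides evenly; a real system would record the true length.
    size = -(-len(data) // FRAGMENTS)
    padded = data.ljust(size * FRAGMENTS, b"\x00")
    frags = [padded[i * size:(i + 1) * size] for i in range(FRAGMENTS)]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*frags))
    return frags + [parity]


def content_address(data: bytes) -> str:
    """A content address derived solely from the content of the data."""
    return hashlib.sha256(data).hexdigest()


# Toy store: a content address maps to the fragment set for that data.
store: dict[str, list[bytes]] = {}


def put(data: bytes) -> str:
    addr = content_address(data)
    store[addr] = split_with_parity(data)
    return addr


def get(addr: str) -> bytes:
    # Restore the original data from the data fragments (parity unused here).
    # rstrip works only because the example data has no trailing zero bytes.
    return b"".join(store[addr][:FRAGMENTS]).rstrip(b"\x00")
```

Because the address is a function of the content alone, any node that knows the address can locate and reassemble the data without consulting a central index.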
Further, the content address is generated so as to be unique depending on the content of the data. Therefore, in the case of duplicated data, it is possible to refer to the data in the same storage position and acquire the data having the same content. Consequently, it is unnecessary to store the duplicated data separately; duplicated recording can be eliminated and the required data capacity reduced.
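The duplicate elimination follows directly from the addressing: writing the same content twice yields the same address, so only one physical copy is kept. The sketch below assumes SHA-256 as the address function and adds a hypothetical reference count, which the source does not describe, to show why the single copy can be shared safely.

```python
import hashlib

storage: dict[str, bytes] = {}   # content address -> data, stored once
refcount: dict[str, int] = {}    # hypothetical: writers referencing each address


def write(data: bytes) -> str:
    addr = hashlib.sha256(data).hexdigest()
    if addr not in storage:       # duplicated content is not stored again
        storage[addr] = data
    refcount[addr] = refcount.get(addr, 0) + 1
    return addr


a1 = write(b"annual report")
a2 = write(b"annual report")      # a duplicate write
assert a1 == a2                   # same content, same content address
assert len(storage) == 1          # only one physical copy exists
```

Both writers hold the same address and therefore refer to the same storage position, which is exactly the behavior described above.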
Further, in the storage system described above, when a failure occurs in a storage node that stores data and the storage node is separated from the system, the components on that storage node are regenerated on other storage nodes. That is to say, because the storage system divides predetermined data into a plurality of fragments and adds a fragment of redundant data thereto, it is possible, even if a given fragment among the fragments is lost, to restore the data based on the other fragments.
Here, with reference to FIGS. 1 and 2, a process of regenerating data stored in a storage node when a failure occurs in the storage node will be described.
At first, as shown on the upper side of FIG. 1, in a storage system 300 equipped with a plurality of storage nodes 401 to 404, fragment data obtained by dividing the storage target data are distributed and stored into the respective components 1 to 12 formed on the storage nodes 401 to 404. When a predetermined storage node goes down in this state, a process of regenerating the lost fragments based on the fragments stored in the remaining storage nodes is immediately started.
To be specific, in the regeneration process, firstly, the data-storing components 10, 11 and 12 that were formed on the down storage node 404 are regenerated on the operating storage nodes 401 to 403, as shown on the lower side of FIG. 1. Then, as shown on the upper side of FIG. 2, the lost fragments are regenerated by loading the fragments 1 to 9 stored in the operating storage nodes 401 to 403, regenerating the data D stored in the down storage node 404 based on those fragments, and dividing the data D again. After that, as shown on the lower side of FIG. 2, the regenerated fragments are distributed and stored into the newly generated components 10, 11 and 12, namely, into the operating storage nodes 401 to 403, respectively. Until these processes are completed, part of the data cannot be accessed.
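The core of the regeneration step is that the redundant fragment makes any single lost fragment recoverable from the survivors. A minimal sketch, assuming the redundancy is a single XOR parity fragment (an assumption for illustration; the source does not specify the redundancy code): because the XOR of all fragments including parity is zero, one missing fragment equals the XOR of all the remaining ones.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def regenerate(surviving: list[bytes]) -> bytes:
    """Rebuild the single lost fragment by XORing all surviving fragments."""
    return reduce(xor_bytes, surviving)


# Data fragments plus their parity; suppose the node holding f2 goes down.
f1, f2, f3 = b"\x01\x02", b"\x03\x04", b"\x05\x06"
parity = xor_bytes(xor_bytes(f1, f2), f3)

lost = regenerate([f1, f3, parity])
assert lost == f2  # the lost fragment is rebuilt from the other nodes
```

This is why the system can start regenerating immediately after a node goes down: the surviving nodes jointly hold enough information to reconstruct everything the down node stored, at the cost of the load described below.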
Further, in the storage system as described above, when a storage node disconnected from the system recovers, recovery of data from the other storage nodes to the recovered storage node is immediately started. Here, with reference to FIG. 3, a data recovery process when a node recovers will be described.
At first, when the down storage node 404 recovers, the components 10, 11 and 12 that belonged to the recovered storage node 404 are returned to their original positions, as shown on the upper side of FIG. 3, and, after that, the data are transferred from the storage nodes 401 to 403 to the recovered storage node 404, as shown on the lower side of FIG. 3.
Because the fragments that were stored before the recovered storage node 404 went down already exist in the components returned to the storage node 404 in the state shown on the upper side of FIG. 3, it is enough to transfer only the data newly stored into the components generated on the other storage nodes after the storage node went down. Therefore, the data in the storage nodes 401 to 403 are compared with the data in the storage node 404, and only the difference therebetween is transferred. For example, to lighten the comparison process, only metadata such as hash values of the data are compared.
Accordingly, by using the fragments already existing in the recovered storage node 404 and transferring only the fragments of the data newly written while the storage node 404 was down, it is possible to omit unnecessary data transfer.
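The difference transfer described above can be sketched as a comparison of hash metadata rather than of the fragments themselves. The fragment identifiers, the dictionaries standing in for the nodes, and the use of SHA-256 are all illustrative assumptions.

```python
import hashlib


def digest(frag: bytes) -> str:
    """Metadata used for comparison: a hash of the fragment content."""
    return hashlib.sha256(frag).hexdigest()


def fragments_to_transfer(recovered: dict[str, bytes],
                          temporary: dict[str, bytes]) -> list[str]:
    """Compare only hash metadata: a fragment held by the other nodes is
    transferred only if the recovered node lacks it or holds different content."""
    recovered_meta = {k: digest(v) for k, v in recovered.items()}
    return sorted(k for k, v in temporary.items()
                  if recovered_meta.get(k) != digest(v))


# Node 404 recovers holding old fragments A and B; while it was down,
# fragment C was newly written and B was updated on the other nodes.
on_recovered = {"A": b"alpha", "B": b"beta"}
on_others = {"A": b"alpha", "B": b"beta2", "C": b"gamma"}

assert fragments_to_transfer(on_recovered, on_others) == ["B", "C"]
```

Only the two changed fragments cross the network; the unchanged fragment A, which survived on the recovered node, is never retransferred.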
When the down storage node recovers before the fragment regeneration process, started because the storage node went down, has been completed, regeneration of the data onto the other storage nodes is not yet complete. However, since the data fragments being regenerated originally existed in the down node, the recovery process is not affected. Moreover, when loading of data is requested before completion of the data transfer from the other storage nodes executed when the down storage node recovers, it is enough to load the fragments from the storage nodes of the transfer destinations.
[Patent Document 1] Japanese Unexamined Patent Application Publication No. 2005-235171
[Patent Document 2] Japanese Unexamined Patent Application Publication No. 2008-204206
However, in the storage system described above, when one of the storage nodes is disconnected, the data in that storage node is immediately regenerated on the other storage nodes regardless of the cause of the disconnection or the prospect of recovery, and hence the system bears a heavy load. Moreover, when the down storage node recovers, a process of recovering data from the other storage nodes is unconditionally performed, which likewise places a load on the system.
To be specific, even when a storage node becomes invisible to the system as a result of a predictable operation such as maintenance, data regeneration is started once the operation starts, with the result that the system bears a load and its performance deteriorates. Moreover, if a failure occurs while the data regeneration is in progress, there is a risk of data loss. Furthermore, even when a storage node becomes invisible only temporarily, for example because it is restarted, recovery of the data stored in that storage node is immediately performed, with the result that the system bears a load and its performance deteriorates. Besides, when a failure such as frequent restarting of a storage node due to hardware malfunction or the like occurs, the regeneration process, the data recovery and so on are performed repeatedly and frequently, with the result that the system becomes unstable.