Distributed data storage delivery systems are currently known. In such a system, user computers each having a storage device that stores data (hereinafter, each such computer is also referred to as a storage node) are connected to plural networks to form a large-capacity data storage system. The distributed data storage delivery system has a function of arranging (storing) data, and a function of managing the arrangement of the data (a distributed data arrangement management function).
With this configuration, the distributed data storage delivery system uses the distributed data arrangement management function to divide data transmitted from a user computer into plural data fragments, to make the data redundant, and to store the data in plural storage nodes. A user computer connected to the same network then uses the distributed data arrangement management function to identify a storage node that holds the target distributed data, and obtains the distributed data from that storage node.
The distributed data arrangement management function is realized by a metadata server having a centralized management function, or by a distributed data index unit whose index function performs distributed management using a distributed hash table. In the distributed data storage delivery system, large amounts of data are thus stored in the storage nodes in a distributed manner. In such a system, the arrangement of the data greatly affects the performance, fault tolerance, and availability of the system.
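As an illustration only, an index function based on a distributed hash table can be sketched as a hash ring that maps a fragment identifier to the storage nodes responsible for it. The class name HashRingIndex, the method locate, the node names, the use of SHA-1, and the default of two copies are all assumptions made for this sketch and are not taken from any particular system; the sketch also assumes at least one storage node.

```python
import bisect
import hashlib


def _ring_position(key: str) -> int:
    """Map a string to a point on the hash ring (SHA-1 is an illustrative choice)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class HashRingIndex:
    """Hypothetical index function: fragment identifier -> storage nodes."""

    def __init__(self, node_names, copies=2):
        # Assumption: node_names is non-empty and the names are distinct.
        self.copies = copies
        self._ring = sorted((_ring_position(name), name) for name in node_names)
        self._points = [point for point, _ in self._ring]

    def locate(self, fragment_id: str):
        """Return the first node clockwise from the fragment's ring position,
        followed by its successors, which hold the redundant copies."""
        start = bisect.bisect_right(self._points, _ring_position(fragment_id))
        count = min(self.copies, len(self._ring))
        return [
            self._ring[(start + i) % len(self._ring)][1]
            for i in range(count)
        ]


index = HashRingIndex(["node-a", "node-b", "node-c"])
holders = index.locate("fragment-0001")  # two distinct node names
```

Because every computer evaluates the same hash function, any user computer can identify the storage nodes holding a fragment without consulting a central metadata server, which is the property that distinguishes this approach from centralized management.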
For example, when a magnetic disk drive is used as the storage device, the access performance can be increased by storing two pieces of data that are highly likely to be accessed consecutively in physically contiguous areas on the same magnetic disk drive. Further, when the performance of the network connecting the storage nodes is low, the performance can be improved by storing data that are used at the same time in plural different storage nodes. However, when there is an access pattern that severely degrades the performance, the system cannot respond to a large volume of accesses, which degrades the availability.
As described above, it is important to distribute and arrange the data in a manner that matches the patterns of simultaneous access or continuous access, and the usage tendency of the data, such as how the presence or absence of access changes over time.
Further, the distributed data storage delivery system may have a data re-arrangement function in which data that have already been stored are transferred to another storage node through a dynamic data migration function, whereby the arrangement of the data can be changed.
Examples thereof include a case where another storage node is better suited to the user's usage in terms of the network configuration, and a case where plural pieces of data that are used simultaneously are read out in parallel from different storage nodes to improve the system performance.
Such re-arrangement is performed using a function of transferring the data through the network, and a function of changing the data registered in the index function included in the distributed data arrangement management function.
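The two steps involved in re-arrangement, transferring the data and then changing the registration in the index, can be sketched as follows. This is a minimal in-memory sketch: the function name migrate, the representation of storage nodes as dictionaries, and the representation of the index as a mapping from data identifier to node name are assumptions made for illustration, and a real transfer would take place over the network.

```python
def migrate(index, nodes, data_id, destination):
    """Move one piece of data to another storage node (hypothetical helper).

    index: dict mapping data identifier -> name of the node holding it
    nodes: dict mapping node name -> dict of {data identifier: payload}
    """
    source = index[data_id]
    if source == destination:
        return  # already in the requested arrangement

    # Step 1: transfer the data (stands in for a copy over the network).
    nodes[destination][data_id] = nodes[source][data_id]

    # Step 2: change the registration in the index so that lookups
    # resolve to the new storage node.
    index[data_id] = destination

    # Step 3: release the old copy only after the index has been updated,
    # so the data can be found at every point during the migration.
    del nodes[source][data_id]


nodes = {"node-a": {"video-1": b"payload"}, "node-b": {}}
index = {"video-1": "node-a"}
migrate(index, nodes, "video-1", "node-b")
```

The ordering in this sketch (copy, update the index, then delete) is one possible design choice; it keeps the data reachable throughout the migration at the cost of temporarily holding two copies.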
However, the data re-arrangement performed by the dynamic data migration function requires access to the storage devices, and hence it takes several minutes to several hours to complete the re-arrangement of the data into a suitable arrangement.
Further, in the distributed data storage delivery system, it is possible to increase or decrease the number of storage nodes in the system. For example, a storage node can be added to the distributed data storage delivery system in the case where the system lacks data supplying ability or data storage capacity. Further, when any of the storage nodes breaks down or the amount of data that the system handles decreases, the distributed data storage delivery system can reduce the number of storage nodes. Such a change in the system configuration can be made on the basis of the system configuration information held by the distributed data arrangement management function, together with a corresponding change in the index function.
In some cases, a large volume of data previously stored in another system is inserted into a distributed data storage delivery system that has the configuration described above and in which no data have yet been registered.
One example of inserting data in this way is a case where backup data are restored. First, the distributed data storage delivery system needs to generate a backup of the stored data in another storage device (for example, a backup storage device) in order to prepare for loss of data due to breakdown of the entire distributed data storage delivery system.
Examples of causes of breakdown of the entire system include trouble with a power source or building facilities, software malfunction, and natural disasters. As the backup storage device, it is possible to use a tape device, a disk array, or another distributed data storage delivery system, for example.
It should be noted that, in a backup system for making a backup in the distributed data storage delivery system, it is necessary to make the backup of the entire system at a synchronized, stationary point. This is because delays occur in transmitting instructions between the computers connected in parallel to a network, which makes it difficult to match the backup generation time across the nodes. Further, since data are transferred between the nodes, part of the data may be lost or duplicated if the backup generation times differ between the nodes.
To deal with this, a data management unit that manages the data to be backed up stores a state, called a snapshot, that represents the data set at a certain point in time and is consistent throughout the entire system. The snapshot can be generated, for example, by a method described in Patent Document 1. Then, the data of the snapshot are transmitted to a backup storage device as the data to be backed up.
Further, Non-patent Document 1 describes a method of generating a snapshot to back up the data of the storage devices connected in parallel, and backing up data of the generated snapshot.
One example of a method for storing a backup of data in the distributed data storage delivery system is to transmit the data stored in each storage node to a predetermined backup storage device after generating a snapshot. In this method, it is necessary to manage the backup data or the backup device for each storage node, which imposes a heavy workload on the manager.
Further, as another method, there may be a method of dividing data into data clusters with fixed lengths such as blocks and chunks or into semantically divided data clusters such as files, and transmitting the data clusters together with identifiers uniquely representing the respective data clusters to the backup storage device to store them.
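The division into fixed-length data clusters with unique identifiers can be sketched as follows. The identifier scheme (object name plus a zero-padded sequence number), the function names split_into_clusters and back_up, the tiny cluster size, and the use of a dictionary as the backup storage device are all assumptions chosen for illustration.

```python
def split_into_clusters(name: str, data: bytes, cluster_size: int = 4):
    """Divide data into fixed-length clusters keyed by a unique identifier.

    The identifier format "<name>#<sequence number>" is a hypothetical choice;
    any scheme works as long as each cluster receives a unique identifier.
    """
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    return {
        f"{name}#{offset // cluster_size:08d}": data[offset:offset + cluster_size]
        for offset in range(0, len(data), cluster_size)
    }


def back_up(name: str, data: bytes, backup_store: dict, cluster_size: int = 4):
    """Send each cluster with its identifier to the backup store (a dict here)."""
    clusters = split_into_clusters(name, data, cluster_size)
    backup_store.update(clusters)
    return list(clusters)  # identifiers, in the original order


backup_store = {}
ids = back_up("report.txt", b"abcdefghij", backup_store)
# ids == ["report.txt#00000000", "report.txt#00000001", "report.txt#00000002"]
```

Since the identifiers do not refer to any storage node, the backup can be managed as a single set of data clusters rather than separately for each storage node.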
Then, in the case where data are lost due to trouble in the distributed data storage delivery system, the manager restores the data from the backup storage device to the distributed data storage delivery system after that system has been repaired or newly constructed.
The backup data stored in the backup storage device are copied, for each of the data clusters divided at the time of storing, onto storage nodes in the distributed data storage delivery system after restoring. The storage nodes that serve as copy destinations depend on the configuration of the distributed data storage delivery system after restoring, and are determined by the data arrangement management function of the distributed data storage delivery system after restoring.
A further copy of the data that have been copied onto a storage node may be arranged in another storage node. This copying is made to prevent the data from being lost due to trouble in a storage node, and its destination is also determined by the data arrangement management function of the distributed data storage delivery system after restoring.
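The restoring procedure described above can be sketched as a loop that asks an arrangement function of the restored system where each data cluster belongs and then copies the cluster to the chosen node and to one further node. The functions place and restore, the hash-based placement rule, and the dictionary-based storage nodes are assumptions made for this sketch; an actual data arrangement management function may decide destinations quite differently.

```python
import hashlib


def place(cluster_id: str, node_names, copies: int = 2):
    """Hypothetical arrangement rule of the restored system: choose nodes
    from the current configuration based on a hash of the identifier."""
    ordered = sorted(node_names)  # assumption: at least one node exists
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(ordered)
    count = min(copies, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(count)]


def restore(backup_store: dict, nodes: dict, copies: int = 2):
    """Copy every backed-up cluster onto the nodes of the restored system.

    Returns the rebuilt index: cluster identifier -> list of holding nodes.
    """
    index = {}
    for cluster_id, payload in backup_store.items():
        targets = place(cluster_id, nodes.keys(), copies)
        for node_name in targets:
            # First target is the primary copy; the rest guard against
            # loss of data due to trouble in a single storage node.
            nodes[node_name][cluster_id] = payload
        index[cluster_id] = targets
    return index


backup_store = {"report.txt#00000000": b"abcd", "report.txt#00000001": b"efgh"}
restored_nodes = {"node-a": {}, "node-b": {}, "node-c": {}}
restored_index = restore(backup_store, restored_nodes)
```

Because placement is computed from the node set of the restored system, the same backup can be restored onto a system whose number of storage nodes differs from that of the system that produced the backup.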
As yet another example, in the case where data in a conventionally operated system are transferred to a distributed data storage delivery system that is newly configured and has a high performance, a large volume of data are inserted from the old system to the new system.
In this case, the data stored in the old system are divided into data clusters with fixed lengths such as blocks and chunks or into semantically divided data clusters such as files, and are copied onto the new distributed data storage delivery system, together with identifiers uniquely representing the respective data clusters.