In a distributed storage system, a large number of data items to be processed (referred to below as “object data items”) are stored in a distributed manner across a plurality of storage nodes. Here, a storage node refers to a storage apparatus or the like that stores the object data items.
In the distributed storage system, a large number of storage nodes store the object data items. In this case, how to manage the storage nodes and the object data items that they store becomes a problem. Heretofore, a method has been proposed in which a small number of fixed management nodes manage the storage nodes and the object data items stored therein. With this method, however, the management nodes tend to become bottlenecks and the system lacks scalability. To overcome these problems, a system has been proposed in which this management is performed by a large number of management nodes. In such a system, since the processing that controls the storage nodes storing the object data items is not concentrated in one or a small number of fixed management nodes, data can be stored in a distributed manner across a large number of storage nodes more efficiently.
In a system that has no fixed management node performing centralized control, it is difficult to record and manage, in a table, which storage nodes store which object data items. This is because such a table becomes very large as the number of object data items increases, and it becomes difficult to keep the large table synchronized among a large number of management nodes.
Therefore, a method has been proposed in which the storage nodes that are to store the object data items are determined by an algorithm. When the storage nodes are determined by an algorithm based on data information such as object data item identifiers, it is no longer necessary to synchronize a large table among a large number of management nodes, and a large number of object data items can be handled.
Patent Literature 2010-184109 (referred to below as Patent Literature 1), Non Patent Literature (NPL) 1, and NPL 2 propose methods of this kind, in which object data items are stored in a large number of storage nodes without using fixed management nodes that perform centralized control of the entire system.
FIG. 8 is a diagram showing an example of a system configuration of a distributed storage system related to the present invention. The distributed storage system 100 shown in FIG. 8 is an example of a distributed storage system that does not use the abovementioned fixed management nodes.
The distributed storage system 100 is provided with a plurality of storage nodes 101 and client nodes 110. On receiving a read command from a client node 110, a storage node 101 executes the command to read an object data item that it stores, and returns the object data item to the client node 110 that issued the command. On receiving a write command from a client node 110, the storage node 101 executes the command to write, and thereby store, the object data item sent from the client node 110.
As an example, a description is given of a case where processing to read an object data item having a data identifier “000001” is executed in the distributed storage system 100 shown in FIG. 8. For this processing, a unique number is assigned in advance to each storage node 101. All client nodes 110 have a storage table 111 that stores information on all the storage nodes 101 and the numbers allocated to the storage nodes 101. FIG. 9 shows an example of the storage table 111.
When the processing starts, the client node 110 first generates pseudo random numbers using the data identifier “000001” as a seed (random number seed), such that every number allocated to a storage node 101 in the storage table 111 is generated with equal probability. The algorithm used to generate the pseudo random numbers is one that generates the same random number sequence for the same seed. Generation of pseudo random numbers is repeated until a pseudo random number is generated that matches a number allocated to a storage node 101. When such a pseudo random number is generated, the client node 110 sends an access command to the storage node 101 to which that number has been allocated. Next, that storage node 101 executes the received access command and obtains the object data item. Finally, the storage node 101 sends the object data item to the client node 110, and the read processing is completed.
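The node selection step described above can be illustrated with a minimal Python sketch. It is not taken from Patent Literature 1 or the NPLs; the names `STORAGE_TABLE`, `NUMBER_SPACE`, and `locate_node`, the node addresses, the use of SHA-256 to turn the data identifier into a seed, and the assumption that the generator draws from a fixed range of numbers (some of which may be unallocated) are all hypothetical choices made for illustration.

```python
import hashlib
import random

# Hypothetical storage table: number allocated to a node -> node address.
# Numbers 2 and 5..7 are deliberately left unallocated.
STORAGE_TABLE = {0: "node-a", 1: "node-b", 3: "node-c", 4: "node-d"}
NUMBER_SPACE = 8  # assumed fixed range [0, 8) from which numbers are drawn


def locate_node(data_id, table, number_space=NUMBER_SPACE):
    """Return the number of the storage node selected for data_id."""
    if not any(0 <= n < number_space for n in table):
        raise ValueError("no storage node is allocated within the number space")
    # Derive a stable integer seed from the data identifier (assumed scheme).
    digest = hashlib.sha256(data_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    # The same seed always yields the same sequence, so every client that
    # holds the same table reaches the same node without any coordination.
    while True:
        candidate = rng.randrange(number_space)  # uniform over the range
        if candidate in table:  # repeat until an allocated number appears
            return candidate


number = locate_node("000001", STORAGE_TABLE)
print(number, STORAGE_TABLE[number])
```

A client node would then send the read command to the address found in the table. Because the result depends only on the data identifier and the table contents, no per-object location record has to be stored or synchronized.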
In the distributed storage system 100, the object data items are distributed approximately uniformly among the storage nodes 101. When a storage node 101 is added to or removed from the distributed storage system 100, the approximately uniform distribution can be maintained by moving only a minimum number of object data items.
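The following Python sketch illustrates why only a small number of object data items need to move when the set of storage nodes changes. It is an illustration under stated assumptions rather than the method of any cited document: `locate_node`, `moved_items`, the fixed number space of 16, the SHA-256 seeding, and the node numbers are hypothetical. Under these assumptions, removing a node relocates only the items that were stored on that node, and adding a node relocates only the items that now map to the new node.

```python
import hashlib
import random


def locate_node(data_id, allocated, number_space=16):
    """Return the first allocated number in the sequence seeded by data_id."""
    digest = hashlib.sha256(data_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    while True:
        candidate = rng.randrange(number_space)
        if candidate in allocated:
            return candidate


def moved_items(data_ids, before, after):
    """Return {data_id: (old_node, new_node)} for items whose node changes."""
    moves = {}
    for data_id in data_ids:
        old = locate_node(data_id, before)
        new = locate_node(data_id, after)
        if old != new:
            moves[data_id] = (old, new)
    return moves


data_ids = [f"{i:06d}" for i in range(1, 2001)]
nodes = {0, 1, 2, 3}  # numbers allocated to the current storage nodes

removed = moved_items(data_ids, nodes, nodes - {2})  # node 2 leaves
added = moved_items(data_ids, nodes, nodes | {7})    # node 7 joins

# Every relocated item either came from the removed node or goes to the new one.
print(all(old == 2 for old, _ in removed.values()))
print(all(new == 7 for _, new in added.values()))
print(len(removed), len(added), len(data_ids))
```

The property follows from the fact that each identifier's random sequence is fixed: an item whose first allocated number is unaffected by the change keeps its node, so only items whose first match is the removed or the newly added number are relocated.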
In this way, according to the abovementioned method, it is possible to store object data items in a large number of storage nodes 101 by an algorithm, without using fixed specific nodes that perform centralized control of the entire system. That is, according to the abovementioned system configuration, object data items can be distributed uniformly among a large number of storage nodes 101, and flexible handling is possible even in a case where the number of storage nodes 101 changes.
NPL 1:
    David Karger, Eric Lehman, Tom Leighton, Matthew Levine, Daniel Lewin, Rina Panigrahy, “Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web,” 1997.
NPL 2:
    Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Carlos Maltzahn, “CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data,” 2006.