In recent years, as the volume of data handled by information processing devices has increased, distributed file systems that store data across a plurality of distributed information processing devices have become well known. For example, GFS (Google File System) of Google Inc. (here, Google is a registered trademark) and the open source HDFS (Hadoop Distributed File System) realize storage with a capacity of more than 1 PB (petabyte) by combining a plurality of information processing devices. Such a distributed file system can store data that increases on a daily basis, such as web pages and log data. Data stored in GFS and HDFS is efficiently processed in a distributed manner by distributed processing frameworks such as MapReduce and Hadoop, respectively.
Such a distributed file system is generally equipped with a function for generating copies of the data to be stored (object data). Object data is copied mainly for the two purposes described below.
The first purpose is to secure the fault tolerance of the file system. Because a distributed file system is composed of a plurality of information processing devices, any one of those devices may fail. Therefore, a distributed file system creates a copy of object data and stores the copy in a different information processing device. As a result, the object data is always backed up. Even when a certain information processing device fails, the object data is not lost from the distributed file system as a whole, because a copy of the object data is stored in a different information processing device.
The second purpose is to mitigate the concentration of access on identical data. That is, a distributed file system makes copies of specific object data that is frequently accessed, and stores the copies separately in a plurality of information processing devices included in the distributed file system. As a result, even when many programs simultaneously issue read requests for specific object data, the load is distributed among the information processing devices constituting the distributed file system. Such a distributed file system can therefore provide data access free from bottlenecks.
Here, an example of related technology for determining a placement destination of object data to be stored in such a distributed file system will be described. Hereinafter, it is supposed that the information processing devices constituting a distributed file system are housed in racks (server racks) installed in a data center. It is also supposed that the plurality of information processing devices housed in one rack are connected by a communication network (hereinafter abbreviated as “network”) so that they can communicate with each other. Moreover, it is supposed that information processing devices housed in different racks are also connected by a network so that they can communicate with each other. Generally, the bandwidth of network communication between information processing devices in different racks is narrow compared with that of network communication between information processing devices in an identical rack.
For example, such a distributed file system first replicates one piece of object data so that three pieces of object data exist (one original and two copies). Then, the distributed file system places the first piece of object data in one information processing device housed in a certain rack, places the second piece in a different information processing device housed in the identical rack, and places the third piece in an information processing device housed in a rack different from the former rack. As a result, the object data is stored using a plurality of racks. Accordingly, even when a failure occurs in one rack, access to the object data is guaranteed. In the example mentioned above, because only two racks are used, the cost of writing and updating object data is small compared with a case where each of the three pieces of object data is placed in a different rack. Therefore, a distributed file system that determines a placement destination of object data in this way improves the performance of writing and updating while maintaining the reliability of the stored object data.
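The two-rack placement described above can be illustrated with a minimal sketch. The function name `place_replicas`, the rack and device names, and the use of random selection are assumptions made for illustration only; the description above does not specify how the racks or devices are chosen.

```python
import random


def place_replicas(racks, rng=None):
    """Choose three placement destinations: two in one rack, one in another.

    `racks` maps a rack name to the list of device names housed in it.
    Random selection is an illustrative assumption, not part of the source.
    """
    rng = rng or random.Random()
    # The rack holding the first two pieces needs at least two devices.
    local_candidates = [r for r, devs in racks.items() if len(devs) >= 2]
    if not local_candidates:
        raise ValueError("no rack holds two or more devices")
    local_rack = rng.choice(local_candidates)
    # The third piece goes to any non-empty rack other than the local one.
    remote_candidates = [r for r, devs in racks.items() if r != local_rack and devs]
    if not remote_candidates:
        raise ValueError("no second rack is available")
    remote_rack = rng.choice(remote_candidates)
    first, second = rng.sample(racks[local_rack], 2)  # two distinct devices
    third = rng.choice(racks[remote_rack])
    return [(local_rack, first), (local_rack, second), (remote_rack, third)]


racks = {"rack-a": ["a1", "a2", "a3"], "rack-b": ["b1", "b2"]}
placement = place_replicas(racks, random.Random(0))
```

Only the third piece crosses the narrower inter-rack link on a write, which is the reason the write and update cost is lower than when three racks are used.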
Also, the technology disclosed in non-patent literature 1 can be applied to such a distributed file system as another technology for determining a placement destination of object data. Non-patent literature 1 relates to a technology for replicating rows of a database. However, when a row in the technology disclosed in non-patent literature 1 is read as object data, the technology is applicable to a distributed file system. A distributed file system to which the technology disclosed in non-patent literature 1 is applied determines the placement destinations of a plurality of pieces of copied object data based on the relevance between pieces of data. Here, “a plurality of pieces of data having relevance with each other” means pieces of data read by an identical application in identical processing. Henceforth, in the present application, the fact that a plurality of pieces of data are accessed by an identical application in identical processing may be described as “a plurality of pieces of data are used simultaneously.” Such a distributed file system places a plurality of pieces of data that are highly likely to be used simultaneously by an identical application in an identical rack.
Specifically, a distributed file system to which the technology disclosed in non-patent literature 1 has been applied expresses the relevance between pieces of object data to be stored as a graph, and performs graph partitioning on that graph. In this graph, a piece of data (or a data set, which is a set of pieces of data) is represented as a node, and relevance between pieces of object data is represented by an edge connecting nodes. Graph partitioning is a known problem of dividing a graph so that the number of nodes in each divided subgraph is as equal as possible and, at the same time, the number of edges crossing between the divided subgraphs is as small as possible. Thus, when the technology disclosed in non-patent literature 1 is applied, a distributed file system can reduce the determination of a placement destination of object data to a graph partitioning problem. Meanwhile, because finding the optimal solution of graph partitioning is NP-hard (Non-deterministic Polynomial time-Hard), a heuristic or an approximation algorithm is generally applied. As a result, such a distributed file system can determine a placement destination of each piece of data so that data transmission volumes are distributed as evenly as possible among racks, and so that relevant data is placed in an identical information processing device or an identical rack.
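As a rough illustration of how a heuristic can be applied to this problem, the following sketch splits a small graph into two halves of equal size and then swaps pairs of nodes while the weight of the crossing edges decreases. This pairwise-swap local search is an assumption chosen for brevity and is not the algorithm of non-patent literature 1; the names `cut_weight` and `bisect`, the edge weights, and the sample data are hypothetical.

```python
from itertools import product


def cut_weight(edges, part):
    """Total weight of the edges whose endpoints lie in different partitions."""
    return sum(w for (u, v), w in edges.items() if part[u] != part[v])


def bisect(nodes, edges):
    """Split `nodes` into two halves, then swap pairs while the cut shrinks.

    Swapping one node from each side keeps the two halves the same size,
    so the balance condition holds throughout the search.
    """
    nodes = sorted(nodes)
    half = len(nodes) // 2
    part = {n: (0 if i < half else 1) for i, n in enumerate(nodes)}
    improved = True
    while improved:
        improved = False
        best = cut_weight(edges, part)
        side0 = [n for n in nodes if part[n] == 0]
        side1 = [n for n in nodes if part[n] == 1]
        for u, v in product(side0, side1):
            part[u], part[v] = 1, 0  # tentatively swap u and v
            if cut_weight(edges, part) < best:
                improved = True  # keep the swap and rescan from the new state
                break
            part[u], part[v] = 0, 1  # undo the swap
    return part


# Objects "a" and "c" are often used together, as are "b" and "d".
edges = {("a", "c"): 5, ("b", "d"): 5, ("a", "b"): 1}
part = bisect(["a", "b", "c", "d"], edges)
```

In this example the search ends with "a" and "c" in one partition and "b" and "d" in the other, so only the weight-1 edge crosses between partitions. A local search of this kind can stop at a local optimum, which is the usual trade-off for avoiding the NP-hard exact problem.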
In this way, a distributed file system to which the technology disclosed in non-patent literature 1 has been applied can confine, to one rack, the data transfer associated with processing that uses a plurality of pieces of object data simultaneously. As a result, such a distributed file system is able to speed up processing that uses a plurality of pieces of object data simultaneously.
In order to express the relevance of a plurality of pieces of copied data as a graph, a distributed file system to which the technology disclosed in non-patent literature 1 has been applied needs, in advance, information representing the relevance between the copied pieces of data. Such a distributed file system acquires the information representing relevance between pieces of data based on the characteristics of external access to pieces of data that have already been placed once. Accordingly, a technology that determines a placement destination of object data in this way is mainly used to change a placement destination appropriately, according to the access characteristics of the data, after the data has been placed once.
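One way to picture how such relevance information might be derived from access characteristics is the following sketch, which counts how often two pieces of data are read in the same processing. The log format (one list of object names per processing run) and the name `build_relevance_graph` are assumptions for illustration; the source does not specify how the access characteristics are recorded.

```python
from collections import Counter
from itertools import combinations


def build_relevance_graph(access_log):
    """Count, for each pair of objects, how many processing runs read both.

    `access_log` is a list in which each entry lists the objects read by one
    application in one processing run (a hypothetical log format).
    """
    edges = Counter()
    for objects in access_log:
        # Sort so that each unordered pair always maps to the same key.
        for u, v in combinations(sorted(set(objects)), 2):
            edges[(u, v)] += 1
    return edges


access_log = [["a", "b"], ["a", "b", "c"], ["c"]]
graph = build_relevance_graph(access_log)
```

Each key of the result is an edge between two pieces of data, and its count serves as the edge weight. Because the counts exist only after accesses have been observed, this approach applies to data that has already been placed once.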
Another technology for acquiring such relevance between pieces of data is disclosed in patent literature 1. In the technology disclosed in patent literature 1, a degree of association between documents is acquired based on citation relations and keyword sharing relations between the documents.