Distributed processing frameworks that use multiple nodes, such as Hadoop, are known as a conventional technology for processing large amounts of data at high speed. Hadoop divides a file, stores the divided files in multiple nodes, and has each node that manages a divided file execute a processing instruction in parallel, thereby executing distributed processing.
The Hadoop distributed file system (HDFS) is used as the base of the Hadoop data structure. The HDFS provides namespaces that are unified among multiple nodes; in practice, however, the data is managed by the local file system in each node. Specifically, the HDFS is a double-structured file system constituted by a user-layer application that manages namespaces and an operating system that manages input and output of physical files.
FIG. 9 is a schematic diagram illustrating an example of the overall configuration of a conventional distributed processing system. The distributed processing system illustrated in FIG. 9 includes three nodes, i.e., a node A, a node B, and a node C. An application that uses a distributed processing framework is executed in each of the nodes. The node A is connected to a local disk A, the node B is connected to a local disk B, and the node C is connected to a local disk C. The node C is the master node that executes name management that manages which file is stored in which local disk.
The distributed processing system described above creates a duplicate of a file and stores both the original file and the duplicate file. For example, in the distributed processing system, if a file C is stored in the node C, a duplicate file C, which is a duplicate of the file C, is also stored in the node B or the node A. In this way, the distributed processing system implements file redundancy.
In the following, a processing example will be specifically described by using, as an example, a reference process that is part of distributed processing. Here, it is assumed that an application A executed by the node A requests the node C to refer to a file A. In such a case, the node C specifies, using the name management, that the file A is stored in both the local disk B and the local disk C. Then, the node C responds to the application A that the local disk B, which is closer to the node A, i.e., the request source, is to be used as the storage destination of the file A. Upon receiving the response, the application A requests the node B, which is connected to the local disk B, to read the file A, and then refers to the file A that is read from the local disk B.
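The reference process described above may be illustrated by the following minimal sketch. The name table, the distance values, and the function name select_storage_destination are hypothetical and are introduced only for illustration; they do not correspond to an actual Hadoop interface. The sketch assumes that the master node chooses the storage destination solely on the basis of the distance from the request source.

```python
# Hypothetical name management table held by the master node (node C):
# file name -> local disks that store the file or its duplicate.
NAME_TABLE = {
    "file A": ["local disk B", "local disk C"],
}

# Hypothetical distances (for example, network hops) from each node to each
# local disk. The values are assumptions chosen only for this illustration.
DISTANCE = {
    "node A": {"local disk A": 0, "local disk B": 1, "local disk C": 2},
    "node B": {"local disk A": 1, "local disk B": 0, "local disk C": 1},
    "node C": {"local disk A": 2, "local disk B": 1, "local disk C": 0},
}


def select_storage_destination(request_source, file_name):
    """Return the local disk closest to the request source among the
    local disks that store the requested file."""
    candidates = NAME_TABLE[file_name]
    return min(candidates, key=lambda disk: DISTANCE[request_source][disk])


# The application A on the node A asks the node C where to read the file A.
destination = select_storage_destination("node A", "file A")
print(destination)
```

Under these assumed distances, the local disk B is selected for a request from the node A. Note that the selection does not consult the state of any cache in the candidate nodes.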
Furthermore, in the distributed processing system, each node manages files in a memory area reserved for an application. When an application makes a request for a file to be referred to, the subject file is read from the memory, which makes it possible to reduce the processing time.
Patent Document 1: Japanese Laid-open Patent Publication No. 2005-234919
Patent Document 2: Japanese Laid-open Patent Publication No. 11-15718
Patent Document 3: Japanese Laid-open Patent Publication No. 07-182220
However, with the distributed processing system described above, there is a problem in that the processing performance is not improved because page caches managed by an OS kernel are not efficiently used and thus input/output to a local disk occurs.
For example, if distributed processing occurs in the distributed processing system illustrated in FIG. 9, the node C selects, from among the local disks that store the file to be processed, the local disk that is the closest to the request source node, and determines that the file stored in that local disk is the file to be processed. Accordingly, there may be a case in which, instead of a node that stores the subject file in a page cache that is managed by the OS kernel, a node that does not store the subject file in a page cache managed by the OS kernel is requested to process the subject file.
In such a case, the node that is requested to process the file reads the file from a local disk whose processing speed is lower than that of the page cache that is managed by the OS kernel. In other words, in terms of the entire distributed processing system, even when a file stored in the page cache managed by the OS kernel is the file to be read, the file may sometimes be read by executing an input/output process of a low-speed disk.
Furthermore, the node that has executed input/output processing of the low-speed disk stores the file read from the local disk in the page cache managed by the OS kernel. At this point, the node caches the read file after discarding other files from the page cache managed by the OS kernel, which reduces the cache hit rate of the other files.
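The effect described above may be illustrated by the following minimal sketch. The class PageCacheModel is hypothetical; it assumes a least-recently-used replacement policy and caching in units of whole files, both of which are simplifications that do not reflect how an actual OS kernel manages page caches.

```python
from collections import OrderedDict


class PageCacheModel:
    """Toy model of a page cache (assumption: LRU policy, whole-file units)."""

    def __init__(self, capacity):
        self.capacity = capacity    # maximum number of files held in the cache
        self.files = OrderedDict()  # ordered from least to most recently used
        self.disk_reads = 0         # number of low-speed disk reads performed

    def read(self, name):
        """Read a file, going to the local disk only on a cache miss."""
        if name in self.files:
            self.files.move_to_end(name)  # mark as most recently used
            return "cache hit"
        self.disk_reads += 1              # input/output of the local disk
        if len(self.files) >= self.capacity:
            self.files.popitem(last=False)  # discard the least recently used file
        self.files[name] = True
        return "disk read"


# The node B already caches two other files, and its cache is full.
node_b = PageCacheModel(capacity=2)
node_b.read("file X")
node_b.read("file Y")

# The node B is asked to process the file A, which it does not cache.
print(node_b.read("file A"))  # read from the local disk; the file X is discarded
print(node_b.read("file X"))  # the file X now has to be read from the disk again
```

In this model, the request for the file A causes one disk read and discards the file X, so that a later request for the file X, which would otherwise have been a cache hit, also results in a disk read.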
As described above, with the conventional distributed processing system, a page cache is not efficiently used, and thus it is difficult to improve the processing performance. Furthermore, a method for improving the throughput by storing data in a memory area reserved for an application is impractical because a large volume of files needs to be stored in the memory.