In recent years, there has been a need for a technique for processing, in a short period of time, large amounts of data (hereafter referred to as “big data”) generated by websites, sensors, mobile terminals and the like.
A distributed processing framework called Hadoop is known as a technique for processing big data. In Hadoop, one file is divided into plural files based on a distributed processing method called MapReduce, and the divided files are processed in parallel at plural nodes. A feature of Hadoop is that processing performance increases as the number of nodes increases. Therefore, by using Hadoop, a large-scale system in which several petabytes or more of data are processed at several tens to several thousands of nodes can be achieved comparatively easily.
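The division of one input into plural pieces that are processed independently and then combined can be illustrated by the following minimal sketch. It is a sequential, single-process simulation of the map, shuffle and reduce stages for a word count, not Hadoop code; the function names map_phase, shuffle, reduce_phase and word_count are hypothetical and are introduced only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map stage: emit a (word, 1) pair for every word in one divided piece."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle stage: group the emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce stage: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks: List[str]) -> Dict[str, int]:
    """Run the three stages one after another (assumption: no real parallelism).

    In an actual cluster each chunk would be mapped on a different node.
    """
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(pairs))
```

Because each call to map_phase depends only on its own chunk, the map stage is the part that could be spread over plural nodes, which is why adding nodes can raise throughput.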
Hadoop employs a distributed file system called HDFS (Hadoop Distributed File System) in order to effectively perform distributed processing. HDFS divides a file into pieces of a specific size, distributes and stores the divided files among plural servers, creates copies of the divided files, and saves the copied files on servers different from the servers where the copy-source files are saved. In distributed processing, either the copy-source files or the copied files are processed. Therefore, by allocating the processing to servers that have capacity for worker processes, the server resources (CPU (Central Processing Unit) resources and the like) can be used effectively.
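The division into fixed-size pieces and the saving of copies on different servers can be sketched as follows. This is a minimal illustration under stated assumptions: the round-robin placement is a simplification chosen for the example and is not the placement policy of HDFS, and the names split_into_blocks and place_blocks are hypothetical.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Divide file contents into blocks of block_size bytes (the last may be shorter)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, servers: List[str], replicas: int = 2) -> Dict[int, List[str]]:
    """Assign every block to `replicas` distinct servers.

    Assumption: simple round-robin placement. The first server in each list
    holds the copy-source block and the remaining servers hold the copies.
    """
    if replicas < 1 or replicas > len(servers):
        raise ValueError("replicas must be between 1 and the number of servers")
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replicas)]
        for block in range(num_blocks)
    }
```

For example, three blocks placed on servers "s1", "s2" and "s3" with two replicas give block 0 on s1 and s2, block 1 on s2 and s3, and block 2 on s3 and s1, so that each block can be processed on either of two servers.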
However, in a distributed file system such as HDFS, the status of the OS (Operating System) cache is not taken into consideration when allocating the processing. Therefore, even when data is in the cache of one server, the distributed processing system may allocate the processing to a different server, and disk I/O (Input/Output) occurs. Disk I/O is processing that takes time; for example, approximately half of the time required for processing several gigabytes of data is taken up by disk seek time.
The processing of big data by a distributed file system will be explained using FIG. 1. File A is stored in a disk of server 1, file B is stored in a disk of server 2, and a copy of the file B and file C are stored in a disk of server 3. Moreover, the file B is loaded into a cache of the server 3. An application program that is arranged on one of the servers of the distributed processing system performs processing in the order of the file A, the file B and the file C. After performing processing of the file A on the server 1, the application program processes the file B on the server 2 or the server 3. When the file B is processed on the server 3, disk I/O does not occur because the cache can be used. However, the server 2 may sometimes be selected, because whether to select the server 2 or the server 3 is determined according to the availability of processes on the server 2 and the server 3. In that case, the time until processing of the file is finished may be greatly delayed because disk I/O occurs.
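The situation of FIG. 1 can be reproduced with the following minimal sketch. The Server class, the free_slots field and the numbers of available processes are assumptions made for illustration; the selection function considers only available processes, as described above, and ignores the cache.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Server:
    name: str
    free_slots: int                                    # available processes (assumed value)
    disk_files: Set[str] = field(default_factory=set)
    cached_files: Set[str] = field(default_factory=set)


def select_by_free_slots(servers: List[Server], file_name: str) -> Server:
    """Pick the server that stores the file and has the most available processes.

    The cache state is deliberately not consulted.
    """
    candidates = [s for s in servers if file_name in s.disk_files and s.free_slots > 0]
    if not candidates:
        raise LookupError(f"no available server stores {file_name!r}")
    return max(candidates, key=lambda s: s.free_slots)


def causes_disk_io(server: Server, file_name: str) -> bool:
    """Disk I/O is needed whenever the file is not in the server's cache."""
    return file_name not in server.cached_files


def fig1_servers() -> List[Server]:
    """Layout of FIG. 1; the free_slots values are hypothetical."""
    return [
        Server("server 1", free_slots=2, disk_files={"A"}),
        Server("server 2", free_slots=3, disk_files={"B"}),
        Server("server 3", free_slots=1, disk_files={"B", "C"}, cached_files={"B"}),
    ]
```

With these assumed values, select_by_free_slots(fig1_servers(), "B") returns the server 2 because it has more available processes, and causes_disk_io reports that disk I/O occurs there, whereas the same check for the server 3 reports that it does not.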
Therefore, from the aspect of completing the processing of a file in a short amount of time, it is favorable to improve a cache hit rate in the distributed processing system.
A technique such as the following is known for distributed processing and cache management. More specifically, a MapReduce processing system divides data to be processed into plural groups based on frequency of updating data, and calculates group update frequency based on the frequency of updating data included in a group. Then, the MapReduce processing system generates partial results of MapReduce processing stages of a group for which the group update frequency is equal to or less than a threshold value, and caches the generated partial results. As a result, the cache is effectively used in the MapReduce processing.
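The threshold decision described above can be sketched as follows. The description does not state how the group update frequency is derived from the update frequencies of the individual data, so the arithmetic mean used here is an assumption, as are the names group_update_frequency and groups_to_cache.

```python
from statistics import mean
from typing import Dict, List


def group_update_frequency(update_freqs: List[float]) -> float:
    """Group update frequency, taken here as the mean of the member frequencies.

    Assumption: the source does not specify the aggregation function.
    """
    return mean(update_freqs)


def groups_to_cache(groups: Dict[str, List[float]], threshold: float) -> List[str]:
    """Return the names of groups whose update frequency is at or below the threshold.

    Partial results for these groups are the ones that would be cached.
    """
    return sorted(
        name for name, freqs in groups.items()
        if group_update_frequency(freqs) <= threshold
    )
```

Groups that are rarely updated are the ones whose partial results stay valid longest, which is why only those groups are selected for caching.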
Moreover, as for using the cache effectively, there is also a technique such as the following. More specifically, when a file is opened, the data of all or part of the areas of the file is multiplexed and the multiplexed data is distributed among plural drives; when the file is closed, the multiplexed data of the file is deleted.
Moreover, there is also a technique for preventing a decrease in throughput due to a concentration of access requests for a certain file server from plural clients. More specifically, a master file server selects a file server having a light load, and allocates file access requests that were transmitted from clients to the selected file server.
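The selection of a lightly loaded file server can be sketched as follows. How the load is measured is not stated above, so the load values are treated as opaque numbers supplied by the caller, and the name select_lightest is hypothetical.

```python
from typing import Dict


def select_lightest(loads: Dict[str, float]) -> str:
    """Return the file server with the smallest load.

    Ties are resolved by server name so that the result is deterministic.
    """
    if not loads:
        raise ValueError("no file servers available")
    return min(sorted(loads), key=loads.__getitem__)
```

Note that this selection, like the one based on available processes, takes no account of which server already holds the requested file in its cache.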
Moreover, there is a technique for providing high-speed file access regardless of the state of the wide area network. More specifically, a cache server has an access log database that records file access, and determines files to be read in advance by analyzing intervals between updates of files having a high access frequency by using the access log database. Then, the cache server reads files having a high access frequency in advance during a time period in which the wide area network is available, and provides the files read in advance in response to a request from a client.
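The analysis of the access log can be sketched as follows. The structure of the access log database is not described above, so the log is assumed to be a plain list of accessed file names and the update history a list of timestamps; the names files_to_prefetch and mean_update_interval and the parameter min_accesses are hypothetical.

```python
from collections import Counter
from typing import List


def files_to_prefetch(access_log: List[str], min_accesses: int) -> List[str]:
    """Return the files accessed at least min_accesses times (high access frequency)."""
    counts = Counter(access_log)
    return sorted(name for name, n in counts.items() if n >= min_accesses)


def mean_update_interval(update_times: List[float]) -> float:
    """Average gap between consecutive updates of one file.

    Assumption: timestamps are numbers in a common unit such as seconds.
    """
    if len(update_times) < 2:
        raise ValueError("at least two update times are required")
    ordered = sorted(update_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)
```

A cache server following this sketch would read each file returned by files_to_prefetch again roughly once per mean_update_interval, during a period in which the wide area network is available.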
Moreover, there is also a technique that improves a cache hit rate and increases the speed of file access by preferentially caching files having a high frequency of use for each user that logs onto a client.
However, even when these techniques are used, it may not always be possible to sufficiently improve the cache hit rate of a distributed processing system. Moreover, the conventional techniques are also insufficient from the aspect of effectively using the cache of a server.
Non-Patent Document 1: DEAN Jeffrey et al., “MapReduce: Simplified Data Processing on Large Clusters”, [online], December 2004, Symposium on Operating System Design and Implementation 2004, [retrieved on Jan. 10, 2013], Retrieved from the Internet: <URL: http://research.google.com/archive/mapreduce.html>
Patent Document 1: Japanese Laid-open Patent Publication No. 2010-92222
Patent Document 2: Japanese Laid-open Patent Publication No. 6-332625
Patent Document 3: Japanese Laid-open Patent Publication No. 6-332782
Patent Document 4: Japanese Laid-open Patent Publication No. 11-24981
Patent Document 5: Japanese Laid-open Patent Publication No. 7-93205