The present invention relates to a method and an apparatus for controlling data using data processing history information or server operating history information.
In recent years, the greater capacity and lower cost of recording media have made it possible to accumulate big data, such as data acquired from sensor devices such as radio frequency identification (RFID) devices or access logs of websites. Companies and organizations therefore attempt to analyze big data that has been accumulated but not utilized up to now, and to use the analysis results in their business. Parallel distributed processing is attracting attention as a technology for analyzing big data in a short time. However, for log data and the like that have not been utilized up to now, no utilization method or analysis method has been clearly established, and trial and error is required. In parallel distributed processing, the processing is divided and the divided processing is allocated to multiple servers and executed in parallel, so multiple servers need to be prepared. As a result, the return on investment of introducing a parallel distributed processing system is unclear in the initial stage, which makes it difficult to introduce such a system to customers.
Therefore, when introducing the parallel distributed processing system, it is conceivable not to prepare new servers but to make effective use of the idle resources of servers used by an existing system, so that the existing system and the parallel distributed processing system coexist.
In parallel distributed processing, the processing target data is divided into blocks of a defined size and the blocks are processed independently and in parallel on multiple servers, so big data can be processed in a short time. When the processing target data is distributed across and stored in the servers that execute the parallel distributed processing, allocating processing of data stored in another server to a task on a given server causes the data to be transferred between servers, which delays processing. Therefore, in consideration of the processing efficiency of the parallel distributed processing, a scheduling method has been disclosed that, at the time of allocating processing, allocates the processing to a task on the server that is closest on the network among the servers storing the processing target data {see Tom White, “Hadoop: The Definitive Guide, First Edition,” O'Reilly Media, January 2010, p. 155 (Non-patent Document 1)}.
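The locality-based allocation described above can be illustrated with a minimal sketch. It is not taken from Non-patent Document 1. The `Server` type, the three-level `network_distance` function (same server, same rack, other rack), and the `pick_block` helper are hypothetical names and assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Server:
    name: str
    rack: str


def network_distance(a: Server, b: Server) -> int:
    """Assumed distance model: 0 = same server, 1 = same rack, 2 = other rack."""
    if a.name == b.name:
        return 0
    if a.rack == b.rack:
        return 1
    return 2


def pick_block(requester: Server, pending: Dict[str, List[Server]]) -> Optional[str]:
    """Return the pending block whose nearest stored copy is closest to the requester.

    `pending` maps a block id to the servers that store that block.
    Returns None when there is nothing left to allocate.
    """
    best_id: Optional[str] = None
    best_dist: Optional[int] = None
    for block_id, holders in pending.items():
        if not holders:
            continue  # no known location for this block; skip it in this sketch
        dist = min(network_distance(requester, h) for h in holders)
        if best_dist is None or dist < best_dist:
            best_id, best_dist = block_id, dist
    return best_id
```

Under these assumptions, a task on a server that stores a pending block is always given that block first, and data is transferred across racks only when no closer copy exists.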
Further, in order to reduce the number of data transfers between servers, a scheduling method has been disclosed that calculates, for each server, the ratio of data whose processing has already been allocated to the data stored in that server and, when processing of data stored in another server is to be allocated, allocates processing of data stored in the server having the smallest ratio {see Japanese Unexamined Patent Application Publication No. 2010-231502 (Patent Document 1)}.
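The ratio-based selection summarized above can be sketched as follows. This is one reading of the single-sentence summary, not the procedure of Patent Document 1 itself. The function name `pick_remote_source`, the data layout, and the assumption that each block is stored in exactly one server are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def pick_remote_source(
    stored: Dict[str, Set[str]],
    allocated: Set[str],
    requester: str,
) -> Optional[Tuple[str, str]]:
    """Choose a remote server and one of its blocks for the requester to process.

    stored:    server name -> ids of the blocks stored in that server
    allocated: ids of blocks whose processing has already been allocated
    Returns (source server, block id), or None if no remote block remains.
    """
    best_server: Optional[str] = None
    best_ratio: Optional[float] = None
    for server, blocks in stored.items():
        if server == requester:
            continue  # only data stored in another server is considered here
        remaining = blocks - allocated
        if not remaining:
            continue  # nothing left to take from this server
        ratio = len(blocks & allocated) / len(blocks)  # allocation-completed ratio
        if best_ratio is None or ratio < best_ratio:
            best_server, best_ratio = server, ratio
    if best_server is None:
        return None
    # Pick the smallest remaining id so that the result is deterministic.
    return best_server, min(stored[best_server] - allocated)
```

Taking data from the server with the smallest ratio leaves servers that are nearly finished with their local data alone, on the assumption that the least-advanced server is the most likely to have data it cannot process locally in time.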
When the existing system and the parallel distributed processing system operate together, the existing system is assumed to have a higher priority than the parallel distributed processing system. The parallel distributed processing system therefore needs to execute processing using only idle resources of each server so as not to interfere with the operation of the existing system. As a result, the execution multiplicity needs to be changed dynamically for each server in response to variations in the load of the existing system, and the amount of data that each server can process per unit time tends to differ. Further, because the execution multiplicity changes dynamically, it is difficult, before the parallel distributed processing is executed, to arrange data in each server according to its processing performance so as to reduce data transfer, and large volumes of data may end up being transferred. In an environment in which the execution multiplicity changes dynamically, a method of allocating data processing to each server so as to reduce the data transfer cost during execution of the parallel distributed processing is important for the efficiency of the parallel distributed processing.
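As a rough illustration of why the processable data amount differs between servers, the sketch below derives an execution multiplicity from the idle cores of a server. The formula, the reserved-core margin, and the names `execution_multiplicity` and `blocks_per_unit_time` are assumptions made for this example and are not taken from the description.

```python
def execution_multiplicity(
    total_cores: int,
    cores_used_by_existing: int,
    reserved_cores: int = 1,
    cores_per_task: int = 1,
) -> int:
    """Number of parallel tasks that fit in the idle cores of one server.

    Assumes the existing system keeps priority: its cores plus a reserved
    margin are never used by the parallel distributed processing.
    """
    idle = total_cores - cores_used_by_existing - reserved_cores
    return max(0, idle // cores_per_task)


def blocks_per_unit_time(multiplicity: int, blocks_per_task: int) -> int:
    """Blocks one server can process per unit time at a given multiplicity."""
    return multiplicity * blocks_per_task
```

For example, with 8 cores per server and one reserved core, a server whose existing system uses 3 cores can run 4 tasks, while a server whose existing system uses 6 cores can run only 1, so the first server can process four times as many blocks per unit time under this model.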