As cloud computing has become widespread, distributed processing systems have come into use in which multiple servers perform distributed processing on the large amounts of data stored in the cloud. Hadoop (registered trademark), which uses the Hadoop Distributed File System (HDFS) and MapReduce as its infrastructure technologies, is known as one such distributed processing system.
The HDFS is a file system that distributes data across a plurality of servers and stores it there. MapReduce is a system that performs distributed processing on the data in the HDFS on what is called a per-task basis, and that performs a Map operation, a shuffle/sort operation, and a Reduce operation. The Map operation and the Reduce operation are typically developed in Java (registered trademark), while the shuffle/sort operation is provided with Hadoop as a standard feature. Typically, Hadoop processes one type of input with MapReduce and produces one type of output.
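The three operations named above can be illustrated with a minimal, single-process sketch. It is not Hadoop code: the function names (`map_op`, `shuffle_sort`, `reduce_op`, `run_job`) and the word-count task are assumptions chosen only to show how one type of input flows through Map, shuffle/sort, and Reduce to one type of output.

```python
from collections import defaultdict


def map_op(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.split():
        yield word, 1


def shuffle_sort(pairs):
    """Shuffle/sort: group values by key and return the groups in key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_op(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def run_job(records):
    """Run Map, shuffle/sort, and Reduce over one type of input."""
    mapped = [pair for record in records for pair in map_op(record)]
    return [reduce_op(key, values) for key, values in shuffle_sort(mapped)]
```

In an actual cluster the Map and Reduce calls would run as separate tasks on different servers; here they run in sequence purely for illustration.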
In recent years, Hadoop has been used to efficiently execute application programs for batch processing or the like that are not provided with Hadoop as standard features (hereafter referred to as an “external program” as appropriate). An external program takes multiple inputs that have different formats as its processing targets and is typically developed in a language other than Java (registered trademark).
For example, Hadoop Streaming, a standard tool of Hadoop, is known as a technology for executing an external program with Hadoop. Hadoop Streaming calls the external program during a Map operation or a Reduce operation. Specifically, during a Map operation or a Reduce operation, the external program is called once for a single task, the data to be processed is passed to it through the standard input, and the operation result is output to the standard output of the external program.
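As a rough illustration of this calling convention, the sketch below shows the shape of a program that could be called during a Map operation: it reads records from the standard input and writes tab-separated key/value lines to the standard output. The record layout (a comma-separated line whose first field is the key) and the names `streaming_mapper` and `run_on_text` are assumptions made for the example.

```python
import io


def streaming_mapper(stdin, stdout):
    """Read records from standard input; write key<TAB>value lines to standard output."""
    for line in stdin:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank records
        # Assumed layout: the first comma-separated field is the key.
        key, _, rest = line.partition(",")
        stdout.write(f"{key}\t{rest}\n")


def run_on_text(text):
    """Helper for trying the mapper without a cluster, using in-memory streams."""
    out = io.StringIO()
    streaming_mapper(io.StringIO(text), out)
    return out.getvalue()
```

When run as a real streaming task, the same function would be called with `sys.stdin` and `sys.stdout` in place of the in-memory streams.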
Furthermore, a reduce-side join is known as a Hadoop-related technology for processing multiple types of inputs. For example, when an external program for a join operation is to be executed, the input file name, the class for processing the input format, and the class for performing the Map operation are defined for each type of input. The Map operation is then performed and outputs data in which a join key is associated with the tuple that is to be joined. Next, during the shuffle/sort operation, the data sets are sorted by join key and grouped by join key for output. Afterward, a Reduce class is defined for each type of input so that the Reduce operation is performed, and the join operation is then performed.
[Non-patent Document 1] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Proceedings of the 6th Symposium on Operating Systems Design and Implementation, pp. 137-150, Dec. 6, 2004.
[Non-patent Document 2] Apache Hadoop 1.0.3 documentation, “Hadoop Streaming”, URL “http://hadoop.apache.org/docs/r1.0.3/streaming.html”
[Non-patent Document 3] Tom White, Hadoop 2nd Edition, 8.3.2 reduce-side join, pp. 269-272, O'Reilly Japan, issued July 2011.
Because many constraints apply when an external program that takes multiple inputs as its processing targets is executed in a distributed processing system such as Hadoop, executing the external program in the distributed processing system is difficult in practice.
Specifically, with Hadoop Streaming, the data to be processed is passed through the standard input and output; therefore, it is difficult to call and execute an external program that receives the data to be processed through arguments or environment variables.
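The mismatch can be sketched as follows. The program below is a hypothetical stand-in for an existing batch program: it locates its two inputs through a command-line argument and an environment variable (the name `CUSTOMERS_FILE` is invented for the example) and never reads the standard input. When it is called with records on the standard input only, it has nothing to work with.

```python
import io


def external_program(argv, environ, stdin):
    """Hypothetical legacy batch program.

    It finds its inputs through argv[1] and the CUSTOMERS_FILE environment
    variable and ignores whatever arrives on standard input.
    """
    orders_path = argv[1] if len(argv) > 1 else None
    customers_path = environ.get("CUSTOMERS_FILE")
    if orders_path is None or customers_path is None:
        return "error: input locations not supplied"
    return f"joining {orders_path} with {customers_path}"


def call_like_streaming(records):
    """Call the program with records on standard input only."""
    return external_program(["batch"], {}, io.StringIO(records))
```

Calling `call_like_streaming("1,a\n")` returns the error string, whereas passing `["batch", "orders.csv"]` and `{"CUSTOMERS_FILE": "customers.csv"}` directly lets the program proceed.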
As for a reduce-side join, a program that performs the same operation as the existing external program has to be redeveloped and ported to Hadoop. Because of the risk involved in redevelopment and porting, it is difficult to port external programs on a frequent basis, which lowers the ease of development.
For example, when a reduce-side join is used, a Map operation class that is to be processed during the Map operation is implemented for each input to be processed by the external program, and a Reduce operation class that is to be processed during the Reduce operation is implemented for each input. Furthermore, when these classes are implemented, the data is defined in a key-value store (KVS) format, which differs from the format of the relational database (RDB) files, or of the comma-separated values (CSV) files obtained by unloading RDB files, that the external program processes. Moreover, if the program was developed in a language other than Java, it is redeveloped in Java for porting so that it can be called during the Reduce operation of the reduce-side join.
That is, when the reduce-side join is used, the Map operation and the Reduce operation are performed as if the multiple types of inputs were one type of input, and the ported external program is called and executed during the Reduce operation.
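The reduce-side join described above can be sketched in a single process as follows. This is an illustration under stated assumptions, not Hadoop code: the two inputs (`orders` and `customers`), their CSV layouts, and all function names are hypothetical. Each input gets its own Map function that converts a CSV row into a key/value pair tagged with its source, the shuffle/sort step groups the tagged tuples by join key, and the Reduce step performs the join.

```python
from collections import defaultdict


def map_orders(line):
    """Map for the assumed 'orders' input: CSV 'order_id,customer_id,amount'."""
    order_id, customer_id, amount = line.split(",")
    return customer_id, ("orders", (order_id, amount))


def map_customers(line):
    """Map for the assumed 'customers' input: CSV 'customer_id,name'."""
    customer_id, name = line.split(",")
    return customer_id, ("customers", (name,))


def shuffle_sort(pairs):
    """Group the tagged tuples by join key and return the groups in key order."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return sorted(groups.items())


def reduce_join(key, tagged_tuples):
    """Reduce: pair every customer tuple with every order tuple for one key."""
    customers = [t for tag, t in tagged_tuples if tag == "customers"]
    orders = [t for tag, t in tagged_tuples if tag == "orders"]
    return [(key,) + c + o for c in customers for o in orders]


def reduce_side_join(order_lines, customer_lines):
    """Treat both inputs as one stream of (join key, tagged tuple) pairs."""
    pairs = [map_orders(line) for line in order_lines]
    pairs += [map_customers(line) for line in customer_lines]
    joined = []
    for key, tagged in shuffle_sort(pairs):
        joined.extend(reduce_join(key, tagged))
    return joined
```

Even in this two-input toy case, one Map function per input and a tag-aware Reduce function have to be written by hand, which hints at the effort involved when each input has hundreds of columns.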
However, the Map class and the Reduce class are implemented manually, without automated support, for complicated data that includes hundreds of columns to be processed by the external program, which increases both the time required and the risk of human error. Furthermore, porting an external program carries a high risk and is not a desirable approach. This is because the external program implements complicated operational logic and has a substantial track record in actual use, built up through repeated testing; therefore, porting the external program is not easy.
As described above, when an external program is executed by using Hadoop Streaming or a reduce-side join, the operating time increases, human error increases, and porting the external program carries risk; therefore, it is difficult to execute the external program with Hadoop, and the ease of development with Hadoop is reduced.