In recent years, map-reduce type distributed processing systems have become known as systems for processing large-volume data, such as web data.
The map-reduce type distributed processing system divides data on the distributed processing system into units called data blocks, and sequentially applies map processing and reduce processing to the data blocks.
According to such a map-reduce type distributed processing system, a series of calculation processes for each data block can be distributed to a plurality of calculation nodes and be executed simultaneously.
Hadoop (registered trademark) is an open source software (OSS) framework for efficiently performing distributed processing and management of large-volume data, and is mainly used for analysis processing. When Hadoop is applied to batch processing in a mission-critical system, data is distributed to and processed by a plurality of machines, thereby accelerating large-scale batch processing for which a shorter processing time is required.
In Hadoop, a master node assigns tasks to a plurality of slave nodes respectively and the slave nodes perform map tasks (Map task) assigned by the master node.
In Hadoop, a file is divided into blocks having a certain size, and each block is processed in a map task.
FIG. 15 is a view illustrating an operation overview of a map-reduce framework of Hadoop. In an example illustrated in FIG. 15, a file of 196 MB managed by a Hadoop distributed file system (HDFS) is divided into three blocks having a data size of 64 MB and the three blocks are processed in parallel in three map tasks. Data output from the map tasks pass through shuffle & sort and reduce tasks, are output as sorted result files, and are returned to HDFS.
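The flow of map tasks, shuffle & sort, and reduce tasks described above can be sketched in a few lines. The following is a minimal single-process illustration, not Hadoop code: the function names (map_task, shuffle_and_sort, reduce_task, run_job) and the word-count workload are assumptions chosen only for illustration, and the map tasks that would run in parallel on separate nodes are simply called one after another here.

```python
from collections import defaultdict


def map_task(block):
    # Map: emit a (key, 1) pair for every whitespace-separated word in one block.
    return [(word, 1) for word in block.split()]


def shuffle_and_sort(mapped_outputs):
    # Shuffle & sort: gather the values of each key across all map outputs,
    # then order the groups by key.
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())


def reduce_task(key, values):
    # Reduce: collapse all values collected for one key into a single result.
    return key, sum(values)


def run_job(blocks):
    # In a real cluster each map task would run on a different node.
    mapped = [map_task(block) for block in blocks]
    return [reduce_task(key, values) for key, values in shuffle_and_sort(mapped)]


result = run_job(["orange apple", "apple banana", "orange orange"])
```

Each string passed to run_job stands in for one block, so the three blocks correspond to three map tasks, and the returned list is ordered by key as a sorted result file would be.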
In such a distributed processing system, it is important that, although a file is divided, the data itself is not divided. For example, data called “orange” to be transferred to the map task should not be divided into “oran” and “ge”. A case in which data that should be treated as a single unit is divided in this way is referred to as data separation.
For this reason, after a file is divided into blocks of 64 MB, it is necessary to adjust the dividing position of the data. By default, Hadoop uses a line-feed code for data dividing, and dividing is performed at the position of the line-feed code, thereby preventing unintended data separation. The processing of adjusting the dividing can be customized, for example, by using an arbitrary character for dividing.
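The adjustment of a dividing position to a line-feed code can be sketched as follows. This is a simplified in-memory illustration and not Hadoop's actual record reader: the function names adjusted_split_starts and split_records are hypothetical, and the whole file is assumed to fit in one bytes object. Each raw block boundary is moved forward to the first position that directly follows a delimiter.

```python
def adjusted_split_starts(data, block_size, delimiter=b"\n"):
    """Return the start offset of each split after moving every raw block
    boundary forward to the first position that follows a delimiter."""
    starts = [0]
    boundary = block_size
    while boundary < len(data):
        # Search from just before the boundary so that a delimiter ending
        # exactly at the boundary leaves the boundary where it is.
        pos = data.find(delimiter, max(boundary - len(delimiter), 0))
        if pos == -1:
            break  # no further delimiter: the remainder belongs to the last split
        start = pos + len(delimiter)
        # Skip boundaries that land in a record already assigned to a split,
        # and boundaries whose adjusted position is the end of the data.
        if starts[-1] < start < len(data):
            starts.append(start)
        boundary += block_size
    return starts


def split_records(data, block_size, delimiter=b"\n"):
    starts = adjusted_split_starts(data, block_size, delimiter)
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


splits = split_records(b"orange\napple\nbanana\n", block_size=4)
```

With a block size of 4 bytes, the raw boundary would cut "orange" into "oran" and "ge"; after adjustment, every split begins at the head of a line and no record is separated.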
FIG. 16 is a view illustrating a format of a variable-length record sequential file of NetCOBOL.
As illustrated in FIG. 16, the variable-length record sequential file of NetCOBOL is configured by successively connecting a plurality of variable-length records each having record length information of 4 bytes before and after data. Further, the same value is stored in the record lengths arranged before and after the data.
In addition, a user does not need to set or refer to the record length information; the COBOL runtime system sets and refers to it.
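The record layout described above (a 4-byte record length, the data, and the same 4-byte record length again) can be read sequentially as sketched below. The byte order and the exact meaning of the length value are not stated in this description, so the sketch assumes a little-endian unsigned integer that counts only the data bytes; pack_record and iter_records are hypothetical helper names and not part of NetCOBOL.

```python
import struct

# Assumption: the 4-byte length is an unsigned little-endian integer
# that holds the number of data bytes (the length fields are not counted).
LEN_FIELD = struct.Struct("<I")


def pack_record(payload):
    # One record = length + data + the same length again.
    length = LEN_FIELD.pack(len(payload))
    return length + payload + length


def iter_records(buf):
    """Yield (head_offset, data) for every record stored in buf."""
    offset = 0
    while offset < len(buf):
        (length,) = LEN_FIELD.unpack_from(buf, offset)
        data_start = offset + LEN_FIELD.size
        data_end = data_start + length
        (trailer,) = LEN_FIELD.unpack_from(buf, data_end)
        if trailer != length:
            raise ValueError(f"length mismatch in record at offset {offset}")
        yield offset, buf[data_start:data_end]
        offset = data_end + LEN_FIELD.size


buf = pack_record(b"orange") + pack_record(b"\n\x00ge")
records = list(iter_records(buf))
```

Because each head position is obtained only by adding up the lengths of all preceding records, a reader that starts at an arbitrary byte offset cannot tell where the next record begins.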
FIG. 17 is a view illustrating a record image of the variable-length record sequential file of NetCOBOL.
In an example illustrated in FIG. 17, although each variable-length record is shown as beginning on a new line, in practice the plurality of variable-length records are stored contiguously.
In the case of using the variable-length record sequential file of NetCOBOL in Hadoop, it is difficult to adjust the dividing of data after the file is divided in units of blocks. The reason is that the data can contain binary values arbitrarily set by a user. When a line-feed code or an arbitrary character is used for dividing, the data may contain a value identical to the line-feed code or dividing character, which makes it difficult to specify the dividing position of the data.
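The difficulty can be reproduced with a small example. The sketch below assumes the same hypothetical layout as the description of FIG. 16, with a little-endian 4-byte length (the byte order is an assumption); pack_record is an illustrative helper. Under that assumption, the line-feed value 0x0A appears both inside user data and inside the length field of a 10-byte record.

```python
import struct


def pack_record(payload):
    # Assumed layout: 4-byte little-endian length, data, same length again.
    length = struct.pack("<I", len(payload))
    return length + payload + length


payloads = [b"0123456789", b"ab\ncd"]  # the second payload contains 0x0A as data
stream = b"".join(pack_record(p) for p in payloads)

# Naive delimiter-based dividing, as would be done for a text file.
pieces = stream.split(b"\n")
line_feed_bytes = stream.count(b"\n")
```

The stream holds two records, yet it contains three line-feed bytes and is cut into four pieces, none of which is aligned with a record head.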
FIG. 18 is a view illustrating a record image in a case where a variable-length record sequential file of NetCOBOL is divided into two portions.
In an example illustrated in FIG. 18, when a file is divided in the unit of a block size of 64 MB, data having a data length of 105 bytes is illustrated as being divided in the middle thereof.
Therefore, in NetCOBOL, in order to calculate the position of a record exactly and adjust the data dividing position, the following method has been used: an information file is generated in advance that holds the distance (byte length) from each dividing position determined by the block size to the head position of the subsequent variable-length record, and the information file is referred to when the dividing adjustment is processed.
FIG. 19 is a view illustrating a variable-length record sequential file of NetCOBOL and an information file.
The information file is generated by reading the variable-length record sequential file, which is an input file, adding up the record lengths, and retaining the distance from each position at which dividing is performed by the block size to the subsequent record head position.
In the example illustrated in FIG. 19, when a file is divided at a position of 64 MB from the head, the information file indicates that the data length from the dividing position to the head of the subsequent variable-length record (whose data length is 20 bytes) is 55 bytes. Further, the block size also needs to be designated as a parameter.
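The generation of such an information file can be sketched as a single sequential pass over the input. This is an illustration under assumptions, not the NetCOBOL implementation: the length field is assumed to be little-endian and to count only data bytes, the result is kept as a list of (dividing position, distance) pairs rather than written to a file, and pack_record and build_boundary_info are hypothetical names.

```python
import struct

LEN_SIZE = 4  # size of each record length field, in bytes


def pack_record(payload):
    # Assumed layout: 4-byte little-endian length, data, same length again.
    length = struct.pack("<I", len(payload))
    return length + payload + length


def build_boundary_info(buf, block_size):
    """Return (dividing_position, distance_to_next_record_head) pairs."""
    info = []
    offset = 0                  # head position of the current record
    next_boundary = block_size  # next dividing position given by the block size
    while offset < len(buf):
        # Every boundary at or before this record head is resolved by it.
        while next_boundary <= offset:
            info.append((next_boundary, offset - next_boundary))
            next_boundary += block_size
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += LEN_SIZE + length + LEN_SIZE  # add up the record lengths
    # A boundary inside the final record has no following record head;
    # the distance to the end of the data is stored instead.
    while next_boundary < len(buf):
        info.append((next_boundary, len(buf) - next_boundary))
        next_boundary += block_size
    return info


buf = b"".join(pack_record(bytes(n)) for n in (10, 20, 5))
info = build_boundary_info(buf, block_size=16)
```

The block size is an input to the scan, which is why it has to be designated as a parameter, and every record in the file is visited once regardless of how few dividing positions there are.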
[Patent Literature 1] Japanese Laid-open Patent Publication No. 2012-118669
[Patent Literature 2] Japanese National Publication of International Patent Application No. 10-500793
[Patent Literature 3] Japanese Laid-open Patent Publication No. 03-62137
However, in order to generate the above-described information file in an existing distributed processing system, the dividing positions are calculated by sequentially reading and adding up the record lengths of the variable-length records in the entire processing target file from the beginning thereof. Consequently, for a large file whose data size is tens to hundreds of GB and which contains millions of records, generating the information file takes a long time. For example, in some cases, generating the information file for a file having a data size of 80 GB takes 15 minutes.
Therefore, although Hadoop is introduced in order to reduce processing time, additional time is required to generate the information file. As a result, in terms of the overall processing time, the time-shortening effect of Hadoop is diminished.