1. Field of the Invention
The present invention relates to a method and an apparatus for big size file blocking for distributed processing. More particularly, the present invention relates to a method and an apparatus for big size file blocking in which the working nodes that distributively process the blocks of a big size file can complete processing of the big size file substantially simultaneously.
2. Description of the Related Art
Provided is a technology that distributively processes big size data. Further, provided is a technology that distributively stores a file including the big size data across different computing devices.
For example, in the Hadoop distributed file system (HDFS), which is the distributed file system of Hadoop, a widely known big size data processing platform, big size data is divided into blocks that are distributively stored in clustered data nodes. Further, meta information on each distributively stored block is stored in a name node. The meta information may include, for example, storage position information of each block. When the blocks are formed with a fixed size, the number of blocks increases as the data size increases, and as a result, the size of the meta information also increases. Table 1 given below shows, according to file size, the increase in the size of the meta information that accompanies distributive storage of files.
TABLE 1

File size    The number of blocks    Meta information size (unit: MB)
             (64 MB per block)       (150K bytes per block)
1 GB         16                      0.00
10 GB        160                     0.02
100 GB       1,600                   0.23
1 TB         16,384                  2.34
10 TB        163,840                 23.44
100 TB       1,638,400               234.38
1 PB         16,777,216              2,400.00
As shown in Table 1 given above, a file having a size of 1 PB, for example, is divided into 16,777,216 blocks to be stored, and as a result, the meta information alone reaches 2,400 MB (approximately 2.34 GB). Since the meta information needs to be accessed frequently, it needs to be loaded onto a memory. Keeping approximately 2.34 GB of such data loaded on the memory while operating on it is a significant burden. In addition to the burden of operating the meta information that results from the distributive storage, the distributive processing generates a further burden in creating and operating a processing task for each block, because a task processing history needs to be managed.
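The arithmetic behind Table 1 can be illustrated with a minimal sketch. It is not part of the described method. It assumes binary units (1 GB = 1,024 MB, and so on) and 150 bytes of meta information per block, since those values reproduce the tabulated figures; Table 1 labels that column as 150K bytes per block. The names block_count and meta_size_mb are hypothetical and introduced only for this illustration.

```python
MIB = 1024 ** 2                 # bytes in 1 MB (binary units, assumed)
BLOCK_SIZE = 64 * MIB           # fixed block size used in Table 1
META_BYTES_PER_BLOCK = 150      # assumption: value that reproduces Table 1


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of fixed-size blocks needed to hold the file (ceiling division)."""
    return -(-file_size_bytes // block_size)


def meta_size_mb(file_size_bytes, block_size=BLOCK_SIZE):
    """Total meta information size in MB for one file of the given size."""
    return block_count(file_size_bytes, block_size) * META_BYTES_PER_BLOCK / MIB


# File sizes listed in Table 1, expressed in bytes.
FILE_SIZES = {
    "1 GB": 1024 ** 3,
    "10 GB": 10 * 1024 ** 3,
    "100 GB": 100 * 1024 ** 3,
    "1 TB": 1024 ** 4,
    "10 TB": 10 * 1024 ** 4,
    "100 TB": 100 * 1024 ** 4,
    "1 PB": 1024 ** 5,
}

for label, size in FILE_SIZES.items():
    print(f"{label:>7}  {block_count(size):>12,}  {meta_size_mb(size):>10,.2f}")
```

Under these assumptions, both the number of blocks and the meta information size grow linearly with the file size, and doubling the block size halves both.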
Even so, the block size cannot be increased arbitrarily, because a larger block size yields fewer blocks to distribute among the working nodes, and the work distribution effect of the distributive processing deteriorates.
Therefore, an efficient file blocking method that can suppress the growth in the number of blocks as the file size increases, and a method of managing distributive processing of a big size file using the file blocking method, are required. In the specification, blocking indicates dividing a file into blocks.