The present invention relates to file storage, and more specifically, to a method and system for processing file storage in HDFS.
The Hadoop Distributed File System (HDFS) is a widely used distributed file system designed to be highly fault tolerant, to be deployed on inexpensive hardware, to store large-scale data sets, and to stream those data sets at high bandwidth to user applications.
An HDFS cluster includes a NameNode and a configurable number of DataNodes. The NameNode is a central server responsible for managing the namespace, which is the hierarchy of files and directories that clients access. A DataNode is a server responsible for managing the storage of the node on which it resides.
Within HDFS, a file is split into one or more blocks, which are stored in a set of DataNodes. The NameNode performs file and directory operations in the file namespace, such as open, close, and rename, and also determines the mapping between blocks and DataNodes. The DataNode is responsible for serving read and write requests from clients of the file system, for creating and deleting blocks, for carrying out block copy instructions from the NameNode, and the like.
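The division of responsibilities described above can be illustrated with a minimal sketch. It is not the HDFS implementation: the class `ToyNameNode`, the function `split_into_blocks`, the 128 MiB block size, and the round-robin placement of one copy per block are all assumptions introduced here for illustration only.

```python
from itertools import cycle

# Assumed block size for illustration; the text does not state one.
BLOCK_SIZE = 128 * 1024 * 1024


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks that a file of file_size bytes occupies."""
    if file_size <= 0 or block_size <= 0:
        raise ValueError("file_size and block_size must be positive")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


class ToyNameNode:
    """Hypothetical stand-in that keeps the two mappings named in the text."""

    def __init__(self, datanodes):
        if not datanodes:
            raise ValueError("at least one DataNode is required")
        self.files = {}            # path -> list of block ids (the namespace)
        self.block_locations = {}  # block id -> name of the DataNode holding it
        self._next_block_id = 0
        # Round-robin placement is an assumption, not the HDFS placement policy.
        self._placement = cycle(datanodes)

    def create(self, path, file_size):
        """Register a file, assign each of its blocks to a DataNode."""
        block_ids = []
        for _ in split_into_blocks(file_size):
            block_id = self._next_block_id
            self._next_block_id += 1
            self.block_locations[block_id] = next(self._placement)
            block_ids.append(block_id)
        self.files[path] = block_ids
        return block_ids

    def locate(self, path):
        """Return (block id, DataNode) pairs a client would read from."""
        return [(b, self.block_locations[b]) for b in self.files[path]]
```

In this sketch, a client that wants to read a file first asks the NameNode stand-in for the block locations via `locate` and would then contact each listed DataNode directly for the block data.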
In HDFS, each block, file, and directory is held in the memory of the NameNode in the form of an object, and each object takes about 150 bytes. If there are 10,000,000 small files and each file occupies one block, approximately 2 gigabytes (GB) of memory is required by the NameNode. If 100 million files are stored, approximately 20 GB of memory is required by the NameNode. Thus, the memory capacity of the NameNode seriously restricts expansion of the cluster.
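The scaling of NameNode memory with file count can be estimated with a short calculation. The following is a rough sketch only: the function name `namenode_memory_bytes` and its parameters are hypothetical, and it assumes that every file contributes one file object plus one object per block, each of about 150 bytes as stated above. Under that particular counting the estimate comes to roughly 3 GB for 10 million single-block files, which is of the same order as the figure given above; a different way of counting objects yields a different constant.

```python
BYTES_PER_OBJECT = 150  # approximate per-object size stated in the text


def namenode_memory_bytes(num_files, blocks_per_file=1, num_directories=0,
                          bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode memory as (number of objects) x (bytes per object).

    Assumption: each file contributes one file object and one object per
    block; each directory contributes one object.
    """
    num_objects = num_files + num_files * blocks_per_file + num_directories
    return num_objects * bytes_per_object


for files in (10_000_000, 100_000_000):
    estimate_gb = namenode_memory_bytes(files) / 1e9
    print(f"{files:,} small files: about {estimate_gb:.0f} GB")
```

The estimate grows linearly with the number of files regardless of how small each file is, which is why a large population of small files exhausts NameNode memory long before DataNode storage is used up.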
In addition, HDFS was originally developed for streamed access to large files. Processing a large number of small files is much slower than processing a single large file of the same total size. When a large number of small files are accessed, the client must constantly jump from one DataNode to another, which seriously affects performance. Moreover, starting a task consumes a large amount of time, as do terminating a task and switching between tasks. Therefore, it is desirable to provide a solution capable of enhancing the capability of an HDFS to process small files.