Distributed file systems offer many compelling advantages in establishing high-performance computing environments. One example is the ability to expand easily, even at large scale. The Hadoop Distributed File System (“HDFS”) is a distributed file system, designed to run on commodity hardware, that stores data across a number of DataNodes. Not only is data stored across a number of DataNodes, but individual files or objects are also broken down into data blocks that can be stored and/or mirrored on different DataNodes. It can be appreciated that by replicating data across a number of DataNodes, the HDFS is more tolerant of hardware failure.
HDFS is designed under a master/worker architecture. Each HDFS cluster consists of a single NameNode that acts as a master server, managing the file system namespace and regulating access to files by clients. A plurality of DataNodes, usually configured one per node, operate as workers to the NameNode and manage the storage attached to the nodes on which they run. Within the HDFS cluster, files are split into one or more blocks, and these blocks are stored in the set of DataNodes. The NameNode controls namespace operations such as opening, closing, and renaming files and directories, and it determines the mapping of blocks to DataNodes. The DataNodes then operate to serve read and write requests made by the clients of the HDFS. DataNodes also perform block creation, deletion, and replication based on instructions received from the NameNode.
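The division of labor described above, in which the NameNode holds only metadata while blocks are spread and replicated across DataNodes, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `ToyNameNode`, `allocate`, and `readable_after_failure` are hypothetical, the 4-byte block size and replication factor of 2 are chosen only to keep the example small, and the round-robin placement is a stand-in rather than the placement policy HDFS actually uses.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Hypothetical metadata-only master: tracks which blocks live where."""

    datanodes: list[str]
    block_size: int = 4      # bytes; tiny on purpose (assumed value)
    replication: int = 2     # copies kept of each block (assumed value)
    block_map: dict[str, list[tuple[int, list[str]]]] = field(default_factory=dict)
    _next: int = 0           # round-robin cursor (illustrative policy only)

    def __post_init__(self) -> None:
        if self.replication > len(self.datanodes):
            raise ValueError("replication factor exceeds number of DataNodes")

    def allocate(self, filename: str, size: int) -> list[tuple[int, list[str]]]:
        """Record the file and map each of its blocks to distinct DataNodes."""
        n_blocks = max(1, -(-size // self.block_size))  # ceiling division
        placement = []
        for block_id in range(n_blocks):
            targets = [
                self.datanodes[(self._next + i) % len(self.datanodes)]
                for i in range(self.replication)
            ]
            self._next = (self._next + 1) % len(self.datanodes)
            placement.append((block_id, targets))
        self.block_map[filename] = placement
        return placement


def readable_after_failure(
    placement: list[tuple[int, list[str]]], failed: set[str]
) -> bool:
    """True if every block still has at least one replica on a live DataNode."""
    return all(any(dn not in failed for dn in targets) for _, targets in placement)
```

With three DataNodes and two replicas per block, a 10-byte file occupies three blocks and stays readable after any single DataNode fails, which is the fault-tolerance property the replication is meant to provide.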
In processing reads or writes, an HDFS client first makes a call to the NameNode to determine how to proceed. For example, in the context of a write, an HDFS client, in some implementations, can cache the write data locally on the client in a temporary file. When the temporary file accumulates data over a certain threshold, the client will contact the NameNode with the request to write data to the HDFS. The NameNode can then insert the file name into the file system and allocate data blocks in DataNodes. The NameNode then responds to the client with the identity of the DataNode(s) and the destination data block address(es) where the write data will be stored in the HDFS. Similarly, for read requests, an HDFS client will first contact the NameNode to determine the DataNode(s) and associated block addresses where the data necessary to transact the read request is stored. The client will then contact the DataNodes and request the data from the associated block addresses. In both instances, an HDFS client first contacts the NameNode with an overview of its request and then waits for the NameNode to respond with the relevant information needed to continue processing the request.
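The two-step request flow described above, in which the client first consults the NameNode and then exchanges data directly with the DataNodes, can be sketched as follows. This is an in-memory illustration under stated assumptions, not HDFS client code: the classes `ToyDataNode`, `ToyNameNode`, and `ToyClient`, their method names, the `BLOCK_SIZE` constant, and the round-robin allocation are all hypothetical, and the client-side temporary-file caching step is omitted for brevity.

```python
from __future__ import annotations

BLOCK_SIZE = 4  # bytes per block; tiny, assumed value for illustration


class ToyDataNode:
    """Hypothetical worker: stores block contents keyed by block address."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: dict[int, bytes] = {}

    def write_block(self, addr: int, data: bytes) -> None:
        self.blocks[addr] = data

    def read_block(self, addr: int) -> bytes:
        return self.blocks[addr]


class ToyNameNode:
    """Hypothetical master: holds metadata only and never sees file contents."""

    def __init__(self, datanode_names: list[str]) -> None:
        self.datanode_names = datanode_names
        self.namespace: dict[str, list[tuple[str, int]]] = {}
        self._next_addr = 0

    def request_write(self, filename: str, size: int) -> list[tuple[str, int]]:
        """Insert the file name and allocate (DataNode, block address) pairs."""
        n_blocks = max(1, -(-size // BLOCK_SIZE))  # ceiling division
        locations = []
        for i in range(n_blocks):
            dn = self.datanode_names[i % len(self.datanode_names)]
            locations.append((dn, self._next_addr))
            self._next_addr += 1
        self.namespace[filename] = locations
        return locations

    def request_read(self, filename: str) -> list[tuple[str, int]]:
        """Return, in order, where each block of the file is stored."""
        return self.namespace[filename]


class ToyClient:
    def __init__(self, namenode: ToyNameNode, datanodes: list[ToyDataNode]) -> None:
        self.namenode = namenode
        self.datanodes = {dn.name: dn for dn in datanodes}

    def write(self, filename: str, data: bytes) -> None:
        # Step 1: ask the NameNode where the data should be stored.
        locations = self.namenode.request_write(filename, len(data))
        # Step 2: send each block directly to the DataNode it was assigned to.
        for i, (dn_name, addr) in enumerate(locations):
            chunk = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
            self.datanodes[dn_name].write_block(addr, chunk)

    def read(self, filename: str) -> bytes:
        # Step 1: ask the NameNode where the blocks are stored.
        locations = self.namenode.request_read(filename)
        # Step 2: fetch each block directly from its DataNode and reassemble.
        return b"".join(
            self.datanodes[dn_name].read_block(addr) for dn_name, addr in locations
        )
```

In this sketch the NameNode is on the path of every request but never handles file contents, so each read or write costs one metadata round trip before any data moves.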