Distributed file systems offer many compelling advantages in establishing high performance computing environments. One example is the ability to easily expand, even at large scale. The Hadoop Distributed File System (“HDFS”) is a distributed file system, designed to run on commodity hardware, that stores data across a number of DataNodes. Not only is data stored across a number of DataNodes, but individual files or objects are also broken down into data blocks that can be stored and/or mirrored on different DataNodes. It can be appreciated that by storing data across a number of DataNodes, the HDFS is more tolerant to hardware failure.
HDFS is designed under a master/slave architecture. Each HDFS cluster consists of a single NameNode that acts as a master server, managing the file system namespace and regulating access to files by clients. A plurality of DataNodes, usually configured one per node, operate as slaves to the NameNode and manage the storage attached to each DataNode. Within the HDFS cluster, files are split into one or more blocks, and these blocks are stored across the set of DataNodes. The NameNode controls operations like opening files, closing files, renaming files and directories, and mapping blocks to DataNodes. The DataNodes then operate to serve read and write requests made by the clients of the HDFS. DataNodes also perform block creation, deletion, and replication based on instructions received from the NameNode.
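The splitting of a file into blocks and the mapping of those blocks to DataNodes can be illustrated with a minimal sketch. It is not HDFS code: the function names `split_into_blocks` and `place_blocks`, the tiny block size, and the round-robin placement are all assumptions made for illustration, and the placement shown is not HDFS's actual placement policy.

```python
# Toy illustration only: names, block size, and placement rule are hypothetical.

BLOCK_SIZE = 4  # bytes; unrealistically small so the example is easy to follow


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Break a file's contents into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, datanodes: list[str],
                 replication: int = 2) -> dict[int, list[str]]:
    """Map each block index to the DataNodes holding its replicas.

    Uses simple round-robin placement (an assumption for this sketch), so
    consecutive blocks start on consecutive DataNodes.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    n = len(datanodes)
    return {
        block: [datanodes[(block + r) % n] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(b"abcdefghij")
block_map = place_blocks(len(blocks), ["dn1", "dn2", "dn3"])
```

Because every block in this sketch is mirrored on two different DataNodes, the loss of any single DataNode leaves at least one copy of each block available, which is the fault-tolerance property described above.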
In processing reads or writes, an HDFS client first makes a call to the NameNode to determine how to proceed. For example, in the context of a write, an HDFS client will cache the write data locally on the client in a temporary file. When the temporary file accumulates data over a certain threshold, the client will contact the NameNode with the request to write data to the HDFS. The NameNode can then insert the file name into the file system and allocate data blocks in DataNodes. The NameNode then responds to the client with the identity of the DataNode(s) and the destination data block address(es) where the write data will be stored in the HDFS. Similarly, for read requests, an HDFS client will first contact the NameNode to determine the DataNode(s) and associated block addresses where the data necessary to transact the read request is stored. The client will then contact the DataNodes and request the data from the associated block addresses. In both instances, HDFS read requests and HDFS write requests, an HDFS client first contacts the NameNode with an overview of its request, and then waits for the NameNode to respond with the relevant information needed to continue processing the request.
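The read and write exchanges described above can be sketched as a small in-memory model. This is an illustrative sketch, not the HDFS client API: the classes `ToyNameNode`, `ToyDataNode`, and `ToyClient`, their method names, the byte-count threshold, and the round-robin allocation are all hypothetical, and replication is omitted for brevity.

```python
# Toy in-memory model of the client/NameNode/DataNode exchange.
# All class and method names are hypothetical, not part of HDFS.


class ToyDataNode:
    def __init__(self, name: str):
        self.name = name
        self.blocks: dict[int, bytes] = {}  # block address -> stored data

    def write_block(self, block_id: int, data: bytes) -> None:
        self.blocks[block_id] = data

    def read_block(self, block_id: int) -> bytes:
        return self.blocks[block_id]


class ToyNameNode:
    def __init__(self, datanode_names: list[str]):
        self.datanode_names = datanode_names
        self.namespace: dict[str, list[tuple[str, int]]] = {}
        self._next_block_id = 0

    def allocate_block(self, filename: str) -> tuple[str, int]:
        """Record the file name and return (DataNode name, block address)."""
        index = self._next_block_id % len(self.datanode_names)
        location = (self.datanode_names[index], self._next_block_id)
        self._next_block_id += 1
        self.namespace.setdefault(filename, []).append(location)
        return location

    def get_block_locations(self, filename: str) -> list[tuple[str, int]]:
        return list(self.namespace[filename])


class ToyClient:
    def __init__(self, namenode: ToyNameNode, datanodes: list[ToyDataNode],
                 threshold: int):
        self.namenode = namenode
        self.datanodes = {dn.name: dn for dn in datanodes}
        self.threshold = threshold
        self._buffers: dict[str, bytearray] = {}  # stands in for the temp file

    def write(self, filename: str, data: bytes) -> None:
        """Cache data locally; contact the NameNode once the threshold is met."""
        buffer = self._buffers.setdefault(filename, bytearray())
        buffer.extend(data)
        while len(buffer) >= self.threshold:
            self._flush(filename, self.threshold)

    def close(self, filename: str) -> None:
        """Flush whatever is still cached for this file."""
        buffer = self._buffers.get(filename)
        if buffer:
            self._flush(filename, len(buffer))

    def _flush(self, filename: str, size: int) -> None:
        buffer = self._buffers[filename]
        chunk = bytes(buffer[:size])
        del buffer[:size]
        # Step 1: ask the NameNode where the data should go.
        dn_name, block_id = self.namenode.allocate_block(filename)
        # Step 2: send the data directly to the DataNode it named.
        self.datanodes[dn_name].write_block(block_id, chunk)

    def read(self, filename: str) -> bytes:
        # Step 1: ask the NameNode for DataNodes and block addresses.
        locations = self.namenode.get_block_locations(filename)
        # Step 2: fetch each block from the DataNode that holds it.
        return b"".join(
            self.datanodes[dn_name].read_block(block_id)
            for dn_name, block_id in locations
        )


nodes = [ToyDataNode("dn1"), ToyDataNode("dn2")]
namenode = ToyNameNode([dn.name for dn in nodes])
client = ToyClient(namenode, nodes, threshold=4)
client.write("example.txt", b"hello world")
client.close("example.txt")
contents = client.read("example.txt")
```

In this sketch, as in the flow described above, the NameNode handles only metadata (file names and block locations), while the file data itself moves directly between the client and the DataNodes. Every read and every flushed write nonetheless begins with a call to the NameNode.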