The Hadoop Distributed File System (HDFS) namespace is a hierarchy of files and directories. Files and directories are represented on the NameNode by Inodes. Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas. The file content is split into large data blocks (typically 128 MB), and each data block of the file is independently replicated at multiple DataNodes (typically three). The NameNode is the metadata service of HDFS and is responsible for namespace operations. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes. That is, the NameNode tracks the location of data within a Hadoop cluster and coordinates client access thereto. Conventionally, each cluster has a single NameNode. A cluster can have thousands of DataNodes and tens of thousands of HDFS clients, as each DataNode may execute multiple application tasks concurrently.
The Inodes and the list of data blocks that define the metadata of the name system are called the image. The NameNode keeps the entire namespace image in RAM. The persistent record of the image is stored in the NameNode's local native filesystem as a checkpoint plus a journal representing updates to the namespace carried out since the checkpoint was made.
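By way of illustration only, the relationship between Inodes, data blocks, and the block-to-DataNode mapping described above may be sketched as follows. This is a minimal toy model, not HDFS code: the names `ToyNameNode`, `Inode`, and `create_file` are hypothetical, and the round-robin replica placement is a placeholder assumption rather than the placement policy HDFS actually uses.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 128 * 1024 * 1024  # typical block size cited above
REPLICATION = 3                 # typical replication factor cited above


@dataclass
class Inode:
    """Hypothetical Inode: a few attributes plus the file's block list."""
    path: str
    permissions: int
    mtime: float
    block_ids: list = field(default_factory=list)


class ToyNameNode:
    """Hypothetical in-memory image: the Inodes and the block map."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.inodes = {}     # path -> Inode
        self.block_map = {}  # block id -> list of DataNode names
        self._next_block = 0

    def create_file(self, path, size_bytes, permissions=0o644, mtime=0.0):
        inode = Inode(path, permissions, mtime)
        n_blocks = -(-size_bytes // BLOCK_SIZE)  # ceiling division
        n_replicas = min(REPLICATION, len(self.datanodes))
        for _ in range(n_blocks):
            block_id = self._next_block
            self._next_block += 1
            # Placeholder placement (assumption): consecutive DataNodes,
            # chosen round-robin, so replicas land on distinct nodes.
            self.block_map[block_id] = [
                self.datanodes[(block_id + i) % len(self.datanodes)]
                for i in range(n_replicas)
            ]
            inode.block_ids.append(block_id)
        self.inodes[path] = inode
        return inode


nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
inode = nn.create_file("/user/example/data.bin", 300 * 1024 * 1024)
for block_id in inode.block_ids:
    print(block_id, nn.block_map[block_id])
```

In this sketch, a 300 MB file occupies three blocks (two full blocks and one partial block), and each block is mapped to three distinct DataNodes.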
A distributed system is composed of different components called nodes. To maintain system consistency, it may become necessary to coordinate various distributed events between the nodes. The simplest way to coordinate a particular event that must be learned consistently by all nodes is to choose a designated single master and record that event on the master so that other nodes may learn of the event from the master. Although simple, this approach lacks reliability, as failure of the single master stalls the progress of the entire system. In recognition of this, and as shown in FIG. 1, conventional HDFS implementations use an Active NameNode 102 that is accessed during normal operations and a backup called the Standby NameNode 104 that is used as a failover in case of failure of the Active NameNode 102.
As shown in FIG. 1, a conventional HDFS cluster operates as follows. When an update to the namespace is requested, such as when an HDFS client issues a remote procedure call (RPC) to, for example, create a file or a directory, the Active NameNode 102:
1. receives the request (e.g., RPC) from a client;
2. immediately applies the update to its memory state; and
3. writes the update as a journal transaction in shared persistent storage 106 (such as a Network Attached Storage (NAS) comprising one or more hard drives) and returns to the client a notification of success.
The Standby NameNode 104 must now update its own state to maintain coherency with the Active NameNode 102. Toward that end, the Standby NameNode 104:
4. reads the journal transaction from the transaction journal 106, and
5. updates its own state.
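The five steps above may be sketched, under stated assumptions, as a small in-memory simulation. All class and method names (`SharedJournal`, `ActiveNameNode`, `StandbyNameNode`, `handle_rpc`, `catch_up`) are hypothetical, the namespace is reduced to a set of paths, and the shared persistent storage 106 is modeled as a plain list; this is not a representation of actual HDFS interfaces.

```python
class SharedJournal:
    """Stand-in for the transaction journal in shared storage 106."""

    def __init__(self):
        self.transactions = []

    def append(self, txn):
        self.transactions.append(txn)

    def read_from(self, offset):
        # Return every transaction recorded at or after the given offset.
        return self.transactions[offset:]


class NamespaceState:
    """Toy in-memory namespace: just the set of existing paths."""

    def __init__(self):
        self.paths = set()

    def apply(self, txn):
        op, path = txn
        if op == "create":
            self.paths.add(path)
        elif op == "delete":
            self.paths.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")


class ActiveNameNode:
    def __init__(self, journal):
        self.journal = journal
        self.state = NamespaceState()

    def handle_rpc(self, op, path):   # step 1: receive the request
        txn = (op, path)
        self.state.apply(txn)         # step 2: apply to memory state
        self.journal.append(txn)      # step 3: write journal transaction
        return "success"              # ...and notify the client


class StandbyNameNode:
    def __init__(self, journal):
        self.journal = journal
        self.state = NamespaceState()
        self.applied = 0              # number of transactions replayed so far

    def catch_up(self):
        for txn in self.journal.read_from(self.applied):  # step 4: read
            self.state.apply(txn)                         # step 5: update
            self.applied += 1


journal = SharedJournal()
active = ActiveNameNode(journal)
standby = StandbyNameNode(journal)

active.handle_rpc("create", "/dir")
active.handle_rpc("create", "/dir/file")
active.handle_rpc("delete", "/dir/file")
standby.catch_up()
print(standby.state.paths == active.state.paths)
```

Note that in this sketch the Standby NameNode learns of updates only through the journal, so its state lags that of the Active NameNode until `catch_up` is invoked.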
This, however, is believed to be a sub-optimal solution. For example, in this scheme, the transaction journal 106 itself becomes the single point of failure. Indeed, upon corruption of the transaction journal 106, the Standby NameNode 104 can no longer assume the same state as the Active NameNode 102, and failover from the Active NameNode 102 to the Standby NameNode 104 is no longer possible.
Moreover, in Hadoop solutions that support only one Active NameNode per cluster, standby servers, as noted above, are typically kept in sync via Network Attached Storage (NAS) devices. If the Active NameNode fails and the standby has to take over, there is a possibility of data loss if a change written to the Active NameNode has yet to be written to the NAS. Administrator error during failover can lead to further data loss. Furthermore, a network failure may occur in which the active server cannot communicate with the standby server but can still communicate with the other machines in the cluster. If the standby server then mistakenly assumes that the active server is dead and takes over the active role, a pathological condition known as “split-brain” can occur, in which two nodes each believe that they are the Active NameNode. This condition can lead to data corruption.
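The split-brain condition described above may be illustrated with the following toy sketch. The `Node` class, the `standby_check` function, and the naive rule "no heartbeat means the active server is dead" are assumptions made purely for illustration and do not describe how any particular Hadoop failover controller behaves.

```python
class Node:
    """Hypothetical NameNode that accepts writes only while it is active."""

    def __init__(self, name, active=False):
        self.name = name
        self.active = active
        self.namespace = set()

    def write(self, path):
        if not self.active:
            raise RuntimeError(f"{self.name} is not active")
        self.namespace.add(path)


def standby_check(standby, heartbeat_received):
    # Naive failure detector (assumption): a missed heartbeat is taken as
    # proof that the active server is dead, so the standby promotes itself.
    if not heartbeat_received:
        standby.active = True


nn1 = Node("nn1", active=True)
nn2 = Node("nn2")

# The link between nn1 and nn2 fails, but nn1 is still alive and still
# reachable by the rest of the cluster.
standby_check(nn2, heartbeat_received=False)

# Clients on either side of the partition now write to different nodes.
nn1.write("/from-client-a")
nn2.write("/from-client-b")

print(nn1.active and nn2.active)       # both believe they are active
print(nn1.namespace == nn2.namespace)  # their namespaces have diverged
```

Because both nodes accept writes independently, neither namespace reflects all client updates, which is the divergence that can lead to data corruption.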