In general, distributed file systems are designed to store large amounts of data and to give clients access to that data, which is distributed across a network. One well-known distributed file system is the Hadoop Distributed File System (HDFS). HDFS is structured to store very large amounts of data (terabytes or petabytes) and to provide high-throughput access to the stored data. In an HDFS storage system, files are stored redundantly across multiple server nodes to ensure durability in the event of failure and availability for highly parallel applications.
In current state-of-the-art implementations of HDFS, data accessed by an HDFS client is stored across a network of distributed data stores that are managed by corresponding data node servers and accessed via an HDFS protocol. The HDFS protocol utilizes a namenode function (or namenode server) to manage a namespace of the HDFS data stores. Compute nodes consult the namenode server to determine where to read data from and where to write data to. Version 1 of the HDFS protocol supports only one set of HDFS data stores per compute cluster because it supports only one namespace. Version 2 supports multiple HDFS namespaces per cluster as well as federation, which allows compute nodes to navigate to all namespaces and their corresponding data stores. The current HDFS protocols nevertheless have some limitations.
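The namenode lookup described above can be illustrated with a minimal sketch. It is a toy in-memory model, not the HDFS API: the class names `NameNode` and `FederatedClient`, the block and data node identifiers, and the prefix-based mount table are all hypothetical and chosen only for illustration. A single `NameNode` corresponds to the one-namespace situation of version 1, and a client holding several of them corresponds to the multiple-namespace situation of version 2.

```python
from dataclasses import dataclass, field


@dataclass
class NameNode:
    """Toy namenode (hypothetical): maps paths in ONE namespace to block locations."""
    namespace_id: str
    # path -> list of (block_id, [names of data nodes holding a replica])
    block_map: dict = field(default_factory=dict)

    def add_file(self, path, blocks):
        self.block_map[path] = blocks

    def locate(self, path):
        # A compute node asks the namenode where the data lives; the data
        # itself would then be read from the listed data nodes.
        if path not in self.block_map:
            raise FileNotFoundError(path)
        return self.block_map[path]


@dataclass
class FederatedClient:
    """Toy client (hypothetical): routes each path prefix to its own namenode."""
    mounts: dict  # path prefix -> NameNode

    def locate(self, path):
        # Keep only prefixes that match on a whole path component.
        matches = [p for p in self.mounts
                   if path == p or path.startswith(p.rstrip("/") + "/")]
        if not matches:
            raise FileNotFoundError(path)
        # The longest matching prefix selects the namespace.
        prefix = max(matches, key=len)
        return self.mounts[prefix].locate(path)


# Two independent namespaces, each with its own set of data nodes.
nn1 = NameNode("ns1")
nn1.add_file("/user/a.txt", [("blk_1", ["dn1", "dn2", "dn3"])])
nn2 = NameNode("ns2")
nn2.add_file("/archive/b.txt", [("blk_7", ["dn4", "dn5"])])

client = FederatedClient({"/user": nn1, "/archive": nn2})
print(client.locate("/archive/b.txt"))
```

In this sketch, a client that knows only `nn1` has no way to reach the blocks registered with `nn2`, which mirrors the single-namespace restriction; the mount table merely routes requests and does nothing to move or balance data between the two stores.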
As one example of these limitations, if a Hadoop compute cluster uses HDFS v1, additional data stores cannot be added to the original data stores of the compute cluster when the original data stores run out of storage space; that is, the original cluster of data stores cannot be expanded. Not only does this restrict expansion of the HDFS system, it also means that the original data store must be abandoned if a second HDFS data store is to be used. This circumstance arises, for example, when a customer deploys a Hadoop cluster with an internal data store on internal disk and later wants to use an external data store on a cluster file system. The only option is to abandon the existing data store and start using the new one, which is disruptive and time-consuming because large amounts of data must be transferred. Further, even if a customer is using HDFS v2, which supports the simultaneous use of multiple HDFS stores, there is no efficient way to balance and tier data between the different stores to optimize performance and/or capacity utilization, as well as data protection.