Distributed data processing systems may be used to process and analyze large datasets. An example of such a system is Hadoop, which provides data storage services to clients using a Hadoop Distributed File System (HDFS) and data processing services through a cluster of commodity computers or nodes. The HDFS executes on the cluster of computers (also referred to as compute nodes or processing nodes) to enable client access to the data in the form of logical constructs organized as blocks, e.g., HDFS blocks. The compute nodes operate mostly independently to achieve or provide results toward a common goal. Compute nodes of the cluster typically have their own private (e.g., shared-nothing) storage. As a result, the data stored in the storage of one compute node may not be accessible to another compute node.
This may be problematic because, if a particular compute node fails, the data stored in the storage of that compute node becomes inaccessible. Accordingly, in some cases, multiple copies of the distributed data set (e.g., replicas) are created, and a copy is stored in the storage of each of multiple (e.g., all) compute nodes of the cluster. However, additional instances of the distributed data set can result in data (or replica) sprawl. Data sprawl can become a problem because it increases the cost of ownership due, at least in part, to the increased storage costs. Further, data sprawl burdens the network resources needed to propagate changes to the replicas across the distributed processing system.
In some other cases, to avoid the above-described problem of increased cost of ownership, the storage of each compute node is configured to be shared among the compute nodes, at least for read operations. That is, a compute node can read data from the storage of other compute nodes. This avoids the problem of replica sprawl. However, since a particular block of data is stored in only a single storage, that storage can become a single point of failure of the distributed file system.
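The trade-off between the two approaches described above can be illustrated with a minimal sketch. The model below is an assumption made purely for illustration and does not reflect how HDFS or any particular system places data: blocks are numbered integers, one scheme stores every block on every node, and the other stores each block exactly once in round-robin order. All function names are hypothetical.

```python
from typing import Dict, Set

# Hypothetical model: a placement maps a node id to the set of block ids it stores.
Placement = Dict[int, Set[int]]


def place_full_replicas(num_blocks: int, num_nodes: int) -> Placement:
    """Every node stores a complete copy of the data set (replica sprawl)."""
    return {node: set(range(num_blocks)) for node in range(num_nodes)}


def place_single_copy(num_blocks: int, num_nodes: int) -> Placement:
    """Each block is stored exactly once, assigned round-robin (assumed policy)."""
    placement: Placement = {node: set() for node in range(num_nodes)}
    for block in range(num_blocks):
        placement[block % num_nodes].add(block)
    return placement


def stored_copies(placement: Placement) -> int:
    """Total number of block copies held across the cluster (storage cost)."""
    return sum(len(blocks) for blocks in placement.values())


def lost_blocks(placement: Placement, failed_node: int, num_blocks: int) -> Set[int]:
    """Blocks that no surviving node can serve after one storage fails."""
    surviving: Set[int] = set()
    for node, blocks in placement.items():
        if node != failed_node:
            surviving |= blocks
    return set(range(num_blocks)) - surviving


NUM_BLOCKS, NUM_NODES = 12, 4
replicated = place_full_replicas(NUM_BLOCKS, NUM_NODES)
shared = place_single_copy(NUM_BLOCKS, NUM_NODES)

# Full replication: high storage cost, no data lost when node 0 fails.
print(stored_copies(replicated), lost_blocks(replicated, 0, NUM_BLOCKS))
# Single shared copy: low storage cost, but node 0's blocks become unavailable.
print(stored_copies(shared), lost_blocks(shared, 0, NUM_BLOCKS))
```

Under these assumptions, full replication stores four times as many block copies as the single-copy scheme yet loses nothing on a single failure, whereas the single-copy scheme loses every block held by the failed storage.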
Thus, prior distributed file systems lack efficient data management techniques. There exists a need for efficient data management techniques that eliminate the problem of a storage being a single point of failure while minimizing the number of data replications.