Distributed data processing systems may be used to process and analyze large datasets. One such framework is Hadoop, which provides data storage services to clients using the Hadoop Distributed File System (HDFS) and data processing services through a cluster of commodity computers or nodes. HDFS executes on the cluster of computers (also called compute nodes or processing nodes) to enable client access to the data in the form of logical constructs organized as blocks, e.g., HDFS blocks. The compute nodes operate mostly independently to produce results toward a common goal.
In many enterprise data networks, a mix of different distributed file systems is used to manage the data stored within the networks. For instance, many enterprise data networks use Network File System (NFS) to provide data storage services to clients while using HDFS with Hadoop to provide data processing services for the stored data. In such networks, to perform data analytics on the stored data using Hadoop, a new HDFS cluster needs to be created by copying (or moving) data stored within NFS into the new HDFS cluster. The newly created HDFS cluster requires not only dedicated infrastructure (e.g., compute nodes, storage devices, etc.), but also explicit copy management to ensure that all copies of a given piece of data within the network remain the same. Further, in such networks, data analytics on the data stored within NFS can be performed only after the copying (or moving) completes and the data is fully available at the new HDFS cluster.
Thus, prior distributed file systems lack efficient data management techniques. There exists a need for efficient data management techniques that address at least some of the issues raised above.