Distributed data processing systems may be used to process and analyze large datasets. One such framework is Hadoop, which provides data storage services to clients using a Hadoop Distributed File System (HDFS) and data processing services through a cluster of commodity computers or nodes. The HDFS executes on the cluster of computers (also called compute nodes or processing nodes) to enable client access to the data in the form of logical constructs organized as blocks, e.g., HDFS blocks. The compute nodes operate mostly independently to achieve or provide results toward a common goal.
Such a data management framework enables a distributed data processing system (“system”), e.g., Hadoop, to support critical large-scale data-intensive applications. These data-intensive applications, however, require frequent automated system backups with zero or minimal application downtime. As a result, the ability to create a read-only, persistent, point-in-time image (PPI) (also referred to as a “snapshot”) of the files and directories in the system, together with their associated metadata, as they existed at a particular point in the past becomes important. This capability allows the exact state of the files and directories to be restored from the PPI in the event of a catastrophic failure of the system.
However, many distributed data processing systems, e.g., Hadoop, do not have a robust PPI creation ability because such systems logically maintain the file system metadata and the stored data separately. In Hadoop, a master node (known as the NameNode) maintains HDFS and tracks the file metadata. Further, each stored file in Hadoop is divided into data blocks that are replicated across various compute nodes (also known as DataNodes). When creating a PPI in such a framework, the system needs not only to compare and determine the changes to HDFS within a given timeframe but also to track the state of the multiple data blocks, and their replicas, that are associated with those changes. Such a process introduces significant latency.
FIG. 1 shows a timeline 100 illustrating the creation of PPIs in a Hadoop system using a known technique. The technique involves traversing each directory in HDFS and examining each file in each directory to identify files that have been modified, added, or accessed within a given timeframe. The technique utilizes the identified changes and the prior PPI to create a new PPI of HDFS. In FIG. 1, the earliest PPI of HDFS is represented by state n 102, where the state n includes file “a”.
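The full traversal described above can be sketched as follows. This is a minimal illustration only: it walks a local directory tree as a stand-in for HDFS, uses the modification time alone as the change criterion, and the function name `find_changed_files` is hypothetical rather than part of Hadoop.

```python
import os
import time


def find_changed_files(root, since, until=None):
    """Return paths under root whose modification time lies in [since, until].

    Hypothetical helper: a local directory tree stands in for HDFS here.
    """
    if until is None:
        until = time.time()
    changed = []
    # Every directory and every file is visited, regardless of how few changed.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = os.stat(path).st_mtime
            if since <= mtime <= until:
                changed.append(path)
    return sorted(changed)
```

Because every file is examined on each pass, the cost of the traversal grows with the total size of the namespace rather than with the number of changes.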
When PPI “Snap1” is created, the technique is utilized to traverse the current HDFS and determine that files “b” and “c” have been added and file “a” has been deleted since the last PPI (i.e. state n 102) was created. The technique then creates a new PPI of HDFS, represented by state n+1 106, by applying the determined changes to the state n 102 PPI of HDFS. Further, the technique tracks and creates PPIs of the multiple data blocks and their replicas that are associated with the changed files “a”, “b” and “c” in HDFS.
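The creation of state n+1 from state n can be illustrated with a small sketch. It assumes, for illustration only, that each state is modeled as a set of file names and that modified files are ignored; the function names `diff_states` and `apply_changes` are hypothetical.

```python
def diff_states(previous, current):
    """Return (added, deleted) file names between a prior PPI and the live namespace."""
    added = current - previous
    deleted = previous - current
    return added, deleted


def apply_changes(previous, added, deleted):
    """Build a new state from a prior PPI and the determined changes."""
    return (previous - deleted) | added


# State n 102 contains file "a"; the live namespace now holds "b" and "c".
state_n = frozenset({"a"})
live_namespace = frozenset({"b", "c"})

added, deleted = diff_states(state_n, live_namespace)
state_n_plus_1 = apply_changes(state_n, added, deleted)
```

The prior state is left unchanged, so both states remain available as independent PPIs.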
For instance, when the DataNode receives a request to create a local PPI of the stored data blocks, the DataNode creates a copy of the storage directory and hard links the existing block files into the directory. So, when the DataNode removes a block, the DataNode only removes the hard link. The old block replicas remain untouched in their old directories. The cluster administrator can choose to roll back HDFS to the PPI state when restarting the system. The DataNode restores the previously renamed directories and initiates a background process to delete block replicas created after the PPI was made. However, once the administrator has chosen to roll back, there is no provision to roll forward.
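The hard-link behavior described above can be sketched on a local file system. The sketch assumes a flat directory of block files on a file system that supports hard links; the function names `create_local_ppi` and `remove_block` are hypothetical and do not correspond to DataNode code.

```python
import os


def create_local_ppi(current_dir, ppi_dir):
    """Create ppi_dir and hard link every block file from current_dir into it."""
    os.makedirs(ppi_dir)
    for name in os.listdir(current_dir):
        src = os.path.join(current_dir, name)
        if os.path.isfile(src):
            # A hard link adds a second name for the same data; nothing is copied.
            os.link(src, os.path.join(ppi_dir, name))


def remove_block(current_dir, name):
    """Remove a block from the live directory; only this one link is dropped."""
    os.remove(os.path.join(current_dir, name))
```

After `remove_block` is called, the block data stays reachable through the link in the PPI directory, which is why removing a block from the live directory does not disturb the PPI.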
In FIG. 1, when the next PPI of HDFS is created, the technique utilizes the last PPI (i.e., state n+1 106) to determine the changes to HDFS between the last PPI and the time of the current PPI. The latest PPI of HDFS is maintained as state n+2 110, independent of states n and n+1. A user can utilize any of the PPIs 102, 106, 110 to roll back HDFS and the associated data to the state in which they existed at the time of the given PPI. Such a technique for PPI creation in Hadoop can thus be not only complex but also very slow.
Accordingly, the known PPI techniques for distributed processing systems are limited in their capabilities and suffer from at least the above constraints and deficiencies.