Disaster recovery (DR) is used in the event of system failures. Most DR processes take a long time to recover data. Even with the advent of disk-based backup targets, the time to recover remains significant. As a result, these processes do not meet the recovery time objectives (RTOs) and recovery point objectives (RPOs) of today's business needs. Block-based backup may meet the RPO through faster incremental backups, which use change block tracking and avoid walking the file system. However, it still does not reduce the time for disaster recovery, because the data has to be copied from the target back to the server.
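As an illustration of why change block tracking speeds up incremental backup but not recovery, the following is a minimal sketch in Python. It is a toy in-memory model, not any product's implementation; the names TrackedVolume, full_backup, incremental_backup, and restore, as well as the 4096-byte block size, are assumptions made for the example.

```python
BLOCK_SIZE = 4096  # bytes per block; illustrative value, not from the source


class TrackedVolume:
    """Toy in-memory volume that records which blocks changed since the last backup."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.changed = set()  # indices of blocks written since the last backup

    def write_block(self, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block long")
        self.blocks[index] = data
        self.changed.add(index)  # tracking happens at write time, so no file system walk is needed


def full_backup(volume):
    """Copy every block and reset the change tracker."""
    copy = dict(enumerate(volume.blocks))
    volume.changed.clear()
    return copy


def incremental_backup(volume):
    """Copy only the blocks written since the previous backup."""
    delta = {i: volume.blocks[i] for i in sorted(volume.changed)}
    volume.changed.clear()
    return delta


def restore(full, incrementals):
    """Rebuild the volume image; every block is still copied back to the host."""
    image = dict(full)
    for delta in incrementals:  # apply incrementals oldest first
        image.update(delta)
    return [image[i] for i in range(len(image))]
```

In this sketch the incremental backup touches only the changed blocks, whereas restore still copies the whole image, which mirrors the recovery-time limitation described above.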
Block-based backup copies residing on Data Domain disks are created in the virtual hard disk (VHD) or VHDx format. This format allows the VHDx backup copy to be attached to the native client machine through various protocols, such as the Common Internet File System (CIFS) protocol or the DDBoost™ protocol from EMC Data Domain®, both of which provide access over TCP networks. During disaster recovery, however, blocks must still be copied from the VHDx files to the host disks, which takes a relatively long time. Such a delay is sometimes unacceptable during recovery.
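The cost of copying blocks back to the host can be approximated as data size divided by sustained throughput. The following Python sketch shows that arithmetic; the function names and the sample figures (a 2 TiB image, 100 MiB/s throughput, a 4-hour RTO) are hypothetical values chosen for illustration and do not come from any measurement.

```python
def estimated_restore_seconds(data_bytes, throughput_bytes_per_sec):
    """Lower bound on copy time: size divided by sustained throughput."""
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return data_bytes / throughput_bytes_per_sec


def meets_rto(data_bytes, throughput_bytes_per_sec, rto_seconds):
    """True if the estimated copy time fits within the recovery time objective."""
    return estimated_restore_seconds(data_bytes, throughput_bytes_per_sec) <= rto_seconds


if __name__ == "__main__":
    TIB = 1024 ** 4
    MIB = 1024 ** 2
    # Hypothetical figures for illustration only.
    seconds = estimated_restore_seconds(2 * TIB, 100 * MIB)
    print(f"estimated restore time: {seconds / 3600:.2f} hours")
    print("meets 4-hour RTO:", meets_rto(2 * TIB, 100 * MIB, 4 * 3600))
```

Under these assumed figures the copy alone takes close to six hours, so a 4-hour RTO would be missed regardless of how quickly the backup itself was taken.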
In addition, a backup copy represents a point-in-time (PIT) snapshot of the application or business data. Business data comprises types such as relational databases, mail, file systems, and virtual machines (VMs). Data analytics can also be performed on backup copies if they are available in a format directly consumable by analytics software. This trend is also referred to as offline data mining, i.e., data mining performed on a PIT view of the data (e.g., backups). Even once the backup is done, there are still technical challenges in facilitating clustered data analysis.
In the new age of data analytics, the above problem may be addressed using frameworks such as Hadoop™, where data is stored across multiple machines using the Hadoop Distributed File System (HDFS). Example products supporting HDFS include OneFS™ from EMC Isilon™, where data is stored across multiple machines. Using HDFS as a backend, multiple clusters can be configured to access the same dataset and perform data analysis in parallel or as a cluster. However, this requires additional system resources and adds overhead in terms of system complexity.