Subtree backup is a common practice for protecting user data, since not all files in a computer system are equally important to a user. For example, a user might not want to back up the operating system (OS) image file, program caches, status files, etc. On a physical machine, a user can install a backup agent to achieve subtree backup easily. However, in a virtual environment, it is often not possible or practical to install a backup agent in every virtual machine (VM). Currently, a conventional VM backup solution requires mounting the VM to a proxy server for subtree backup, and such a solution is inconvenient and inflexible.
A VM can be protected in multiple ways (e.g., an image level backup or a file level backup). One disadvantage of protecting a virtual disk as a single file is that each backup is as large as the virtual disk itself, even if only minimal changes are present in the VM between backups. Recovering files from such a backup requires additional, complex operations: the virtual disk must be mounted as a guest file system using a third party tool, and the backup application must be able to identify and recover only the specific types of files. Another disadvantage is that a conventional subtree backup requires either deploying a backup agent in every VM or mounting the VM to a proxy server. The complexity of such a deployment depends on the size of the virtualization environment.
A file is the basic unit that an end user wants to protect on both physical and virtual machines. Currently, there are a variety of methods to protect files in a virtual environment, and each method has its advantages and disadvantages. For example, a backup agent can be installed in a VM just as in a physical machine. This is the simplest method, since it does not require any new design in the backup software. However, such a solution does not scale well in a virtual environment.
Virtualization vendors such as VMware provide a set of application programming interfaces (APIs) (e.g., VADP) that allow a backup application to mount a VM file system to a remote host. In this situation, the backup application mounts a VM to a proxy server and performs a file level backup on the mounted file system. On an incremental backup, the backup software walks the file system to determine which files have changed. However, walking the file system is slow, and it is also inconvenient to mount a VM on a proxy server.
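As an illustration of why the file system walk is costly, the following is a minimal sketch of a walk-based incremental scan. It assumes the VM file system is already mounted at a local path on the proxy server; the function names `scan_tree` and `changed_files`, and the choice of file size and modification time as the change signature, are hypothetical and are not taken from any particular backup product.

```python
import os


def scan_tree(root):
    """Walk every file under root and record (size, mtime_ns) per relative path.

    Every directory and file is visited on each run, so the cost grows
    with the total number of files, not with the number of changed files.
    """
    catalog = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                st = os.stat(full_path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            catalog[os.path.relpath(full_path, root)] = (st.st_size, st.st_mtime_ns)
    return catalog


def changed_files(previous, current):
    """Return paths that are new, or whose signature differs from the previous scan."""
    return sorted(path for path, sig in current.items() if previous.get(path) != sig)
```

In this sketch, the catalog from the previous backup is compared with a fresh scan, and only the returned paths would be copied to the backup target. Deleted files are not reported, and the whole tree is still traversed even when nothing has changed.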
Another conventional method uses a changed block tracking (CBT) feature provided by a virtual machine monitor (VMM) to keep track of changed data blocks. Under this approach, the backup application does not need to mount the VM to a proxy server. Typically, it pre-parses a virtual disk file to generate a file index and uses a VM backup API, such as VDDK available from VMware, to read the virtual disk file from the VMM and send the data to backup target storage. For an incremental backup, the CBT is used to generate a list of the blocks that changed between two snapshots, and only those blocks are backed up. Because only changed blocks are captured, it is very likely that an incremental backup contains only part of a file, and a user will need to read from multiple backups in order to recover a full file. If the backup target is traditional media such as tape, the recovery process could be very slow and costly. Furthermore, not all virtualization vendors provide a CBT feature, which limits the effective usage of this approach. Without CBT, the entire virtual disk file often has to be backed up.
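The partial-file problem can be illustrated with a toy model. The sketch below is not a VMM or VDDK interface: the disk is a byte string, the block size is deliberately tiny, and `changed_block_ids` merely stands in for a CBT query by comparing two snapshots directly. All names are hypothetical.

```python
BLOCK_SIZE = 4  # bytes per block; unrealistically small, for illustration only


def split_blocks(disk):
    """Split a disk image (bytes) into fixed-size blocks."""
    return [disk[i:i + BLOCK_SIZE] for i in range(0, len(disk), BLOCK_SIZE)]


def changed_block_ids(old_disk, new_disk):
    """Stand-in for a CBT query: ids of blocks that differ between snapshots."""
    old_blocks, new_blocks = split_blocks(old_disk), split_blocks(new_disk)
    return [i for i, block in enumerate(new_blocks)
            if i >= len(old_blocks) or old_blocks[i] != block]


def full_backup(disk):
    """A full backup stores every block, keyed by block id."""
    return dict(enumerate(split_blocks(disk)))


def incremental_backup(old_disk, new_disk):
    """An incremental backup stores only the changed blocks."""
    new_blocks = split_blocks(new_disk)
    return {i: new_blocks[i] for i in changed_block_ids(old_disk, new_disk)}


def restore_file(backup_chain, file_block_ids):
    """Rebuild a file from a chain of backups ordered oldest first.

    For each block, the newest backup containing it is used. Returns the
    file data and the indexes of the backups that had to be read.
    """
    data = b""
    backups_read = set()
    for block_id in file_block_ids:
        for index in range(len(backup_chain) - 1, -1, -1):
            if block_id in backup_chain[index]:
                data += backup_chain[index][block_id]
                backups_read.add(index)
                break
        else:
            raise KeyError(f"block {block_id} not found in any backup")
    return data, sorted(backups_read)
```

In this model, if a file occupies two blocks and only one of them changes between snapshots, the incremental backup holds half of the file, and `restore_file` must read both the full and the incremental backup to reassemble it.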