A conventional virtual machine system provides a system snapshot service to users. That is, a complete snapshot is taken of a virtual machine disk image. A virtual machine snapshot backup system is a sub-system of the virtual machine system and manages all historical data of the virtual machine users, at the petabyte (PB) level. Thus, increasing the storage efficiency of the virtual machine snapshot backup system is very important for reducing virtual machine usage costs for the users and for increasing the storage utilization efficiency of machine clusters.
To handle large-scale data backup requests in real time and efficiently exclude duplicate data, the virtual machine snapshot backup system may need to meet at least three conditions. The first is a high data processing speed, such that the backup of thousands of virtual machines can be completed within three hours at night every day. The second is an excellent de-duplication effect that excludes most redundant data (for example, removing at least 70% of the redundant data). The third is low resource utilization. The virtual machine snapshot backup system, as a sub-system of the virtual machine system, cannot compete with other important modules of the virtual machine system for resources. Otherwise, the user experience of the virtual machines would be affected.
One example of a conventional technique for de-duplication of virtual machine snapshot backups is the EBS snapshot store technique provided by the cloud computing platform of Amazon™. For details, see http://aws.amazon.com/ebs/. The technique divides each virtual machine disk into fixed-size blocks of 4 MB and tracks the change information of each block during usage. If a block is determined to have not changed since the preceding backup snapshot, that block of data is not backed up. Another example of a conventional technique is a backup de-duplication storage server provided by a specialized storage technology provider such as EMC™, which divides the backup data into data blocks according to content characteristics and detects redundant data by means of a hash check.
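The fixed-size block tracking described above can be illustrated with a minimal sketch. It is not the implementation of any provider; the class `TrackedDisk` and its methods are hypothetical names introduced here, and the only detail taken from the text is the 4 MB block size.

```python
from typing import Dict, Set

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB, the block size stated in the text


class TrackedDisk:
    """Toy virtual disk (hypothetical) that records which fixed-size blocks were written."""

    def __init__(self, size: int, block_size: int = BLOCK_SIZE) -> None:
        self.data = bytearray(size)
        self.block_size = block_size
        self.dirty: Set[int] = set()  # indices of blocks changed since the last snapshot

    def write(self, offset: int, payload: bytes) -> None:
        """Write payload at offset and mark every block it touches as changed."""
        if offset < 0 or offset + len(payload) > len(self.data):
            raise ValueError("write outside disk bounds")
        if not payload:
            return
        self.data[offset:offset + len(payload)] = payload
        first = offset // self.block_size
        last = (offset + len(payload) - 1) // self.block_size
        self.dirty.update(range(first, last + 1))

    def snapshot(self) -> Dict[int, bytes]:
        """Return only the blocks changed since the preceding snapshot, then reset tracking."""
        blocks = {
            i: bytes(self.data[i * self.block_size:(i + 1) * self.block_size])
            for i in sorted(self.dirty)
        }
        self.dirty.clear()
        return blocks
```

In this sketch, a block that receives no write between two calls to `snapshot()` is simply absent from the second result, which corresponds to the block not being backed up.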
The technique of Amazon™ determines the data to be backed up solely based on the data revision record of a single virtual machine. It has at least the following disadvantages. First, even if the data in a block is revised by only one byte, the whole block needs to be backed up. In addition, in the scenario where different users back up the same data, such as an operating system and various frequently used applications, the disk locations of such data may not be uniform due to differences in user behavior. The technique of Amazon™ cannot detect this kind of redundant data.
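The second disadvantage, identical data stored at different disk locations, can be illustrated with a small experiment. The sketch below is an assumption-laden toy, not the method of any provider: all function names are hypothetical, the block and window sizes are arbitrary, and a CRC over a trailing window stands in for the rolling fingerprint that a production content-based chunker would typically use.

```python
import hashlib
import random
import zlib
from typing import List, Tuple


def fixed_chunks(data: bytes, size: int) -> List[bytes]:
    """Split data at fixed offsets, as a fixed-size block scheme does."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> List[bytes]:
    """Cut wherever a hash of the trailing `window` bytes has its low bits all zero.

    The cut decision depends only on local content, so the same data produces
    the same cut points regardless of where it sits on the disk.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared_chunks(a: List[bytes], b: List[bytes]) -> int:
    """Count distinct chunks present in both lists, compared by SHA-256 digest."""
    digests_a = {hashlib.sha256(c).digest() for c in a}
    digests_b = {hashlib.sha256(c).digest() for c in b}
    return len(digests_a & digests_b)


def make_shifted_images() -> Tuple[bytes, bytes]:
    """Two toy disk images holding the same content, the second shifted by 7 bytes."""
    rng = random.Random(0)
    common = bytes(rng.getrandbits(8) for _ in range(4096))
    return common, b"userlog" + common


if __name__ == "__main__":
    image_a, image_b = make_shifted_images()
    print("shared fixed-size blocks:",
          shared_chunks(fixed_chunks(image_a, 256), fixed_chunks(image_b, 256)))
    print("shared content-defined chunks:",
          shared_chunks(content_defined_chunks(image_a), content_defined_chunks(image_b)))
```

Because the second image shifts every byte by the same small offset, no fixed-size block lines up with its counterpart, whereas the content-based cut points move together with the data and all chunks after the first cut are found again.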
Although the technique of EMC™ may exclude redundant data according to data characteristics, the price of its specialized storage server is very high, and the technique cannot meet the backup requirements of virtual machine clusters at TB level. Nor are such techniques compatible with a cloud computing system characterized by low cost and huge data volume.