The present invention relates generally to storage systems and, more particularly, to a method and an apparatus for managing the scope of deduplication.
Recently, the use of virtual servers has become widespread in enterprises. Server virtualization improves manageability and server resource utilization and enables quick deployment of servers. With server virtualization, multiple virtual servers (i.e., virtual computing machines) can run on a single physical server.
A virtual server image is the data from which a virtual server is established. Such images can be grouped into categories whose members resemble one another, because virtual server images for servers using the same software (e.g., OS and application software) are similar to each other. Therefore, to prepare a large number of virtual server images, the writable snapshot function provided by storage systems is applied. With this method, an original model of a virtual server image having a specific set of software such as OS and application software is first prepared as a “Gold Image,” and then multiple snapshots of the “Gold Image” are created as bases of virtual server images. To deploy the images as actual virtual servers, each snapshot is further modified because the virtual servers have custom setups and variations. The difference data among the images is stored with the virtual server images in the storage systems. When the number of virtual servers becomes large, the total amount of data can be huge.
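The relationship between a “Gold Image” and its writable snapshots can be illustrated with a minimal sketch. It assumes a block-level snapshot in which each snapshot keeps only the blocks that differ from the shared base; the names GoldImage, WritableSnapshot, and diff_bytes are hypothetical and chosen for illustration only, and an actual storage system may implement writable snapshots differently.

```python
class GoldImage:
    """Hypothetical read-only base image, modeled as a sequence of blocks."""

    def __init__(self, blocks):
        self.blocks = tuple(blocks)  # immutable once prepared


class WritableSnapshot:
    """Hypothetical writable snapshot that stores only its difference data."""

    def __init__(self, gold):
        self.gold = gold
        self.diff = {}  # block index -> block data modified in this snapshot

    def write(self, index, data):
        if not 0 <= index < len(self.gold.blocks):
            raise IndexError("block index outside the image")
        # The Gold Image is never modified; the change is kept per snapshot.
        self.diff[index] = data

    def read(self, index):
        # Modified blocks come from the snapshot, all others from the base.
        return self.diff.get(index, self.gold.blocks[index])

    def diff_bytes(self):
        # Amount of difference data held by this snapshot alone.
        return sum(len(d) for d in self.diff.values())
```

In this sketch every snapshot starts with no difference data, and its footprint grows only as the corresponding virtual server is customized, which is why the total can still become large when many virtual servers are deployed.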
In order to avoid complex data management and excessive storage cost, data reduction is required, and deduplication is used as a method to reduce the amount of data held by enterprises. As shown in U.S. Pat. No. 7,870,105, with the deduplication technique, pieces of data to be stored in a storage system are compared with each other, and if two pieces are identical, one is replaced with link information that indicates the other. By using this technique, the amount of data stored in the storage system can be reduced. See also U.S. Patent Publication US2010/0199065. The entire disclosures of these two documents are incorporated herein by reference.
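The general idea of replacing identical data with link information can be sketched as follows. This is a minimal illustration, not the method of the cited documents: it assumes fixed blocks identified by a SHA-256 fingerprint, and the names DedupStore, chunks, and links are hypothetical.

```python
import hashlib


class DedupStore:
    """Toy block store in which identical blocks are stored once and linked."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> the single stored copy of the data
        self.links = {}   # logical address -> fingerprint (link information)

    def write(self, address, data):
        fp = hashlib.sha256(data).hexdigest()
        existing = self.chunks.get(fp)
        if existing is None:
            self.chunks[fp] = data  # first occurrence: keep the data itself
        elif existing != data:
            # Byte comparison guards against a fingerprint collision.
            raise ValueError("fingerprint collision")
        # Duplicate or not, the address only records a link to the stored copy.
        self.links[address] = fp

    def read(self, address):
        return self.chunks[self.links[address]]

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())
```

Writing the same block to two addresses in this sketch leaves one stored copy and two links, so the stored amount grows only with the number of distinct blocks.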
Because comparing a large amount of data consumes excessive computing resources such as processors and memory (including a key table for the comparison), a way to limit the target (scope) of the comparison in the deduplication process is necessary to achieve reasonable resource use and good deduplication performance. In addition, the limitation should follow a classification that groups equivalent or similar data together, so that deduplication remains effective at reducing data.
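One way to picture a limited comparison scope is a separate key table per scope, so that a write is compared only against data in the same scope. The sketch below is an assumption made for illustration, not the claimed apparatus; the class ScopedDedupStore, the scope labels, and the use of SHA-256 fingerprints are all hypothetical.

```python
import hashlib
from collections import defaultdict


class ScopedDedupStore:
    """Toy store that deduplicates only within a named scope."""

    def __init__(self):
        # scope -> {fingerprint -> data}; each scope has its own key table
        self.tables = defaultdict(dict)
        self.links = {}  # (scope, address) -> fingerprint

    def write(self, scope, address, data):
        """Store data and return True if it duplicated data in the same scope."""
        fp = hashlib.sha256(data).digest()
        table = self.tables[scope]  # only this scope's key table is consulted
        is_duplicate = fp in table
        if not is_duplicate:
            table[fp] = data
        self.links[(scope, address)] = fp
        return is_duplicate

    def read(self, scope, address):
        return self.tables[scope][self.links[(scope, address)]]

    def key_table_size(self, scope):
        # Number of entries that a lookup in this scope has to cover.
        return len(self.tables.get(scope, {}))
```

The trade-off is visible in the sketch: each lookup touches a smaller key table, but identical data placed in different scopes is stored more than once, which is why the scopes should group data that is likely to be equivalent or similar.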