Virtual machines may store their data in virtual devices referred to as virtual storage disks. Virtual storage disks may be implemented as files in the host file system of a virtual machine server. A guest operating system running within a virtual machine may treat a virtual storage disk as a block storage device, typically creating and using a file system within the virtual storage disk.
Virtual machines are often based on a master virtual machine image (e.g., a golden image) that acts as a template for creating additional virtual machines. In other words, a master virtual machine may be used to create similar or identical virtual machines that include instances of an OS and other applications from the master virtual machine image. A server may include thousands of virtual machines that are derived from a common master virtual machine image. Accordingly, the virtual machines on the server may include significant amounts of duplicate data, and deduplicating data on the server may provide significant storage space savings.
As the virtual machines on the server run, they may write data to their virtual storage disks, and the data on those disks may begin to diverge. Over time, deduplication on the server may become less and less effective. This may occur even when the virtual storage disks still have many identical blocks of data, if those blocks are not arranged in the same sequence within a deduplication-sized chunk of data. Furthermore, deduplication is often performed on relatively large data chunks (deduplication may be more efficient when performed on large chunks), which may leave a significant amount of duplicate data that is not deduplicated. What is needed, therefore, is a more efficient mechanism for handling deduplication in the context of multiple virtual storage disks.
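The effect of block ordering on chunk-level deduplication may be illustrated with a minimal sketch. The sketch assumes fixed-size chunking with SHA-256 content hashes, and the names `split`, `stored_bytes`, `BLOCK_SIZE`, and `CHUNK_SIZE`, as well as the tiny sizes used, are hypothetical choices made for illustration only; it is not a description of any particular deduplication system.

```python
import hashlib

BLOCK_SIZE = 4    # bytes per block (unrealistically small, for illustration)
CHUNK_SIZE = 16   # bytes per deduplication chunk (four blocks)


def split(data: bytes, size: int) -> list:
    """Split data into consecutive fixed-size units."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def stored_bytes(disks: list, unit_size: int) -> int:
    """Bytes kept after deduplicating all disks at the given unit size."""
    seen = set()
    for disk in disks:
        for unit in split(disk, unit_size):
            # Units with identical content hash to the same digest
            # and are therefore stored only once.
            seen.add(hashlib.sha256(unit).digest())
    return len(seen) * unit_size


# Two virtual disks holding the same four blocks in a different order.
disk_a = b"AAAA" + b"BBBB" + b"CCCC" + b"DDDD"
disk_b = b"BBBB" + b"AAAA" + b"DDDD" + b"CCCC"

chunk_level = stored_bytes([disk_a, disk_b], CHUNK_SIZE)
block_level = stored_bytes([disk_a, disk_b], BLOCK_SIZE)
print(chunk_level, block_level)
```

In this sketch, both disks contain identical blocks, yet deduplicating at the chunk size finds no match because the two chunks differ as a whole, whereas deduplicating at the block size stores each distinct block once.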