The exponential growth of digital information, attributed to faster processors, lower-cost digital data storage, increasing availability of high-data-rate access, and the development of new applications, has increased the demand for computer storage. This increased dependence on computer data and data storage creates a need for more efficient data analysis technology.
With the increasing availability of low-cost, high-volume data storage devices, an increasing amount of data can be stored on an individual logical storage device, such as a physical disk drive, tape drive, or optical drive. Consumer hard drives, for example, have recently exceeded a terabyte of data storage capacity to meet the increasing demands for electronic storage. However, efficiently managing large amounts of data is burdensome and costly.
Technologies such as de-duplication have attempted to address some of the burden of managing large amounts of data by eliminating redundant data to improve storage utilization. In the de-duplication process, duplicate data on a logical storage device is deleted, leaving only one copy of the data along with references to that one copy. De-duplication reduces the required storage capacity because only the unique data is stored. Each subsequent instance of duplicated data is simply referenced back to the one saved copy.
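The de-duplication process described above can be illustrated with a minimal sketch. It assumes block-level de-duplication keyed on a SHA-256 digest; the function name `deduplicate`, the digest choice, and the sample blocks are hypothetical and chosen only for illustration.

```python
import hashlib

def deduplicate(blocks):
    """Keep one copy of each unique block plus a reference per input block."""
    store = {}       # digest -> the single saved copy of that block
    references = []  # one digest per input block, pointing back into the store
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only the first instance is stored
        references.append(digest)        # later instances are references only
    return store, references

# Illustrative data: five blocks, three of them unique.
blocks = [b"alpha", b"beta", b"alpha", b"alpha", b"gamma"]
store, references = deduplicate(blocks)

# The original sequence can be rebuilt from the references.
restored = [store[digest] for digest in references]
```

In this sketch the store holds three blocks while five references are retained, which is the storage saving that de-duplication aims for.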
To maximize the benefits of de-duplication, it is advantageous to aggregate, on a single logical storage device, data files having maximum similarity to one another. However, it is time-consuming and computationally intensive to compare each data block of one file, for example, to each data block of another file to determine the similarity between the two files. The computational complexity is further increased for larger files, such as those that may be associated with a Virtual Machine (VM).
A VM is normally represented as a set of files, including one or more configuration files and one or more disk image files. A configuration file stores the configuration (settings) of the virtual machine. A virtual machine disk image file represents the operating system and data contained within the virtual machine and itself typically includes numerous individual files. A single VM disk image file can exceed several gigabytes of storage space, and a single logical storage device can contain numerous VMs, up to the capacity of the storage device.
Previous efforts to determine similarity between files relied on a “brute force” method. The brute force method utilizes set similarity based on determining both the intersection and the union of all data blocks of each file undergoing comparison. For example, to determine similarity between VM ‘A’ and VM ‘B’, the following “brute force” equation has been utilized: S(A,B)=|A∩B|/|A∪B|, where ‘A’ is the set of data blocks (or corresponding hash values) of VM ‘A’, ‘B’ is the set of data blocks (or corresponding hash values) of VM ‘B’, ‘∩’ is the intersection operator, and ‘∪’ is the union operator. However, this technique is time-consuming and processor-intensive, especially for larger files, because every block of each file is processed.
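A minimal sketch of the brute force equation S(A,B)=|A∩B|/|A∪B| follows. The function name `jaccard_similarity`, the treatment of two empty files, and the sample hash values are assumptions made for illustration.

```python
def jaccard_similarity(blocks_a, blocks_b):
    """Compute S(A,B) = |A ∩ B| / |A ∪ B| over every block of both files."""
    set_a, set_b = set(blocks_a), set(blocks_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty files are treated as identical
    return len(set_a & set_b) / len(union)

# Hypothetical block hash values for two VMs.
vm_a = ["h1", "h2", "h3", "h4"]
vm_b = ["h3", "h4", "h5", "h6"]

# Two shared blocks out of six distinct blocks.
similarity = jaccard_similarity(vm_a, vm_b)
```

Both sets must be materialized in full before the result is known, which is why the cost grows with the total number of blocks in the two files.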
Other techniques utilize a Broder equation to reduce the cost of the brute force method by comparing only a subset of sorted data blocks of the files being compared. This technique avoids some of the issues of the brute force method by limiting the determination of the intersection of data blocks of the files to a predetermined number (k) of data blocks and by eliminating the determination of the union of all data blocks of each file being compared.
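One common formulation of this idea is the bottom-k min-wise sketch, shown below as a hedged illustration; it is not necessarily the exact equation referred to above. Each file is reduced to its k smallest block hashes, and similarity is estimated from those sketches alone. The names `block_hash`, `bottom_k`, and `estimate_similarity`, the use of a truncated SHA-256 digest, and the sample files are assumptions.

```python
import hashlib

def block_hash(block: bytes) -> int:
    """Map a block to a 64-bit integer (assumption: truncated SHA-256)."""
    return int.from_bytes(hashlib.sha256(block).digest()[:8], "big")

def bottom_k(blocks, k):
    """Return the k smallest distinct block hashes in sorted order."""
    return sorted({block_hash(b) for b in blocks})[:k]

def estimate_similarity(blocks_a, blocks_b, k):
    """Estimate set similarity from the two k-element sketches only."""
    sketch_a = set(bottom_k(blocks_a, k))
    sketch_b = set(bottom_k(blocks_b, k))
    # The k smallest hashes of the combined sketches stand in for the union.
    smallest = sorted(sketch_a | sketch_b)[:k]
    if not smallest:
        return 1.0  # assumption: two empty files are treated as identical
    shared = sum(1 for h in smallest if h in sketch_a and h in sketch_b)
    return shared / len(smallest)

# Hypothetical files: 1000 blocks each, 500 of them in common.
file_a = [f"block-{i}".encode() for i in range(1000)]
file_b = [f"block-{i}".encode() for i in range(500, 1500)]

estimate = estimate_similarity(file_a, file_b, k=100)     # approximate
exhaustive = estimate_similarity(file_a, file_b, k=2000)  # k covers every block
```

When k is at least the number of distinct blocks across both files, the estimate coincides with the brute force result; smaller values of k trade accuracy for less work. Note that building each sketch still requires hashing and sorting the file's blocks once.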
One limitation of the Broder technique of computing file similarity is its inability to efficiently re-compute the similarity of two files previously compared. Re-computing the similarity of the files is appropriate after data blocks in one or both of the files change. A change can occur when an existing block is removed or modified, or when a new block is created in the file. Previous techniques lacked the ability to incrementally adjust the previously computed similarity without re-computing the Broder equation, which involves at least re-sorting all of the data blocks of the files and reintroduces the problems associated with the brute force method.
Another problem with the Broder technique is that it introduces variance in the accuracy of the similarity comparison. Under the Broder technique, the accuracy of the similarity comparison is a function of the number (k) of sorted data blocks utilized in the comparison. The lower the number (k), the less accurate the similarity comparison will be. The higher the number (k), the more accurate the similarity comparison will be. However, increasing the number (k) leads back to the original problem of the “brute force” method, in which the computational complexity and time commitment outweigh the usefulness of computing the similarity.
Therefore, the problems of computational complexity, high time commitment, and poor accuracy when incrementally determining the similarity of large files have thus far not been addressed, and they hinder current efforts to efficiently utilize data storage devices to manage and organize electronic information.