Data deduplication may be characterized as a specialized data compression technique for eliminating duplicate copies of repeating data, thereby reducing the amount of storage needed for a given quantity of data. A current problem faced by entities that store data is the complicated, time-consuming process that is typically involved in evaluating and comparing vendor designs of deduplication and compression systems employed in data storage. Another problem currently facing such entities is the need to protect their confidential information, as well as their legal duty to protect personally identifiable information (PII) stored in their data storage systems, from exposure outside the entity.
The types of systems that such entities may be interested in evaluating are typically data storage systems designed to store large amounts of data. Such data storage systems may be referred to in the industry as block storage devices. Such block storage devices may include, for example, both disc-storage systems and flash-based storage systems. In a block storage device, each individual block of data may have a particular size, such as 4096 bytes, and each block stored on that storage device is accessible by a unique address that may be referred to as a logical block address (LBA).
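By way of illustration only, the block layout described above, and the way duplicate blocks reduce the storage actually needed, may be sketched as follows. The sketch assumes fixed 4096-byte blocks, treats a list index as a stand-in for the LBA, and uses SHA-256 digests to recognize identical blocks. The names split_into_blocks and dedup_ratio are hypothetical and are not taken from any vendor design.

```python
import hashlib

BLOCK_SIZE = 4096  # example block size in bytes, as in the text above


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split raw bytes into fixed-size blocks.

    The position of a block in the returned list stands in for its
    logical block address (LBA). A short final block is zero-padded
    (an assumption made for this sketch).
    """
    blocks = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        blocks.append(block.ljust(block_size, b"\x00"))
    return blocks


def dedup_ratio(blocks):
    """Return logical blocks divided by distinct blocks.

    Blocks are compared by SHA-256 digest, so identical content at
    different LBAs is counted once.
    """
    unique_digests = {hashlib.sha256(block).digest() for block in blocks}
    if not unique_digests:
        return 1.0
    return len(blocks) / len(unique_digests)


# Hypothetical device image: three identical blocks followed by one different block.
device = split_into_blocks(b"A" * (3 * BLOCK_SIZE) + b"B" * BLOCK_SIZE)
ratio = dedup_ratio(device)
```

In this hypothetical image, four logical blocks contain only two distinct blocks, so the ratio of logical to distinct blocks is 2.0.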
Typically, an entity, such as a financial institution, health care organization, or government agency, may not be permitted to take copies of data that contains PII outside of the entity. Therefore, it is typically not possible for such entities to allow use of copies of data at a vendor site or in a laboratory environment that is not controlled by the entity for evaluation of the vendor's system designs. A traditional approach to this problem has been for an entity to engage with a systems vendor and have the vendor provide its product to the entity for evaluation. The entity may then install the vendor's product in one of the entity's own facilities and perform an evaluation of the vendor's system design under the entity's control. That traditional evaluation process may typically take six months or more to complete.
There is a present need for methods and systems that enable rapid evaluation of potential vendor designs of deduplication and compression systems and that ensure an apples-to-apples comparison of competing designs. There is presently a further need for methods and systems that assure that the results of tests to evaluate those designs are valid against one another and that they do not expose any PII or confidential information of entities that is stored in the systems of such entities.
Other types of systems that entities may be interested in evaluating may comprise, for example, file systems including, without limitation, disc-based file systems, network-based file systems, and virtual file systems. A relatively simple example of a file system may be a C: drive of a computer having the WINDOWS® operating system installed. The C: drive of the computer and all of the data and files on the C: drive may be characterized as an example of a single self-contained file system that makes no reference to anything outside itself.
A common occurrence may be copying the same file multiple times to multiple different places on the C: drive. Consequently, the same data may be stored multiple times in the file system of the computer. Similarly, in the case, for example, of a file server in an organization, as documents are sent back and forth between various people in the organization, such documents may be repeatedly saved and resaved. Such repeated storage of copies of the same data is an inefficient use of storage space.
There is also a present need for deduplication and compression methods and systems for evaluating file systems to enable recognition of data that was already stored and to store, for example, only a reference to such data.
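The idea of recognizing data that was already stored, and storing only a reference to it, may be illustrated by the following minimal sketch. It assumes whole-file comparison by SHA-256 digest and an in-memory store. The class DedupStore, its methods, and the example paths are hypothetical and serve only to illustrate the concept.

```python
import hashlib


class DedupStore:
    """Toy file store that keeps one physical copy per distinct content."""

    def __init__(self):
        self.contents = {}    # digest -> file bytes (physical copies)
        self.references = {}  # path -> digest (one reference per saved file)

    def save(self, path, data):
        """Save a file; return True only if new physical data was stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.contents
        if is_new:
            self.contents[digest] = data
        # A reference is always recorded, whether or not the data was new.
        self.references[path] = digest
        return is_new

    def load(self, path):
        """Follow the stored reference back to the file content."""
        return self.contents[self.references[path]]

    def physical_bytes(self):
        """Total bytes held in physical copies."""
        return sum(len(data) for data in self.contents.values())


# Hypothetical usage: the same document saved in two places on a C: drive.
store = DedupStore()
report = b"quarterly report contents"
first_saved = store.save("C:/docs/report.doc", report)
second_saved = store.save("C:/backup/report.doc", report)
```

In this sketch the second save stores only a reference, so the physical storage used equals the size of a single copy while both paths remain readable.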