In the present day, data de-duplication processes are used to improve storage utilization by reducing the amount of data written to a drive, as well as to reduce the number of data bytes sent across a link during network data transfers.
In the de-duplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copies, and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency depends on the chunk size), the amount of data that must be stored or transferred can be greatly reduced.
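The chunk-and-reference scheme described above can be illustrated with a minimal sketch. It assumes fixed-size chunks and an in-memory store; the function names `deduplicate` and `reconstruct`, the 4-byte chunk size, and the sample data are hypothetical and chosen only for illustration.

```python
def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and store each unique chunk once.

    Returns the list of unique chunks and, for every input chunk, a small
    integer reference pointing at its stored copy.
    """
    store = []   # unique chunks, in order of first appearance
    index = {}   # chunk bytes -> position in `store`
    refs = []    # one reference per input chunk
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        if chunk not in index:
            # First time this byte pattern is seen: keep the actual bytes.
            index[chunk] = len(store)
            store.append(chunk)
        # Redundant or not, the chunk is represented by a reference.
        refs.append(index[chunk])
    return store, refs


def reconstruct(store, refs) -> bytes:
    """Rebuild the original data by following each reference."""
    return b"".join(store[r] for r in refs)


data = b"ABCDABCDXYZWABCD"
store, refs = deduplicate(data)
stored_bytes = sum(len(chunk) for chunk in store)
```

In this example the pattern `ABCD` occurs three times but is stored once, so 8 bytes of chunk data plus four references stand in for the 16 input bytes. A smaller chunk size tends to produce more matches at the cost of more references.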
Many advantages accrue to de-duplication of data in storage systems. In solid state storage systems, such as those that employ solid state NAND memory, it is well known that memory degradation occurs after a finite number of write operations are performed on the devices that constitute the NAND memory, as is also the case for other non-volatile memory technologies. Accordingly, the use of data de-duplication may improve the endurance of the memory because more unique data can be written during the lifetime of the NAND device due to the reduction in duplication. In addition, extra storage space may be created, which can be used in a solid state device as “shuffle space” for improving write input/output operations per second (IOPS). Furthermore, power consumption is reduced to the extent that the data de-duplication process reduces NAND write energy and device input/output power. Bus bandwidth and write speed to a solid state NAND memory are also improved because of the reduced amount of data to be written.
A conventional approach for data de-duplication involves the use of hash algorithms (HA), which produce hash digests whose size is typically 32 Bytes or smaller. By comparing just the hash digests, the determination of whether one data block is identical to another can be performed quickly. However, a 32 Byte hash digest results in about 7% overhead for a typical memory block size of 512 Bytes.
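The digest-based comparison and its storage overhead can be sketched as follows. The text does not name a particular hash algorithm; SHA-256 is assumed here only because it produces a 32 Byte digest, and the names `digest` and `write_blocks` are hypothetical. The sketch also treats equal digests as equal blocks, which assumes that collisions are negligible.

```python
import hashlib

BLOCK_SIZE = 512  # bytes, the block size cited in the text


def digest(block: bytes) -> bytes:
    """Return a 32-byte digest of the block (SHA-256 is an assumption)."""
    return hashlib.sha256(block).digest()


def write_blocks(blocks):
    """Store each distinct block once, keyed by its digest."""
    stored = {}   # digest -> block contents
    layout = []   # digest recorded for every logical block written
    for block in blocks:
        d = digest(block)
        if d not in stored:
            # Only the 32-byte digests are compared, not the 512-byte blocks.
            stored[d] = block
        layout.append(d)
    return stored, layout


blocks = [
    bytes([1]) * BLOCK_SIZE,
    bytes([2]) * BLOCK_SIZE,
    bytes([1]) * BLOCK_SIZE,  # duplicate of the first block
]
stored, layout = write_blocks(blocks)

digest_size = len(layout[0])
# Overhead measured against the full block, and against the space
# left over if the digest is carved out of the block itself.
overhead_vs_block = digest_size / BLOCK_SIZE
overhead_vs_payload = digest_size / (BLOCK_SIZE - digest_size)
```

For these figures the ratio is 6.25% when measured against the full 512 Byte block and roughly 6.7% when measured against the remaining 480 Bytes, which is in line with the approximate overhead stated above.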
In the realm of virtual machines (VM), applications and processes create copies of memory, which can be shared to reduce memory capacity needs. Memory de-duplication can reduce capacity needs so that more applications or VMs can run on the same machine. Additionally, the efficient sharing afforded by data de-duplication can enable more effective use of caches and reduce the energy required in maintaining multiple copies of shareable memory. However, current software approaches to sharing typically create software overhead and are inefficient.
It is with respect to these and other considerations that the present improvements have been needed.