1. Field of the Invention
This invention relates to computer network attached storage systems, and, more particularly, to efficiently removing duplicate data blocks at a fine granularity from a storage array.
2. Description of the Related Art
As computer memory storage and data bandwidth increase, so do the amount and complexity of data that businesses manage. Large-scale distributed storage systems, such as data centers, typically run many business operations. Computer systems that include multiple clients interconnected by a network increasingly share a distributed storage system. If the distributed storage system has poor performance or becomes unavailable, company operations may be impaired or stopped completely. Such distributed storage systems seek to maintain high standards for data availability and high-performance functionality. As used herein, storage disks may be referred to as storage devices, as some types of storage technologies do not include disks.
Shared storage typically holds a large amount of data, which may include a substantial quantity of duplicate data. This duplicate data may result from an accidental process, such as independent end-users copying the same data. In addition, duplicate data may result from a deliberate process, such as creating a backup or replicating data. In other cases, duplicate data is simply incidental, such as when a shared storage system holds multiple virtual machine files, all of which are derived from a common template. Whatever the cause, removing duplicate data reduces the amount of storage utilized and the amount of data transferred during backups and other data transfer activities, which may increase performance. Additionally, reducing the amount of redundant data stored may improve storage efficiency and may reduce overall costs. Further, improved efficiencies may in turn enable the use of more expensive storage technologies, which may provide improved performance.
One example of a relatively cheap storage technology is the hard disk drive (HDD). HDDs generally comprise one or more rotating disks, each coated with a magnetic medium. These disks typically rotate at a rate of several thousand rotations per minute. In addition, a magnetic actuator is responsible for positioning magnetic read/write devices over the rotating disks. On the other hand, an example of a relatively expensive storage technology is Solid State Storage or a Solid-State Disk (SSD). A Solid-State Disk may also be referred to as a Solid-State Drive. SSDs may emulate an HDD interface, but utilize solid-state memory to store persistent data rather than electromechanical devices such as those found in an HDD. For example, an SSD may use Flash memory to store data. Without moving parts or mechanical delays, such an SSD may have lower read access latencies than hard disk drives. In some cases, write latencies for a solid-state device may be much greater than read latencies for the same device.
No matter what technology is used for storage, deduplication is often desired to improve storage efficiency. In many storage systems, software applications such as a logical volume manager or a disk array manager are used to allocate space on mass-storage arrays. However, these applications generally operate and provide mappings at a relatively coarse level of granularity. Consequently, locating and removing duplicate data may be limited to relatively large chunks of data, which in turn may lead to inefficient deduplication. Additionally, while deduplication can improve storage efficiency, deduplication can also slow down certain storage-related operations, such as write requests. The results of deduplication may also cause storage-related operations such as reads to run more slowly. Consequently, when and how deduplication is performed is important as well.
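The effect of granularity on deduplication described above can be illustrated with a minimal sketch. It is not taken from any particular storage system: the function name `count_unique_blocks`, the use of SHA-256 fingerprints, the fixed block sizes, and the sample data are all assumptions chosen for illustration only.

```python
import hashlib


def count_unique_blocks(data: bytes, block_size: int) -> tuple:
    """Return (total_blocks, unique_blocks) for fixed-size blocks.

    Hypothetical helper: each block is fingerprinted with SHA-256, and
    blocks with equal fingerprints are treated as duplicates.
    """
    fingerprints = set()
    total = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprints.add(hashlib.sha256(block).digest())
        total += 1
    return total, len(fingerprints)


# Illustrative data: two 8 KiB regions that share only their first 4 KiB.
shared = b"A" * 4096
data = shared + b"B" * 4096 + shared + b"C" * 4096

# At 8 KiB granularity the two regions differ, so nothing is deduplicated.
coarse = count_unique_blocks(data, 8192)
# At 4 KiB granularity the shared half is detected as a duplicate.
fine = count_unique_blocks(data, 4096)
```

In this sketch, the coarse pass sees two distinct blocks out of two, while the fine pass sees three distinct blocks out of four. The finer pass also computes and tracks twice as many fingerprints, which hints at the performance cost of fine-grained deduplication noted above.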
In view of the above, systems and methods for efficiently removing duplicate data blocks at a fine granularity from a storage array and subsequently accessing them efficiently are desired.