Embodiments of the present invention relate in general to storage systems and more specifically to providing systems that combine efficient reliable storage and deduplication.
Enterprises generally implement two broad types of storage infrastructure for storing their application data. Direct-attached storage (DAS) includes storage devices or disks that are attached to the systems that run the application workloads. The devices and the stored data in DAS are private and typically reside in one or two server systems or virtual machines hosted on the physical server hardware. Various high-availability configurations can be set up using server hardware. The data in DAS devices can be replicated to other storage systems, which usually requires replication features provided with certain application software. Networked enterprise storage infrastructure, the other type of storage infrastructure, typically includes a storage area network (SAN) and network-attached storage (NAS). SAN and NAS provide the infrastructure to share storage devices and data over Fibre Channel or Ethernet networks with a large number of server systems and applications within the enterprise data centers. The data in a SAN can be replicated across two or more data centers to meet high-availability and disaster recovery requirements by using network infrastructure that spans geographically dispersed data centers.
Information technology managers have the critical task of protecting application data against loss due to hardware failures, security breaches, power outages, and natural disasters, to name a few. Redundant Array of Independent Disks (RAID) is the technology commonly used in DAS, and also in SAN-based networked storage subsystems, to provide reliable data protection against physical disk failure within an array in the subsystems. When multiple physical disks are set up to use the RAID technology, they are said to be in a RAID array. Although the array itself is distributed across multiple disks, the array is seen by the computer user and operating system as a single disk. The operating system accesses the single logical disk, and the RAID adapter handles the data distribution across the multiple disks in the array based on the RAID level with which the array is configured. There are a number of RAID levels, including RAID 1, also called mirroring, which writes the same copy of data across all disks. RAID 5 includes block-level striping with distributed parity. The parity information is distributed across all the disks in the RAID array. If one disk in the array fails, there is no data loss because the data on the failed disk can be rebuilt onto a replacement disk from the data and parity on the surviving disks. RAID 5 typically includes storing each symbol, or block, of a codeword on a different disk to support recovery of the codeword if a disk in the array fails.
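The single-disk recovery property described above can be illustrated with a minimal sketch. It assumes bytewise XOR parity over one stripe, which is the usual way RAID 5 parity is computed; the block contents, the three-data-disk layout, and the helper name xor_blocks are hypothetical and chosen only for illustration. Parity rotation across stripes is omitted.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe: three data blocks, each destined for a different disk
# (illustrative values, not taken from any real system).
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]

# The parity block for the stripe is stored on a fourth disk.
parity = xor_blocks(data_blocks)

# Simulate the loss of the disk holding data_blocks[1]. XOR-ing the
# surviving data blocks with the parity block rebuilds the lost block.
surviving = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(surviving)
```

Because every block of the stripe sits on a different disk, any single disk failure removes at most one block of the stripe, and that block is always recoverable from the others.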
Business data growth rates are continuing to increase rapidly, and as a result retention and retrieval requirements for new and existing data are expanding, driving still more data to disk storage. As the amount of disk-based data continues to grow, there is an ever-increasing focus on improving data storage efficiencies across the information infrastructure. Data deduplication is a data reduction technique that consolidates redundant copies of a file or file subcomponent. Incoming or existing data are standardized into “chunks” that are then examined for redundancy. If duplicates are detected, then pointers are shifted to reference a single copy of the chunk and the extraneous duplicates are released. Chunking refers to breaking data down into standardized units that can be examined for duplicates. Depending on the technology and locality of the deduplication process, these units can be files or more granular components such as blocks. Inline deduplication consolidates data before it is written to disk, which prevents duplicate chunks from being written to the same storage unit. For data deduplication that is performed by the storage system to be most effective, blocks having the same content should be steered to the same storage unit.
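The chunk-and-pointer scheme described above can be sketched as follows. This is an illustration only: it assumes fixed-size chunks and SHA-256 fingerprints for duplicate detection, neither of which is specified by the text, and the names CHUNK_SIZE, deduplicate, and reconstruct are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical chunk size in bytes, kept tiny for readability

def deduplicate(data, chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and keep one copy of each distinct chunk."""
    store = {}      # fingerprint -> the single stored copy of the chunk
    pointers = []   # ordered fingerprints that reference the stored chunks
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)  # a duplicate is not stored again
        pointers.append(fingerprint)
    return store, pointers

def reconstruct(store, pointers):
    """Rebuild the original data by following the pointers in order."""
    return b"".join(store[fingerprint] for fingerprint in pointers)

original = b"ABCDABCDEFGHABCD"
store, pointers = deduplicate(original)
```

In this example the input contains four chunks, three of which are identical, so only two chunks are stored while four pointers preserve the original order.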
To support deduplication at the unit of storage, a target storage unit is chosen based on the value of the chunk (symbol, block, stripe, etc.) that is stored, to ensure that if the same symbol is stored more than once, all copies are targeted to the same unit of storage. On the other hand, a typical efficient reliable storage system, one that implements RAID for example, selects target storage units by ensuring that all the symbols in a codeword are stored on different storage units. These two approaches have different goals. Deduplication is aimed at minimizing the amount of data that is stored, whereas redundant storage provides recoverability by adding symbols to a codeword, from which the stored data can be recovered in the case of failure, and by spreading the data across multiple storage units. Contemporary storage systems provide either the storage savings of deduplication or the recoverability of redundant storage, but not both.
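The tension between the two placement rules can be made concrete with a short sketch. It assumes a hash-modulo rule for content-based placement and a simple position-based rule for redundant placement; both rules, the unit count NUM_UNITS, and the function names dedup_target and raid_targets are hypothetical and stand in for whatever policies a real system would use.

```python
import hashlib

NUM_UNITS = 4  # hypothetical number of storage units

def dedup_target(symbol, num_units=NUM_UNITS):
    """Content-based placement: the unit depends only on the symbol's value."""
    digest = hashlib.sha256(symbol).digest()
    return int.from_bytes(digest[:8], "big") % num_units

def raid_targets(codeword, num_units=NUM_UNITS):
    """Position-based placement: symbol i of a codeword goes to unit i."""
    if len(codeword) > num_units:
        raise ValueError("codeword has more symbols than storage units")
    return list(range(len(codeword)))

# A codeword in which the first and third symbols have the same value.
codeword = [b"X", b"Y", b"X"]

content_placement = [dedup_target(symbol) for symbol in codeword]
position_placement = raid_targets(codeword)

# Content-based placement sends both copies of b"X" to one unit, which
# helps deduplication but places two symbols of the codeword together.
# Position-based placement keeps every symbol on a distinct unit, which
# helps recovery but scatters the duplicate values across units.
```

Under the content-based rule a single unit failure could remove two symbols of the same codeword, while under the position-based rule the duplicate symbols never meet on one unit and so cannot be deduplicated there.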
Accordingly, while storage systems are suitable for their intended purpose, the need for improvement remains, particularly in providing storage systems that combine redundant storage and deduplication.