A storage system is a processing system adapted to store and retrieve information/data on storage devices, such as disks or other forms of primary storage. Typically, the storage system includes a storage operating system that implements a file system to organize information into a hierarchical structure of directories and files. Each file typically comprises a set of data blocks, and each directory may be a specially-formatted file in which information about other files and directories is stored.
The storage operating system generally refers to the computer-executable code operable on a storage system that manages data access and access requests (read or write requests requiring input/output operations) and supports file system semantics in implementations involving storage systems. The Data ONTAP® storage operating system, available from NetApp, Inc. of Sunnyvale, Calif., which implements a Write Anywhere File Layout (WAFL®) file system, is an example of such a storage operating system. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows®, or as a general-purpose operating system configured for storage applications.
The storage operating system controls the process of reading and writing data to a storage device, such as a disk drive, a tape drive, a solid state memory device, a virtual memory device, or some other type of system for storing data. In addition to controlling and supporting data access operations, like data reads and writes, the storage operating system can also organize the data that is stored in memory. This organization process can improve access speed, making it faster to read and write data, and can help reduce the cost of storage by using the available storage medium more efficiently.
Despite the introduction of less expensive memory devices, such as Serial Advanced Technology Attachment (SATA) disk drives, one of the biggest challenges for storage systems today continues to be the storage cost. There is a desire to reduce storage consumption and therefore storage cost per megabyte by eliminating duplicate data through sharing blocks across files.
One technology to accomplish this goal is a flexible volume that contains shared data blocks. Within one volume, there is the ability to have multiple references to the same data block. Thus, multiple files can share a stored data block that is common to those files, rather than requiring each file to maintain its own stored copy.
To this end, the storage system may use a deduplication process that reduces duplicate data by analyzing every block in the volume that has stored data. Each block of data is hashed to generate a digital fingerprint. When deduplication runs for the first time on a flexible volume, it creates a fingerprint database that contains a sorted list of all fingerprints for used blocks in the volume. A separate process compares each fingerprint in the database to all other fingerprints of the flexible volume. If two fingerprints are found to be the same, the system typically performs a byte-for-byte comparison of all bytes in the two blocks and, if there is an exact match between the new block and the existing block on the flexible volume, the duplicate block is discarded and its disk space is reclaimed.
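The fingerprint-then-verify flow described above can be illustrated with a minimal sketch. It is not the method of any particular storage operating system: the choice of SHA-256 as the hash, the in-memory dictionary of blocks, and the names `fingerprint` and `find_duplicates` are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Optional


def fingerprint(block: bytes) -> str:
    """Hash a block to a digital fingerprint (SHA-256 is an assumed choice)."""
    return hashlib.sha256(block).hexdigest()


def find_duplicates(blocks: Dict[int, bytes]) -> Dict[int, int]:
    """Return a map from each duplicate block number to the block it duplicates.

    `blocks` is a hypothetical stand-in for the used blocks of a volume,
    keyed by block number.
    """
    # Fingerprint database: a sorted list of (fingerprint, block number),
    # so that equal fingerprints end up next to each other.
    database = sorted((fingerprint(data), number) for number, data in blocks.items())

    duplicates: Dict[int, int] = {}
    kept: List[int] = []          # blocks retained for the current fingerprint
    previous: Optional[str] = None

    for fp, number in database:
        if fp != previous:
            kept = []
            previous = fp
        # Matching fingerprints alone are not trusted: confirm byte for byte.
        for candidate in kept:
            if blocks[candidate] == blocks[number]:
                duplicates[number] = candidate
                break
        else:
            kept.append(number)
    return duplicates
```

Sorting the database lets a single pass find all candidate matches, and the byte-for-byte check guards against two different blocks that happen to produce the same fingerprint.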
To reclaim the block, the pointer that references the duplicate block is updated to reference the already existing data block, and the new (duplicate) data block is released. Releasing a duplicate data block typically entails updating the logical structure that the storage system uses to track where on the physical volume the data is stored. The deduplication process will increment the block reference count for the maintained location, and free the locations of any duplicate data.
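The reclamation step can be sketched in the same spirit. The `Volume` class, its fields, and the `share_block` method below are hypothetical; they model the pointer update and reference counting in memory and do not represent the actual on-disk structures of any file system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Volume:
    """Hypothetical in-memory model of a volume's block bookkeeping."""
    file_blocks: Dict[str, List[int]]   # file name -> block pointers
    refcount: Dict[int, int]            # block number -> reference count
    free: Set[int] = field(default_factory=set)

    def share_block(self, duplicate: int, kept: int) -> None:
        """Redirect pointers from `duplicate` to `kept`, then free `duplicate`."""
        for pointers in self.file_blocks.values():
            for index, number in enumerate(pointers):
                if number == duplicate:
                    pointers[index] = kept          # update the block pointer
                    self.refcount[kept] += 1        # one more reference to the kept block
                    self.refcount[duplicate] -= 1
        # Release the duplicate once nothing references it any longer.
        if self.refcount.get(duplicate) == 0:
            del self.refcount[duplicate]
            self.free.add(duplicate)
```

For instance, if file `b` points at block 2 and block 2 duplicates block 0, calling `share_block(2, 0)` leaves `b` pointing at block 0, raises the reference count of block 0, and places block 2 on the free set.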
Although these deduplication processes provide powerful tools to improve storage capacity, there are barriers to their adoption. One barrier is the amount of user involvement required to determine whether a volume will benefit from deduplication and, once a benefit is known, to set up the volume for deduplication.
Another barrier is the impact on the system of having deduplication enabled. Deduplication is a heavyweight process. It demands substantial system resources and essentially prevents other processes from running efficiently.
The systems and methods described herein address both of these issues. These systems provide ease of management of the deduplication process, and can include the effective use of resources, estimation, and policy-based management.