A storage system is a processing system adapted to store and retrieve information/data on storage devices, such as disks or other forms of persistent storage. The storage system includes a storage operating system that implements a file system to organize information into a hierarchical structure of storage objects, which may be directories or files. These structures organize and track data. For example, each file typically comprises a set of data blocks, and each directory may be a specially-formatted file in which information about other files and directories is stored.
The storage operating system generally refers to the computer-executable code operable on a storage system that manages data access and access requests (read or write requests requiring input/output operations) and supports file system semantics in implementations involving storage systems. The Data ONTAP® storage operating system, available from NetApp, Inc. of Sunnyvale, Calif., which implements a Write Anywhere File Layout (WAFL®) file system, is an example of such a storage operating system implemented as a microkernel within an overall protocol stack and associated storage. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows®, or as a general-purpose operating system configured for storage applications.
Storage is typically provided as one or more storage volumes that comprise physical storage devices, defining an overall logical arrangement of storage space. A storage volume is “loaded” in the storage system by copying the logical organization of the volume's files, data, and directories into the storage system's memory. Once a volume has been loaded in memory, the volume may be “mounted” by one or more users, applications, devices, and the like, that are permitted to access its contents by reading data from and writing data to the storage system.
An application, server or device may “connect” to the storage system over a computer network, such as a shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. Access requests (read or write requests) travel across the network to the storage system for accessing data stored on the storage system.
Optionally, the storage system may implement deduplication methods that remove redundant data from the storage system to ensure that only a single instance of the same data is stored on the storage devices. To this end, the deduplication method stores a single instance of the data, which is then referenced/indexed multiple times. Since redundant data is removed, deduplication of data typically saves storage space. Deduplication typically works by comparing the blocks of a file to be written to the storage devices with the data blocks currently stored in the storage devices. Any matching blocks are deemed redundant blocks and are deduplicated (i.e., are deleted from or not stored to the storage devices, and a reference/index to the address location of the matching stored blocks is produced in their place). Any non-redundant blocks in the received file are written to the storage devices.
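The write path described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory store, a tiny fixed block size, and direct comparison of block contents; the names `DedupStore`, `write_file`, `read_file`, and `BLOCK_SIZE` are illustrative only and are not taken from any particular storage operating system.

```python
BLOCK_SIZE = 4  # hypothetical block size, kept tiny for illustration


class DedupStore:
    """Toy in-memory store that keeps a single instance of each block."""

    def __init__(self):
        self.blocks = []      # stored blocks; list index acts as the address
        self.address_of = {}  # block contents -> address of the stored copy
        self.files = {}       # file name -> list of block addresses

    def write_file(self, name, data):
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            addr = self.address_of.get(block)
            if addr is None:
                # Non-redundant block: write it and remember its address.
                addr = len(self.blocks)
                self.blocks.append(block)
                self.address_of[block] = addr
            # Redundant block: nothing is written, only a reference is kept.
            refs.append(addr)
        self.files[name] = refs

    def read_file(self, name):
        # Follow each reference back to the single stored instance.
        return b"".join(self.blocks[a] for a in self.files[name])


store = DedupStore()
store.write_file("a", b"AAAABBBBAAAA")
store.write_file("b", b"BBBBCCCC")
```

In this sketch, file "a" holds three references but only two blocks are stored for it, and file "b" reuses the already stored `BBBB` block.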
Deduplication may be performed by producing a content identifier value for each block that represents the data contents of the block. For example, the content identifier value of a block may be determined using a fingerprint, checksum, or hash operation (such as Message Digest 5, SHA, etc.) that produces a fingerprint, checksum, or hash value (content identifier value) representing the data contents of the block. Regardless of the particular content identifier operation used, when two blocks have the same content identifier value, there is a high probability that the two blocks have the same data content as well, and thus one block may be deduplicated. Typically, the content identifier of each block may be produced and stored to a content identifier database during a “gathering” phase. For example, during the gathering phase, each block of each file in a file system may be processed to populate the content identifier database. The content identifier database may then be used to identify redundant blocks and deduplicate blocks as necessary.
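A gathering phase of this kind might be sketched as follows. The sketch assumes SHA-256 as the content identifier operation, a plain dictionary as the content identifier database, and a small fixed block size; the function names `content_id`, `gather`, and `find_redundant` are hypothetical. Because a matching identifier indicates only a high probability of matching contents, the sketch also compares the blocks byte for byte before marking one as redundant.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4  # hypothetical block size, kept tiny for illustration


def split_blocks(data, size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_id(block):
    """Content identifier value of a block (SHA-256 is an assumption here)."""
    return hashlib.sha256(block).hexdigest()


def gather(files):
    """Gathering phase: map each content identifier to the block locations
    (file name, block index) that produced it."""
    db = defaultdict(list)
    for name, data in files.items():
        for idx, block in enumerate(split_blocks(data)):
            db[content_id(block)].append((name, idx))
    return db


def find_redundant(db, files):
    """Map each redundant block location to the location of the copy kept."""
    redundant = {}
    for locations in db.values():
        keeper = locations[0]
        keeper_data = split_blocks(files[keeper[0]])[keeper[1]]
        for loc in locations[1:]:
            candidate = split_blocks(files[loc[0]])[loc[1]]
            # Same identifier is only a strong hint; confirm the contents.
            if candidate == keeper_data:
                redundant[loc] = keeper
    return redundant


files = {"a": b"AAAABBBB", "b": b"BBBBCCCC"}
db = gather(files)
redundant = find_redundant(db, files)
```

Here the first block of file "b" is identified as a duplicate of the second block of file "a", so it could be replaced by a reference to that block.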
As helpful as deduplication is, it can also increase read access times. Deduplication, by its nature, disrupts the sequential arrangement of a file's data blocks on a disk. Instead of all the file blocks being neatly arranged one after the other within one track of the disk, a deduplicated file block may point to a physical data block at a location that is several tracks away. As such, deduplication can increase the number of times the disk head must move to a different track, and such head moves cause substantial delay during data access. At some point, the increase in access time can, in practice, outweigh the benefits of deduplication.
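The effect on head movement can be illustrated with a deliberately simplified model. It assumes that consecutive block addresses fill one track before moving to the next, that each track holds a fixed, hypothetical number of blocks, and that every change of track between consecutive reads costs one head move; real disk geometry and caching are ignored. The name `track_changes` and the example addresses are illustrative only.

```python
BLOCKS_PER_TRACK = 8  # hypothetical number of blocks per track


def track_changes(addresses, blocks_per_track=BLOCKS_PER_TRACK):
    """Count how often the head must move to a different track while
    reading a file's blocks in order."""
    tracks = [addr // blocks_per_track for addr in addresses]
    return sum(1 for prev, cur in zip(tracks, tracks[1:]) if prev != cur)


# A file laid out sequentially within a single track.
sequential = [16, 17, 18, 19, 20, 21]

# The same file after two of its blocks were replaced by references to
# matching blocks stored on other tracks.
deduplicated = [16, 3, 18, 42, 20, 21]

sequential_moves = track_changes(sequential)      # 0 head moves
deduplicated_moves = track_changes(deduplicated)  # 4 head moves
```

Under these assumptions, replacing only two of six blocks with remote references raises the number of head moves for a full read from zero to four.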
As such, there is a need for a more efficient method of processing data files such that the benefits of deduplication are less undermined by the burdens of increased access time.