A storage system is a processing system adapted to store and retrieve information/data on storage devices (such as disks). The storage system includes a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on the storage devices. Each file may comprise a set of data blocks, whereas each directory may be implemented as a specially-formatted file in which information about other files and directories is stored.
The storage operating system generally refers to the computer-executable code operable on a storage system that manages data access and access requests (read or write requests requiring input/output operations) and may implement file system semantics in implementations involving storage systems. In this sense, the Data ONTAP® storage operating system, available from NetApp, Inc., of Sunnyvale, Calif., which implements a Write Anywhere File Layout (WAFL®) file system, is an example of such a storage operating system implemented as a microkernel within an overall protocol stack and associated storage. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
A storage system's storage is typically implemented as one or more storage volumes that comprise physical storage devices, defining an overall logical arrangement of storage space. Available storage system implementations can serve a large number of discrete volumes. A storage volume is “loaded” in the storage system by copying the logical organization of the volume's files, data, and directories, into the storage system's memory. Once a volume has been loaded in memory, the volume may be “mounted” by one or more users, applications, devices, and the like, that are permitted to access its contents and navigate its namespace.
A storage system may be configured to allow server systems to access its contents, for example, to read or write data to the storage system. A server system may execute an application that “connects” to the storage system over a computer network, such as a shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. The application executing on the server system may send an access request (read or write request) to the storage system for accessing particular data stored on the storage system.
As described above, the storage system may typically implement large capacity storage devices (such as disk devices) for storing data. For improved response to received read or write requests, however, the storage system may also temporarily store/cache particular data in a smaller cache memory in storage system memory for faster access. The cache memory may comprise a memory device having lower random read-latency than a typical storage device and may thus provide faster data access than a typical large capacity storage device. However, the cache memory may comprise a memory device that is more costly (for a given amount of data storage) than a typical large capacity storage device. Since the storage size of the cache memory is relatively small, data must routinely be removed from the cache memory to make space for new data. The storage system may employ cache replacement algorithms that determine which data to retain in, and which data to remove from, the cache memory.
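By way of illustration only, the following Python sketch shows one possible cache replacement algorithm. The text above does not name a particular algorithm; least-recently-used (LRU) eviction is assumed here simply because it is a common choice. The class name `LRUBlockCache`, its methods, and the capacity of two blocks are hypothetical.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Small least-recently-used cache keyed by block address (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._blocks = OrderedDict()  # address -> data, least recently used first

    def get(self, address):
        if address not in self._blocks:
            return None  # cache miss: the caller would read the storage device
        self._blocks.move_to_end(address)  # mark as most recently used
        return self._blocks[address]

    def put(self, address, data):
        self._blocks[address] = data
        self._blocks.move_to_end(address)
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict the least recently used block


cache = LRUBlockCache(capacity=2)
cache.put(10, b"A")
cache.put(11, b"B")
cache.get(10)        # touch block 10, so block 11 becomes the eviction candidate
cache.put(12, b"C")  # the cache is full, so block 11 is removed
```

After the last call, blocks 10 and 12 remain cached and a lookup of block 11 misses, which is the retain-or-remove decision described above.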
Thus, the storage system may implement a cache memory in the storage system memory to provide faster responses to received read or write requests. In addition, the storage system may implement various methods for saving storage space on the storage system. For example, the storage system may also implement deduplication methods when storing data on the storage devices. Deduplication methods may be used to remove redundant data and to ensure that only a single instance of the same data is stored on the storage devices. Rather than storing multiple copies of the same data on the storage devices, a single instance of the data is typically stored and referenced/indexed multiple times. Since redundant data is removed, deduplication of data typically saves storage space.
However, indiscriminate deduplication of data may cause longer read latencies when reading data that has been deduplicated. For example, when a file to be written to the storage devices is received, any blocks of the received file that match any blocks currently stored in the storage devices are typically considered redundant blocks and are deduplicated (i.e., are deleted from or not stored to the storage devices and a reference/index to the address location of the matching stored blocks is produced in their place). Any non-redundant blocks in the received file are written to the storage devices. When a read request for the received file is later received, the storage system performs the read request by retrieving the stored non-redundant blocks and, for each redundant block, uses the reference/index produced for the redundant block to seek and retrieve its matching stored block.
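The write and read paths described above may be sketched as follows. This is a minimal illustration under stated assumptions, not an actual storage operating system implementation: the 4-byte block size, the SHA-256 fingerprint index, and the names `DedupStore`, `write_file`, and `read_file` are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical and deliberately tiny so the example is readable


class DedupStore:
    """Stores each distinct block once and records per-file block references."""

    def __init__(self):
        self.blocks = []  # address -> block data (stands in for the storage devices)
        self.index = {}   # block fingerprint -> address of the stored block
        self.files = {}   # file name -> list of block addresses (references)

    def write_file(self, name, data):
        addresses = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).digest()
            address = self.index.get(fingerprint)
            if address is not None and self.blocks[address] == block:
                # Redundant block: store nothing, keep a reference instead.
                addresses.append(address)
            else:
                # Non-redundant block: write it and remember where it lives.
                address = len(self.blocks)
                self.blocks.append(block)
                self.index.setdefault(fingerprint, address)
                addresses.append(address)
        self.files[name] = addresses

    def read_file(self, name):
        # Follow each reference to the stored block it points to.
        return b"".join(self.blocks[a] for a in self.files[name])


store = DedupStore()
store.write_file("a", b"AAAABBBBCCCC")
store.write_file("b", b"XXXXBBBBYYYY")  # the middle block matches a stored block
```

In this sketch, file "b" is recorded as the references `[3, 1, 4]`: its middle block is not stored again but points at address 1, which was written for file "a". Five blocks are stored in total instead of six.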
However, when the storage devices comprise disk devices, the matching stored blocks may be written on particular tracks of a platter of the disk device, whereas the non-redundant blocks of the received file are typically written on different tracks of the disk device. When reading blocks from the same track, a read/write head of the disk device typically exhibits low latency times as it may quickly retrieve the blocks sequentially from the same track. When reading blocks from different tracks, however, a read/write head of the disk device incurs significant seek times each time it repositions onto a different track to retrieve a block of data.
If indiscriminate deduplication of data is performed on a single-block basis (whereby each individual block found to be redundant is deduplicated), later reading of the received file may incur significant read latency if the read/write head frequently seeks and retrieves single blocks stored on different tracks. For example, later reading of the received file may comprise retrieving non-redundant blocks on a first track, seeking and retrieving a single matching stored block on a second track, then seeking and retrieving non-redundant blocks on the first track, then seeking and retrieving a single matching stored block on the second track, etc. As such, use of deduplication on a single-block basis on a disk device may later cause significant read latency as the read/write head of the disk device repositions back and forth between different tracks to seek and retrieve single matching blocks.
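The effect may be illustrated with a simplified model in which the only cost considered is the number of times the read/write head moves to a different track. The function `count_seeks` and the track numbers below are hypothetical, and the model ignores rotational delay and the initial positioning of the head.

```python
def count_seeks(tracks):
    """Count how many times the head must reposition onto a different track."""
    seeks = 0
    current = None
    for track in tracks:
        if current is not None and track != current:
            seeks += 1
        current = track
    return seeks


# Track holding each block of the file, listed in read order.
single_block_dedup = [1, 2, 1, 2, 1, 2]  # alternating between two tracks
no_dedup = [1, 1, 1, 1, 1, 1]            # every block on the same track
grouped_dedup = [1, 1, 1, 2, 2, 2]       # matching blocks form one sequential run

seeks_single = count_seeks(single_block_dedup)
seeks_none = count_seeks(no_dedup)
seeks_grouped = count_seeks(grouped_dedup)
```

Under this model, the alternating layout needs five repositionings to read six blocks, the same-track layout needs none, and the grouped layout needs one.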
Deduplication methods have been developed to avoid the indiscriminate deduplication of data that increases read latencies. For example, some deduplication methods may require a predetermined threshold number (THN) of sequential redundant blocks before deduplication is performed. Such deduplication methods may avoid the significant read latency incurred by indiscriminate deduplication.
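A threshold-based approach of this kind may be sketched as follows. The exact matching rule is not specified above, so this sketch assumes that a run qualifies only when its blocks match stored blocks located at consecutive addresses. The function `plan_dedup`, the `stored_index` mapping, and the sample values are hypothetical.

```python
def plan_dedup(incoming, stored_index, thn):
    """Return one decision per incoming block: ("ref", address) or ("write", block).

    stored_index maps block content to the address of a matching stored block.
    A run of matching blocks is deduplicated only if it is at least thn blocks long.
    """
    plan = []
    i = 0
    n = len(incoming)
    while i < n:
        start = stored_index.get(incoming[i])
        if start is None:
            plan.append(("write", incoming[i]))  # non-redundant block
            i += 1
            continue
        # Extend the run while the next block matches the next stored address.
        j = i + 1
        while j < n and stored_index.get(incoming[j]) == start + (j - i):
            j += 1
        run_length = j - i
        if run_length >= thn:
            plan.extend(("ref", start + k) for k in range(run_length))
        else:
            # Run too short: write the blocks despite the match.
            plan.extend(("write", block) for block in incoming[i:j])
        i = j
    return plan


stored_index = {"A": 0, "B": 1, "C": 2, "D": 7}
incoming = ["X", "A", "B", "C", "Y", "D", "Z"]
plan = plan_dedup(incoming, stored_index, thn=2)
```

With a threshold of two, the three-block run "A", "B", "C" is replaced by references to addresses 0 through 2, whereas the isolated match "D" is written again so that reading the file does not require a seek for a single block.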
If any data blocks are deduplicated on the storage devices, the same data blocks are also typically deduplicated in the cache memory of the storage system. When deduplicating data blocks in a cache memory, only a single instance of redundant blocks may be stored in the cache memory. Deduplication of data blocks in the cache memory may similarly provide storage savings in the cache memory. Since the storage size of the cache memory is relatively small, any storage savings realized in the cache memory are particularly beneficial. Typically, however, data blocks in cache memory are deduplicated based only on the deduplication of data blocks on the storage devices, and further deduplication processing of the data blocks in cache memory is not performed. As such, further deduplication of data blocks and further storage savings in the cache memory are not realized by conventional deduplication methods.
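The limitation may be illustrated with a simplified model in which the cache is keyed by storage device address, so that it inherits only the deduplication already performed on the storage devices. The addresses, file layouts, and the function `cached_read` are hypothetical.

```python
def cached_read(cache, disk, address):
    """Read a block through a cache that is keyed by storage device address."""
    if address not in cache:
        cache[address] = disk[address]  # cache miss: copy the block into the cache
    return cache[address]


# Address 100 is shared by both files (deduplicated on the storage devices).
# Addresses 200 and 300 hold identical data that was NOT deduplicated there.
disk = {100: b"shared", 200: b"dup", 300: b"dup"}
file_a = [100, 200]
file_b = [100, 300]

cache = {}
for address in file_a + file_b:
    cached_read(cache, disk, address)

entries_in_cache = len(cache)
distinct_contents = len(set(cache.values()))
```

After both files are read, the cache holds three entries but only two distinct block contents: the block shared on the storage devices is cached once, while the identical blocks at addresses 200 and 300 each occupy their own cache entry.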