A storage system is a processing system adapted to store and retrieve information/data on storage devices (such as disks). The storage system includes a storage operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on the storage devices. Each file may comprise a set of data blocks, whereas each directory may be implemented as a specially formatted file in which information about other files and directories is stored.
The storage operating system generally refers to the computer-executable code operable on a storage system that manages data access and access requests (read or write requests requiring input/output operations) and may implement file system semantics in implementations involving storage systems. In this sense, the Data ONTAP® storage operating system, available from Network Appliance, Inc. of Sunnyvale, Calif., which implements a Write Anywhere File Layout (WAFL®) file system, is an example of such a storage operating system implemented as a microkernel within an overall protocol stack and associated storage. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
A storage system's storage is typically implemented as one or more storage volumes that comprise physical storage devices, defining an overall logical arrangement of storage space. Available storage system implementations can serve a large number of discrete volumes. A storage volume is “loaded” in the storage system by copying the logical organization of the volume's files, data, and directories into the storage system's memory. Once a volume has been loaded in memory, the volume may be “mounted” by one or more users, applications, devices, and the like, that are permitted to access its contents and navigate its namespace.
A storage system may be configured to allow server systems to access its contents, for example, to read or write data to the storage system. A server system may execute an application that “connects” to the storage system over a computer network, such as a shared local area network (LAN), wide area network (WAN), or virtual private network (VPN) implemented over a public network such as the Internet. The application executing on the server system may send an access request (read or write request) to the storage system for accessing particular data stored on the storage system.
The storage system may implement deduplication methods when storing data on the storage devices. Deduplication methods may be used to remove redundant data and to ensure that only a single instance of the same data is stored on the storage devices. Rather than storing multiple copies of the same data on the storage devices, a single instance of the data is typically stored and referenced/indexed multiple times. Since redundant data is removed, deduplication of data typically saves storage space.
Deduplication of data, however, may also cause longer read latencies when reading data that has been deduplicated (e.g., as compared to performing sequential read accesses on a file that has not been deduplicated). For example, when a file to be written to the storage devices is received, any blocks of the received file that match any blocks currently stored in the storage devices are typically considered redundant blocks and are deduplicated (i.e., are deleted from or not stored to the storage devices and a reference/index to the address location of the matching stored blocks is produced in their place). Any non-redundant blocks in the received file are written to the storage devices. When a read request for the received file is later received, the storage system performs the read request by retrieving the stored non-redundant blocks and, for each redundant block, uses the reference/index produced for the redundant block to seek and retrieve its matching stored block.
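The write and read paths described above can be illustrated with a minimal sketch. It is not the implementation of any particular storage operating system. The class name `DedupStore`, the 4096-byte block size, the in-memory list standing in for the storage devices, and the use of SHA-256 fingerprints to find matching blocks are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size in bytes (hypothetical)


class DedupStore:
    """Toy single-block deduplicating store (illustrative only)."""

    def __init__(self):
        self.blocks = []   # simulated storage devices: address -> block bytes
        self.index = {}    # block fingerprint -> address of the stored block
        self.files = {}    # file name -> list of block addresses (references)

    def write_file(self, name, data):
        """Store a file, writing only blocks that are not already stored."""
        addresses = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fingerprint = hashlib.sha256(block).digest()
            address = self.index.get(fingerprint)
            # A block is redundant only if a byte-identical block is stored.
            if address is None or self.blocks[address] != block:
                # Non-redundant block: write it to the storage devices.
                address = len(self.blocks)
                self.blocks.append(block)
                self.index[fingerprint] = address
            # Redundant or not, the file keeps a reference to the address.
            addresses.append(address)
        self.files[name] = addresses

    def read_file(self, name):
        """Rebuild a file by following its block references in order."""
        return b"".join(self.blocks[a] for a in self.files[name])
```

In this sketch, writing a second file that shares blocks with an earlier file adds only the new blocks to `blocks`, while the second file's address list points back at the earlier copies. A later read must therefore visit addresses that were written at different times and may not be adjacent.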
However, when the storage devices comprise disk devices, the matching stored blocks may be written on particular tracks of a platter of the disk device, whereas the non-redundant blocks of the received file are typically written on different tracks of the disk device. When reading blocks from the same track, a read/write head of the disk device typically exhibits low latency times as it may quickly retrieve the blocks sequentially from the same track. When reading blocks from different tracks, however, a read/write head of the disk device incurs significant seek times each time it repositions onto a different track to retrieve a block of data.
Since deduplication of data is typically performed on a single-block basis (whereby each individual block found to be redundant is deduplicated), later reading of the received file may incur significant read latency if the read/write head frequently seeks and retrieves single blocks stored on different tracks. For example, later reading of the received file may comprise retrieving non-redundant blocks on a first track, seeking and retrieving a single matching stored block on a second track, then seeking and retrieving non-redundant blocks on the first track, then seeking and retrieving a single matching stored block on the second track, etc.
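The access pattern in this example can be made concrete by counting how often the head must change tracks. The following is a simplified sketch: the function name `count_seeks` is hypothetical, each block is assumed to reside on exactly one track, blocks are assumed to be read strictly in file order, and the initial positioning of the head is not counted.

```python
def count_seeks(block_tracks):
    """Count head repositionings when reading blocks in file order.

    block_tracks lists the track number holding each block of the file,
    in the order the blocks are read (a simplifying assumption).
    """
    seeks = 0
    current = None
    for track in block_tracks:
        # A seek is counted whenever the next block is on a different track.
        if current is not None and track != current:
            seeks += 1
        current = track
    return seeks


# File stored without deduplication: all eight blocks on track 1.
sequential = [1, 1, 1, 1, 1, 1, 1, 1]

# Same file after single-block deduplication: every third block is a
# reference to a matching block already stored on track 2.
deduplicated = [1, 1, 2, 1, 1, 2, 1, 1]
```

Under these assumptions, `count_seeks(sequential)` is 0, whereas `count_seeks(deduplicated)` is 4, because each single deduplicated block forces one seek to the second track and one seek back.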
As such, conventional use of deduplication on a single-block basis on a disk device may later cause significant read latency as the read/write head of the disk device repositions back and forth between different tracks to seek and retrieve single matching blocks. There is therefore a need for a method and apparatus for utilizing deduplication of data on disk devices that mitigates the later read latency of the data.