A scale out storage system comprises a plurality of nodes connected by a network. Each node is equipped with a processor, a memory, and a number of storage devices. The storage devices may be hard disk drives (HDDs), solid-state drives (SSDs), or a combination of both (hybrid). The storage devices may be configured under hardware or software RAID (Redundant Array of Inexpensive Disks) for data redundancy and load balancing. The storage devices may be local to each node or shared among multiple nodes. The processor may be dedicated to running storage software or shared between storage software and user applications. Storage software, such as a logical volume manager, provides storage virtualization, capacity reduction, scale out, high availability, mobility, and performance.
Storage virtualization decouples the logical devices addressed by user applications from the physical data placement on the storage devices. Storage virtualization allows the processor to optimize physical data placement based on the characteristics of the storage devices and to provide capacity reduction such as data deduplication. User applications address a logical device by its Logical Unit Number (LUN). A logical data block associated with a logical device is identified by a logical block number (LBN). Thus, a complete logical address for a logical data block comprises the LUN of the logical device and the LBN of the logical block. To support storage virtualization, the processor translates each user I/O request addressed to a LUN and an LBN into a set of I/O requests addressed to storage device IDs and physical block numbers (PBNs). That is, the software translates the logical addresses of the logical data blocks into corresponding physical addresses for the physical data blocks stored in the data storage devices. In some storage software implementations, in order to perform this translation, the processor maintains forward map metadata that maps each data block's LBN to its PBN. To support data deduplication, the processor maintains deduplication metadata that maps each data block's fingerprint (a hash of the block's contents) to its PBN. Additional metadata may be maintained in support of other data services such as compression and snapshot.
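The translation and deduplication metadata described above can be illustrated with a minimal Python sketch. It is not taken from any particular storage software: the class name `VirtualizedStore`, the in-memory dictionaries, the use of SHA-256 as the fingerprint, and the reduction of a physical address to a single integer PBN (with no storage device ID) are all assumptions made for illustration. Reference counting and space reclamation on overwrite are deliberately left out.

```python
import hashlib


class VirtualizedStore:
    """Toy model of forward map and deduplication metadata (hypothetical)."""

    def __init__(self):
        self.forward_map = {}  # (lun, lbn) -> pbn
        self.dedup_map = {}    # fingerprint -> pbn
        self.physical = {}     # pbn -> block contents (stand-in for devices)
        self.next_pbn = 0      # naive allocator: assumption, not from the text

    def write(self, lun, lbn, data):
        # Fingerprint = hash of the block's contents (SHA-256 is an assumption).
        fp = hashlib.sha256(data).hexdigest()
        pbn = self.dedup_map.get(fp)
        if pbn is None:
            # Unseen contents: allocate a new physical block and record it.
            pbn = self.next_pbn
            self.next_pbn += 1
            self.physical[pbn] = data
            self.dedup_map[fp] = pbn
        # Point the logical address at the (possibly shared) physical block.
        self.forward_map[(lun, lbn)] = pbn
        return pbn

    def read(self, lun, lbn):
        # Translate the logical address to a physical one, then fetch.
        pbn = self.forward_map[(lun, lbn)]
        return self.physical[pbn]


store = VirtualizedStore()
first = store.write(lun=0, lbn=10, data=b"x" * 4096)
second = store.write(lun=1, lbn=99, data=b"x" * 4096)  # same contents
```

In this sketch the two writes carry identical contents, so both logical addresses end up referencing the same PBN and only one physical block is stored.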
A data block is the smallest storage unit that the processor manages via its metadata. The size of the data block can be as small as 4 KB or as large as an entire volume. There are advantages in employing small data block sizes in order to optimize data placement and increase the deduplication ratio. The size of the forward map metadata is determined by the data block size and the usable capacity of the storage system. On a small capacity storage system with a large data block size, the entire metadata may be small enough to be cached in the memory for fast access and stored persistently on the storage devices. However, metadata keeps growing, driven by larger physical capacity and smaller data block sizes. Data services such as deduplication, compression, and snapshot also increase the metadata size many times over by increasing the usable capacity of the system. In the case where the memory is not large enough to cache the entire metadata, the metadata is stored persistently on the storage devices, with a portion of it cached in the memory. Caching is effective only when metadata access has locality of reference; real-world user applications tend to access related logical device addresses frequently. User application locality of reference allows the processor to cache frequently accessed metadata entries in the memory without significant loss of performance. Without user application locality of reference, caching devolves into thrashing, which exhausts system resources and degrades performance.
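To make the dependence on block size and usable capacity concrete, the following Python sketch estimates the forward map size under the assumption of one fixed-size entry per data block. The 16-byte entry size and the function name `forward_map_size_bytes` are illustrative assumptions; real entry layouts vary by implementation.

```python
KIB = 1024
MIB = 1024 ** 2
TIB = 1024 ** 4


def forward_map_size_bytes(usable_capacity_bytes, block_size_bytes,
                           entry_size_bytes=16):
    """Estimate forward map size, assuming one entry per data block.

    entry_size_bytes=16 is an assumed figure, not taken from the text.
    """
    num_blocks = usable_capacity_bytes // block_size_bytes
    return num_blocks * entry_size_bytes


# Small system, large blocks: 10 TiB usable capacity, 1 MiB blocks.
small = forward_map_size_bytes(10 * TIB, 1 * MIB)

# Large system, small blocks: 1024 TiB usable capacity, 4 KiB blocks.
large = forward_map_size_bytes(1024 * TIB, 4 * KIB)

print(small // MIB, "MiB")  # 160 MiB
print(large // TIB, "TiB")  # 4 TiB
```

Under these assumptions the first configuration needs about 160 MiB of forward map metadata, which fits comfortably in memory, while the second needs about 4 TiB, which typically does not.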
The ability to scale out is a key requirement for a scale out storage system. One example of scale out is add-a-node, where a new node is added to the storage system to provide more storage capacity and performance. Another example of scale out is remove-a-node, where an existing node is removed from the storage system. In both cases, a large number of data blocks need to be moved from their current physical locations to new locations in order to redistribute data blocks across all available capacity and bandwidth. Scale out is expected to be transparent to user applications: a change in a data block's physical location should not affect the LUN/LBN by which user applications address it. In some storage software implementations, the processor maintains reverse map metadata that maps every physical data block's PBN to the LBNs that reference it. As part of moving a data block from PBN1 to PBN2, the processor first looks up PBN1 in the reverse map metadata to identify all the LBNs that reference PBN1. It then looks up these LBNs in the forward map metadata and changes their reference from PBN1 to PBN2. The processor then goes back to the reverse map metadata to change PBN1 to PBN2. If deduplication is enabled, the processor determines the fingerprint of the data block and updates the fingerprint's entry in the deduplication metadata from referencing PBN1 to referencing PBN2. Because this data movement in support of scale out is not originated by a user application, it does not benefit from user application locality of reference. As a result, these numerous accesses to reverse map, forward map, and deduplication metadata cannot be effectively cached in the memory, causing the system to thrash.
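The sequence of metadata updates involved in moving one data block can be sketched as follows. This is an illustrative Python model only: the function name `move_block`, the dictionary-based maps, SHA-256 as the fingerprint, and integer PBNs without device IDs are assumptions. The returned count of metadata accesses is added here simply to show how many separate lookups and updates a single move touches.

```python
import hashlib


def fingerprint(data):
    # Hash of the block's contents (SHA-256 is an assumed choice).
    return hashlib.sha256(data).hexdigest()


def move_block(forward_map, reverse_map, dedup_map, blocks,
               pbn1, pbn2, dedup_enabled=True):
    """Move one block from pbn1 to pbn2; return the metadata access count."""
    accesses = 0

    # Relocate the data itself (stand-in for a device-to-device copy).
    blocks[pbn2] = blocks.pop(pbn1)

    # 1. Reverse map lookup: which logical addresses reference pbn1?
    referrers = reverse_map[pbn1]
    accesses += 1

    # 2. Forward map: repoint every referencing logical address to pbn2.
    for logical_addr in referrers:
        forward_map[logical_addr] = pbn2
        accesses += 1

    # 3. Reverse map: re-key the entry from pbn1 to pbn2.
    reverse_map[pbn2] = reverse_map.pop(pbn1)
    accesses += 1

    # 4. Deduplication metadata: fingerprint now references pbn2.
    if dedup_enabled:
        dedup_map[fingerprint(blocks[pbn2])] = pbn2
        accesses += 1

    return accesses


data = b"a" * 4096
forward_map = {(0, 1): 7, (2, 5): 7}   # (lun, lbn) -> pbn
reverse_map = {7: {(0, 1), (2, 5)}}    # pbn -> referencing (lun, lbn) set
dedup_map = {fingerprint(data): 7}     # fingerprint -> pbn
blocks = {7: data}                     # pbn -> contents

touched = move_block(forward_map, reverse_map, dedup_map, blocks, 7, 42)
```

Even in this toy case, moving a single block shared by two logical addresses touches three different metadata structures, and none of those accesses is correlated with what user applications happen to be reading or writing at the time.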
Logical device availability refers to making a logical device available on node B in the event that its original host node A fails. Logical device mobility refers to moving a logical device from node A to node B for load balancing. Both logical device availability and mobility can be measured by time to access and time to performance. Time to access is defined as the time it takes for the logical device to support the first user I/O on node B. Time to performance is defined as the time it takes for the logical device to restore its original performance. For storage software implementations that support storage virtualization through forward map metadata, time to access is relatively long because the forward map metadata needs to be moved from node A to node B.
Providing high performance is challenging for a scale out storage system because data blocks are distributed across multiple nodes and remote access incurs network latency. Some storage software implementations try to mitigate this network latency by placing most of the data blocks referenced by a logical device on the same node as the logical device, an approach known as data locality. Data locality poses a number of issues. First, logical devices are often not load balanced themselves across the plurality of nodes, leading to unbalanced data block placement in terms of capacity and performance. Second, in the event that a logical device is moved, most of its data blocks need to be moved to the new node, resulting in a long time to performance.
In view of the above, there is a need for more efficient metadata management in support of storage virtualization, capacity reduction, scale out, high availability, mobility, and performance.