Network-based storage, or simply “network storage”, is a common approach to backing up data, making large amounts of data accessible to multiple users, and serving other purposes. In a network storage environment, a storage server makes data available to client (host) systems by presenting or exporting to the clients one or more logical containers of data. There are various forms of network storage, including network attached storage (NAS) and storage area network (SAN). In a NAS context, a storage server services file-level requests from clients, whereas in a SAN context a storage server services block-level requests. Some storage servers are capable of servicing both file-level requests and block-level requests.
In a large-scale storage system, such as an enterprise storage network, it is common for certain data to be stored in multiple places in the storage system. Sometimes this duplication is intentional, but often it is an incidental result of normal operation of the storage system. It is therefore common for a given sequence of data to be part of two or more different files, for example. “Data duplication”, as the term is used herein, generally refers to unintentional duplication of data in a given storage device or system. Data duplication generally is not desirable, because storage of the same data in multiple places consumes extra storage space, which is a valuable and limited resource.
Consequently, storage servers in many large-scale storage systems have the ability to “deduplicate” data, which is the ability to identify and remove data duplication. In one known approach to deduplication, any extra copies of a given sequence of data are deleted, and any references (e.g., pointers) to those duplicate sequences are modified to refer to the one remaining instance of that sequence. A result of this process is that a given sequence of data may end up being shared by two or more files (or other types of logical data containers).
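The approach just described can be illustrated with a minimal Python sketch. It is not taken from any particular storage server: the function names, the fixed 4 KB segment size, and the use of a SHA-256 digest as the segment's identity are assumptions made for illustration, and the sketch treats equal digests as equal content.

```python
import hashlib


def deduplicate(files, segment_size=4096):
    """Keep one stored copy of each distinct fixed-size segment.

    files maps a file name to its bytes. Returns (store, refs), where
    store maps a digest to the single retained segment and refs maps
    each file name to the ordered digests needed to rebuild that file.
    Assumption: equal SHA-256 digests are treated as equal content.
    """
    store = {}
    refs = {}
    for name, data in files.items():
        digests = []
        for offset in range(0, len(data), segment_size):
            segment = data[offset:offset + segment_size]
            digest = hashlib.sha256(segment).digest()
            # A duplicate segment is not stored again; the file only
            # records a reference to the one remaining instance.
            store.setdefault(digest, segment)
            digests.append(digest)
        refs[name] = digests
    return store, refs


def rebuild(store, digests):
    """Reassemble a file from its segment references."""
    return b"".join(store[d] for d in digests)


files = {
    "a": b"x" * 8192,                # two identical segments
    "b": b"x" * 4096 + b"y" * 4096,  # shares one segment with "a"
}
store, refs = deduplicate(files)
```

In this example four logical segments are written but only two distinct segments are retained, and the segment of repeated `x` bytes ends up shared by both files.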
Deduplication is closely related to another technique used to reduce the physical storage needed for a given amount of data, namely data compression (also known as file compression). The primary difference between data compression and deduplication is the scale at which they operate. Data compression typically operates by removing duplicate data sequences as short as two bytes, while deduplication typically operates on duplicate sequences of length 1 KB to 4 KB. At the smaller scale of data compression, the size of the metadata needed to reconstruct the duplicate data sequences becomes the dominant factor in its effectiveness. Advanced techniques such as arithmetic coding are generally needed to reduce the size of the metadata. In data deduplication, on the other hand, the metadata is so much smaller than the eliminated duplicate data that it does not significantly reduce space efficiency.
The main consideration in choosing the size (or average size) of the data segments in a deduplication method is the resulting deduplication ratio, which is the percentage reduction in physical storage requirements. The space used by the deduplication metadata is overhead that reduces the deduplication ratio. Usually, the size of the metadata is proportional to the number of data segments. This means that smaller segments, which result in a larger number of segments being required, cause a corresponding increase in the metadata. On the other hand, choosing a smaller segment size results in more duplication being found and eliminated. A balance must be struck, therefore, between two extremes: 1) choosing segment sizes so small that the large number of segments causes the metadata size to significantly reduce the deduplication ratio, and 2) choosing segment sizes so large that large amounts of duplication are not detected, also reducing the deduplication ratio.
It has been observed that deduplication segment sizes in the range of 1 KB to 4 KB tend to provide the highest deduplication ratios, given typical metadata implementations that require a fixed number of bytes per segment, usually around 15 to 30. The implications of this observation are important when deduplicating data across a storage server cluster.
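The metadata side of this trade-off can be made concrete with a short calculation. The following Python sketch is only a simplified model: it assumes that a fixed number of metadata bytes is kept for every logical segment, and the function names and the 50% duplicate fraction in the example are hypothetical inputs, not measured values.

```python
def metadata_overhead(segment_bytes, metadata_bytes_per_segment):
    """Metadata size as a fraction of the data it describes."""
    return metadata_bytes_per_segment / segment_bytes


def net_reduction(duplicate_fraction, segment_bytes, metadata_bytes_per_segment):
    """Net fractional saving in physical storage.

    duplicate_fraction is the fraction of logical data eliminated as
    duplicates at this segment size (an input to the model, not derived).
    Assumption: metadata is kept for every logical segment.
    """
    return duplicate_fraction - metadata_overhead(
        segment_bytes, metadata_bytes_per_segment
    )


worst_case = metadata_overhead(1024, 30)  # 1 KB segments, 30 bytes each
best_case = metadata_overhead(4096, 15)   # 4 KB segments, 15 bytes each
example = net_reduction(0.5, 1024, 30)    # hypothetical 50% duplication
```

Under these assumptions the metadata costs roughly 2.9% of the data size at 1 KB segments with 30 bytes of metadata per segment, and roughly 0.4% at 4 KB segments with 15 bytes per segment. The overhead rises in inverse proportion to the segment size, which is why much smaller segments would begin to erode the deduplication ratio.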
A storage server cluster is a storage system formed from a set of cooperating storage server nodes connected by a switching fabric. This type of configuration can be used to increase storage capacity, performance, or reliability. Often the goal of such a system is to provide a seamless collection of physical storage in which the end users need not be concerned with which node contains any particular piece of data. This property is called location transparency. Within the context of such a system, conventional deduplication techniques can be used to remove duplication that occurs within any one node of the system. However, it would improve the storage efficiency of the system to remove or avoid duplication that occurs between different nodes of the cluster as well.
One way to achieve this is to arrange that a newly written data segment that is identical to an existing data segment be assigned to the same node as the existing one, so that they can be deduplicated using conventional techniques. The best known way to achieve this is to route data segments to nodes according to their content, so that a second data segment with the same content will be sent to the same node as the first copy. This usually involves the use of a mechanism that hashes the content of the segment to a node identifier.
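A content-based routing mechanism of this kind might look like the following Python sketch. The function name, the choice of SHA-256, and the simple modulo mapping onto node identifiers are assumptions made for illustration; an actual cluster could use a different hash and a different mapping.

```python
import hashlib


def node_for_segment(segment: bytes, num_nodes: int) -> int:
    """Map a segment's content to a node identifier in [0, num_nodes).

    Segments with identical content always map to the same node, so a
    later copy lands where the first copy is stored and can be removed
    there by conventional single-node deduplication.
    """
    digest = hashlib.sha256(segment).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes


first = node_for_segment(b"A" * 1024, 100)
second = node_for_segment(b"A" * 1024, 100)  # same content, written later
```

Because the node is a function of the content alone, no coordination between nodes is needed at write time. One consequence of the modulo mapping assumed here is that changing the number of nodes changes the node assigned to most segments.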
A related problem, in a location-transparent storage cluster, is identifying the node that has a particular data segment requested by a read operation. A read operation will request data by its logical position within the overall data set (its address), not by its content or its node location. Therefore, a hash function that maps data content to a node number is not useful in locating data. Instead, a database can be maintained that maps from a logical data segment address to the node that contains that data segment. The size of this database can be significant, as will now be explained.
The desire for small data segments to improve the deduplication ratio conflicts with the desire for an efficient method to find the node location of each data segment. The problem is one of scale. Current storage clusters might have hundreds of nodes, each containing several terabytes of storage. Such a cluster may thus have a capacity of many petabytes of storage. With a segment size of 1 KB, the storage cluster could contain several trillion data segments. Given these numbers, a database that maps data segment addresses to node numbers would have to be enormous, many times larger than the maximum amount of RAM that can be installed in today's servers.
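The scale of such a database can be estimated with a short calculation. The figures in the following Python sketch are illustrative assumptions rather than values from any specific cluster: a capacity of 4 PiB, 1 KiB segments, and the most compact layout imaginable, namely a dense array indexed by segment address that holds only a 2-byte node number per entry.

```python
def location_map_size(capacity_bytes, segment_bytes, entry_bytes):
    """Return (segment count, map size in bytes) for an address-to-node map.

    Assumption: one fixed-size entry per segment and nothing else, so
    the result is a lower bound for this layout.
    """
    segments = capacity_bytes // segment_bytes
    return segments, segments * entry_bytes


PIB = 2**50
TIB = 2**40

# Hypothetical cluster: 4 PiB of storage, 1 KiB segments, 2-byte node numbers.
segments, map_bytes = location_map_size(4 * PIB, 1024, 2)
map_tib = map_bytes / TIB
```

Under these assumptions the cluster holds about 4.4 trillion segments and the map alone occupies 8 TiB, before any indexing structure or per-entry bookkeeping is added.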
Another problem with the use of small segments is the inefficiency of having to gather data from multiple nodes when servicing a read request. If, for example, each 1 KB segment is stored on a different node of the cluster, this implies that a 16 KB read operation requires communicating with up to 16 different nodes. Having to read segments from multiple nodes to service a given read request can cause significant network latency and bandwidth usage within a cluster.
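The fan-out of a read can be illustrated with the following Python sketch. The helper name and the `node_of_segment` lookup are hypothetical; the lookup stands in for whatever database maps a segment address to the node that holds it, and the round-robin placement in the example is chosen only to show the worst case.

```python
def nodes_for_read(offset, length, segment_size, node_of_segment):
    """Return the set of nodes that must be contacted to serve a read.

    node_of_segment is a callable mapping a segment index to a node
    number (hypothetical stand-in for the location database).
    """
    first = offset // segment_size
    last = (offset + length - 1) // segment_size
    return {node_of_segment(index) for index in range(first, last + 1)}


def spread(index):
    """Worst-case placement: consecutive segments sit on different nodes."""
    return index % 100


small_segments = nodes_for_read(0, 16 * 1024, 1024, spread)
large_segments = nodes_for_read(0, 16 * 1024, 16 * 1024, spread)
```

With 1 KB segments placed this way, the 16 KB read touches 16 nodes, whereas the same aligned read over a single 16 KB segment touches one.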