Many modern storage systems are based on a technique called “content-addressed storage”. The general principle of content-addressed storage is that the content of an item (which may be a file, an object, part of an object, or a block of data) being stored in a data storage is used to define the logical address in the storage at which that item is stored. In such systems, logical addresses are typically generated by using a “hash function” to process item content in order to produce the logical address.
A hash function is a deterministic function that maps a digital bit string in which the bits have arbitrary values and the string has an arbitrary length to another digital bit string with a fixed length. Because the output string of a hash function has a fixed size and the number of possible input strings is unlimited, it is inevitable that at least two different input strings exist that produce the same output string when processed by the hash function. Two such input strings are said to “collide” and a “collision” is found when two such input strings are identified.
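As an illustration only, the following Python sketch shows the fixed-length output of a hash function and why collisions are inevitable. SHA-256 from the standard `hashlib` module is an arbitrary choice, and truncating the output to 16 bits is an assumption made purely so that a collision can be found quickly. The helper names `fingerprint` and `find_truncated_collision` are hypothetical.

```python
import hashlib
from itertools import count


def fingerprint(data: bytes) -> bytes:
    """Map a byte string of any length to a fixed 32-byte output (SHA-256)."""
    return hashlib.sha256(data).digest()


def find_truncated_collision(n_bytes: int = 2) -> tuple[bytes, bytes]:
    """Find two different inputs whose digests agree in the first n_bytes.

    With the default 16-bit output there are only 65,536 possible values,
    so a collision must appear within the first 65,537 distinct inputs.
    """
    seen: dict[bytes, bytes] = {}
    for i in count():
        message = str(i).encode()
        short = fingerprint(message)[:n_bytes]
        if short in seen:
            return seen[short], message
        seen[short] = message


first, second = find_truncated_collision()
print(first, second)  # two different inputs with the same 16-bit output
```

The same pigeonhole argument applies to the full-length output. The only difference is that the search becomes intractable when the output space is large and the function is collision-resistant.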
A “collision-resistant” hash function is a hash function for which there is no known tractable method for examining the function and identifying input strings that collide. There are several known collision-resistant hash functions in current use, of which the most widely used is called SHA-1. This function is described in detail in Federal Information Processing Standard Publication 180-1 (SHA), April 1995. The SHA-1 function generates a 160-bit hash of an input bit string. However, recently it has been shown to be less collision-resistant than originally believed. See, for example, the discussion by X. Wang, Y. L. Yin, and H. Yu at website theory.csail.mit.edu/~yiqun/shanote, February 2005. Another hash function in wide use is called MD5 and is described in “The MD5 message-digest algorithm”, R. Rivest, available at website faqs.org/rfcs/rfc1321.html, April 1992. However, this function is no longer believed to be collision-resistant. See “How to break MD5 and other hash functions”, X. Wang and H. Yu, Proceedings of Eurocrypt 2005, May 2005.
In the context of different applications, collision-resistant hash functions are referred to as “message digest” or “fingerprinting” functions. A message digest function produces a message digest as an output while a fingerprinting function produces a fingerprint as an output. For the purposes of the discussion below the term “fingerprinting” will be used. A good fingerprinting function has two important properties. First, if two input strings produce different fingerprints, then the inputs themselves must be different. This property follows from the requirement that the fingerprinting function be deterministic. Second, if two input strings produce the same fingerprint, then it is extremely likely that the two inputs are identical. This property follows from the requirement that the fingerprinting function be collision-resistant.
These two properties provide the basis for the technique known as compare-by-hash or compare-by-fingerprint. This technique treats the fingerprint of an item as its identity. Thus, if two items have different fingerprints, then they must be different, and if they have the same fingerprint then they are assumed to be the same. This technique is very useful in bandwidth-limited systems because arbitrarily large items can be compared by transferring the fingerprints of the items to a common location where they can be compared to determine whether the items are the same rather than requiring the entire items to be transferred. The technique is also useful for commitment protocols.
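A minimal sketch of compare-by-fingerprint follows, assuming SHA-256 as the fingerprinting function. The names `fingerprint` and `probably_same` are hypothetical. The point is that only the short fingerprint of the remote item needs to be transferred, not the item itself.

```python
import hashlib


def fingerprint(item: bytes) -> str:
    """Return the SHA-256 fingerprint of an item as 64 hex characters."""
    return hashlib.sha256(item).hexdigest()


def probably_same(local_item: bytes, remote_fingerprint: str) -> bool:
    """Compare a local item with a remote one using only the remote fingerprint.

    Different fingerprints prove the items differ. Equal fingerprints mean the
    items are assumed to be the same.
    """
    return fingerprint(local_item) == remote_fingerprint


# The remote side sends 64 hex characters rather than a megabyte of data.
remote_item = b"x" * 1_000_000
remote_fp = fingerprint(remote_item)

print(probably_same(b"x" * 1_000_000, remote_fp))        # True: identical content
print(probably_same(b"x" * 999_999 + b"y", remote_fp))   # False: one byte differs
```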
In particular, the compare-by-hash technique forms the basis for content-addressed storage. In a content-addressed storage system, the fingerprint of an item is used as the logical address of that item in the storage. Depending on the implementation of these systems, an “item” might correspond to a fixed-length data block, a string of arbitrary length (such as a file) or a higher-order object. Since the content of each item defines its logical address, each item stored by the system is stored exactly once. For example, when a large file is copied, no new storage space is required to store the contents of the file copy because the contents of the file copy will have the same fingerprint as the contents of the original file and will thus be stored at the same logical address as the contents of the original file. Thus the cost (in time and space) to copy a large file is reduced to the time required to update the namespace of the file system.
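The file-copy example above can be sketched with a toy in-memory store. This is an illustration under stated assumptions rather than a description of any particular system: the class `ContentStore`, the `namespace` dictionary, and the use of SHA-256 are all hypothetical choices.

```python
import hashlib


class ContentStore:
    """Toy in-memory store in which an item's fingerprint is its address."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def store(self, item: bytes) -> str:
        address = hashlib.sha256(item).hexdigest()
        # Identical content maps to the same address, so it is kept only once.
        self._items.setdefault(address, item)
        return address

    def load(self, address: str) -> bytes:
        return self._items[address]

    def item_count(self) -> int:
        return len(self._items)


store = ContentStore()
namespace: dict[str, str] = {}  # file name -> logical address

contents = b"contents of a large file"
namespace["report.txt"] = store.store(contents)
# Storing the copy yields the same address, so no new space is consumed.
namespace["report-copy.txt"] = store.store(contents)

print(namespace["report.txt"] == namespace["report-copy.txt"])  # True
print(store.item_count())  # 1
```

In this sketch the only state added by the copy is the second namespace entry.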
The first well-known implementations of content-addressed storage are the Venti storage system and the Centera system. Venti is described in “Venti: a new approach to archival storage”, S. Quinlan and S. Dorward, Proceedings of the First USENIX Conference on File and Storage Technologies (FAST'02), pages 89-101, Berkeley, Calif., USA, January 2002, The USENIX Association. The Centera system was developed by the EMC Corporation, Hopkinton, Mass., and is described in “EMC content addressed storage system” at website emc.com/products/systems/centera.jsp, 2003 (Centera is a trademark of the EMC Corporation). The technique of using content-addressed storage has also been embraced by wide-area replicated storage systems like the Pond and OceanStore systems described in “Pond: The OceanStore prototype”, S. C. Rhea, P. R. Eaton, D. Geels, H. Weatherspoon, B. Y. Zhao, and J. Kubiatowicz, Proceedings of the FAST '03 Conference on File and Storage Technologies, April 2003.
The basic protocol for storing an item in such a content-addressed storage system may be defined in terms of two primitive operations, which are called “lookup” and “put”. These lookup and put operations are discussed below in the context of a client and server storage system. In this context, the client is the initiator of a storage operation. The server is responsible for implementing the data storage and related data structures necessary to implement the functionality required by the content-addressed system. This terminology does not imply that the client and the server are running on different machines or even in different processes, but is only meant to describe the underlying protocol. For notational convenience, specific instances of fingerprints are described as either “candidate” or “valid” addresses. All fingerprints that are generated by the client are considered to be candidate addresses. If a fingerprint matches the fingerprint for an item that is stored on the server, then it is a valid address of an item stored on that server.
The purpose of a lookup operation is to determine whether or not the server already contains a copy of the item that the client wishes to store. The client presents the server with a candidate address, and the server tells the client whether or not that address is valid. If the item exists (the address is valid) then there is no need for the client to perform any further action.
A conventional lookup operation has two steps. First, the client computes the fingerprint from the content of the item it wishes to store, and sends the fingerprint to the server. Second, the server compares this fingerprint to the set of fingerprints for items it has already stored. If the fingerprint is in this set, then the server sends a copy of the valid address for the item to the client. Otherwise, the server informs the client that it has no such item and sends the client a “ticket” containing the client-computed fingerprint that allows the client to send a copy of that item to the server for storage.
The purpose of a put operation is to actually copy an item to the server. A conventional put operation also has two steps. In the first step, the client sends the item to the server together with a ticket it previously obtained. Once the server has a copy of the item, it recomputes the fingerprint from the item copy and checks that the recomputed fingerprint matches the fingerprint contained in the ticket. If the two fingerprints do not match, the server rejects the put operation. In the second step, the server writes the item to storage and sends a copy of the valid address to the client.
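The conventional lookup and put operations described above can be sketched as follows. This is a minimal in-process illustration, not a definitive implementation: the types `Ticket` and `LookupResult`, the `Server` class, the `client_store` helper, and the choice of SHA-256 are all assumptions introduced for the example. Here the ticket simply wraps the client-computed fingerprint.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def fingerprint(item: bytes) -> str:
    return hashlib.sha256(item).hexdigest()


@dataclass(frozen=True)
class Ticket:
    """Hypothetical ticket carrying the client-computed (candidate) fingerprint."""
    candidate: str


@dataclass(frozen=True)
class LookupResult:
    valid_address: Optional[str] = None  # set when the server already has the item
    ticket: Optional[Ticket] = None      # set when the client should send the item


class Server:
    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def lookup(self, candidate: str) -> LookupResult:
        # Compare the candidate address with the fingerprints already stored.
        if candidate in self._items:
            return LookupResult(valid_address=candidate)
        return LookupResult(ticket=Ticket(candidate))

    def put(self, ticket: Ticket, item: bytes) -> str:
        # Recompute the fingerprint and reject the put on a mismatch.
        if fingerprint(item) != ticket.candidate:
            raise ValueError("item does not match the fingerprint in the ticket")
        self._items[ticket.candidate] = item
        return ticket.candidate


def client_store(server: Server, item: bytes) -> str:
    """Client side: look up first, and put only if the server lacks the item."""
    result = server.lookup(fingerprint(item))
    if result.valid_address is not None:
        return result.valid_address
    assert result.ticket is not None
    return server.put(result.ticket, item)


server = Server()
first = client_store(server, b"hello")   # lookup misses, then put
second = client_store(server, b"hello")  # lookup hits, no put needed
print(first == second)  # True
```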
System designers who use content-addressed storage (or compare-by-hash techniques) typically have two concerns related to the underlying assumptions of these techniques. First, if the number of items stored grows beyond the original design goals, the probability that two different items will have the same fingerprint may become uncomfortably large. This concern may be reduced by using a fingerprinting function that has a very large space (for example, a 256- or 512-bit output). It is difficult to imagine any application (even one that is global in scale and long-lived) that will create enough unique items to make it likely that a well-designed fingerprinting function will encounter collisions in a space this large. However, it has already been shown that some contemporary fingerprinting functions have a smaller effective space than the size of their output would suggest, and so there may be cause for concern.
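To give a rough sense of scale, the standard birthday-bound approximation estimates the probability that at least two of n items share a fingerprint as 1 - exp(-n(n-1) / 2^(b+1)). This assumes an idealized fingerprinting function whose b-bit outputs are uniformly distributed. The sketch below evaluates that estimate. The function name `collision_probability` and the item count used are illustrative assumptions, and `bits` should be read as the effective size of the space, which may be smaller than the nominal output size.

```python
import math


def collision_probability(n_items: int, bits: int) -> float:
    """Birthday-bound estimate: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = n_items * (n_items - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)


# Estimated collision probability for 10^18 unique items at several sizes.
for bits in (128, 160, 256, 512):
    print(bits, collision_probability(10**18, bits))
```

Under this approximation the probability falls by roughly a factor of two for each additional bit of effective output, so any weakness that shrinks the effective space raises the risk accordingly.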
Second, the fingerprinting function may be “broken” by the discovery of a method for identifying colliding strings and it may therefore become feasible for a malicious party to use that method to create items that have the same fingerprints as other different items. Depending on the nature of the storage system and the ability of the malicious party to create such objects, that party may be able to corrupt or violate the correctness properties of the system. The argument has been made that, because a possibility of collisions exists, compare-by-hash and content-addressed techniques should not be used in situations that require absolute data integrity. The counterargument has been made that the probability of such problems occurring is so small that the benefit of these techniques far outweighs the practical risks.
Consequently, it would be useful if the fingerprinting function could be changed in conventional systems in order to correct problems that appear after the storage system has been in operation. However, changing the fingerprinting function in a conventional content-addressed storage system would require re-fingerprinting the entire data set and, in some conventional systems, may invalidate the contents of the data set itself. For example, if fingerprints are encoded within any data objects stored on the server, there is no way for the storage system to determine that these fingerprints must be updated when the fingerprinting function is changed. It is possible to leave the old fingerprints in place, but it is then necessary for the storage system to be involved in every fingerprint comparison.
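The problem with embedded fingerprints can be illustrated with a small sketch. Everything here is an assumption made for the example: the parent object format (`child=<address>`), the helper names, and the move from SHA-1 to SHA-256. The store treats item contents as opaque bytes, so re-fingerprinting changes every address without touching the old address buried inside the parent.

```python
import hashlib
from typing import Callable

Fingerprinter = Callable[[bytes], str]


def sha1_fp(item: bytes) -> str:
    return hashlib.sha1(item).hexdigest()


def sha256_fp(item: bytes) -> str:
    return hashlib.sha256(item).hexdigest()


def build_store(items: list[bytes], fp: Fingerprinter) -> dict[str, bytes]:
    """Address every item by its fingerprint under the given function."""
    return {fp(item): item for item in items}


child = b"chunk of file data"
# Hypothetical parent object that embeds the child's address in its own content.
parent = b"child=" + sha1_fp(child).encode()

old_store = build_store([child, parent], sha1_fp)
# Changing the function means re-fingerprinting the entire data set.
new_store = build_store(list(old_store.values()), sha256_fp)

embedded = parent.split(b"=", 1)[1].decode()
print(embedded in old_store)  # True: the embedded address resolves
print(embedded in new_store)  # False: the embedded address is now stale
```

Keeping a table from old to new addresses would let the stale reference resolve, which corresponds to leaving the old fingerprints in place at the cost of involving the storage system in every comparison.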