The performance of a modern file system depends upon assumptions about the structure of the file sets that it will store. File systems are not well suited to storing large sets of files with randomly chosen names or randomly chosen pathnames. An object storage system is similar to a file system but without the hierarchical directory structure. Objects may be named in an essentially random manner. Using an ordinary file system as an object storage system, to store hundreds of millions or billions of randomly named objects, results in very poor performance.
If the set of object names is large and the names themselves are large, a complete list of names will not fit into random access memory. The straightforward alternative is to implement a hash table on disk, as is done for example in the Venti storage system described in Sean Quinlan and Sean Dorward, “Venti: a new approach to archival storage,” in the Proceedings of the Conference on File and Storage Technologies (2002). This approach requires at least one access to an essentially randomly chosen disk location in order to get a pointer to the location of the object itself on disk.
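The cost described above can be made concrete with a minimal sketch of a disk-resident hash index. This is not Venti's actual on-disk layout; the class name `DiskHashIndex`, the 40-byte slot format, and the use of linear probing are all assumptions chosen for illustration. A `seeks` counter stands in for random disk accesses, and an in-memory `io.BytesIO` object stands in for the disk.

```python
import hashlib
import io
import struct

# Hypothetical slot layout: 32-byte object name, then an 8-byte offset.
# The offset is stored plus one so that an all-zero slot means "empty".
SLOT = struct.Struct(">32sQ")


class DiskHashIndex:
    """Fixed-size open-addressing hash table kept in a seekable file."""

    def __init__(self, f, num_slots):
        self.f = f
        self.num_slots = num_slots
        self.seeks = 0  # counts slot reads, a stand-in for random disk accesses
        f.seek(0)
        f.write(b"\0" * (SLOT.size * num_slots))

    def _read(self, slot):
        self.f.seek(slot * SLOT.size)
        self.seeks += 1
        return SLOT.unpack(self.f.read(SLOT.size))

    def _home(self, name):
        # Names are already pseudo-random, so their leading bytes pick the slot.
        return int.from_bytes(name[:8], "big") % self.num_slots

    def insert(self, name, offset):
        slot = self._home(name)
        for _ in range(self.num_slots):
            stored, off = self._read(slot)
            if off == 0 or stored == name:
                self.f.seek(slot * SLOT.size)
                self.f.write(SLOT.pack(name, offset + 1))
                return
            slot = (slot + 1) % self.num_slots  # linear probing
        raise RuntimeError("index full")

    def lookup(self, name):
        """Return the stored object offset, or None if the name is absent."""
        slot = self._home(name)
        for _ in range(self.num_slots):
            stored, off = self._read(slot)
            if off == 0:
                return None
            if stored == name:
                return off - 1
            slot = (slot + 1) % self.num_slots
        return None


index = DiskHashIndex(io.BytesIO(), num_slots=256)
names = [hashlib.sha256(str(i).encode()).digest() for i in range(100)]
for i, name in enumerate(names):
    index.insert(name, offset=i * 4096)

index.seeks = 0
found = index.lookup(names[42])
seeks_for_one_lookup = index.seeks
```

Every lookup performs at least one slot read at a position determined by the name, before the object itself can be fetched from the returned offset.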
Some object storage systems use a cryptographic hash of a block of data to name the block. A cryptographic hash is a function that deterministically computes a fixed width pseudo-random number (sometimes called a message digest or a fingerprint) from an input of any size. For example, the output of the SHA-256 cryptographic hashing algorithm is 256 bits wide (see National Institute of Standards and Technology, NIST FIPS PUB 180-2, “Secure Hash Standard,” U.S. Department of Commerce, August 2002).
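As a brief illustration of hash-based naming, the following sketch uses SHA-256 from Python's standard `hashlib` module. The helper name `block_name` is hypothetical and is not taken from any of the systems cited here.

```python
import hashlib


def block_name(data: bytes) -> str:
    """Name a block by the hexadecimal SHA-256 digest of its contents."""
    return hashlib.sha256(data).hexdigest()


name_a = block_name(b"hello world")
name_b = block_name(b"hello world")   # identical content
name_c = block_name(b"hello world!")  # one byte longer
```

Identical blocks always receive the same name, while any change to the content yields an unrelated name. A 256-bit digest is rendered as 64 hexadecimal characters regardless of the input size.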
The Venti storage system is an example of an object storage system that uses a cryptographic hash of a block of data to name the block. In the Venti storage system, storage space is conserved by avoiding storing duplicate copies of identical blocks, which have identical object names. Another example of a storage system that uses cryptographic hashes for block naming is described in Margolus et al., “A Data Repository and Method for Promoting Network Storage of Data,” US 2002/0038296 A1, Mar. 28, 2002. This second example supports a network protocol that allows bandwidth to be conserved in storing hash-named blocks of data by answering a query as to whether the name already exists in the storage system, and only sending the block if it does not. Supporting this kind of protocol well requires a storage system that can answer a query about the existence or non-existence of one object out of a very large set of objects efficiently and quickly.
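The query-before-send exchange described above can be sketched as follows. This is an illustrative model only, not the protocol of the cited application; the names `BlockStore`, `has`, `put`, and `upload` are assumptions, and an in-memory dictionary stands in for the remote storage system.

```python
import hashlib


class BlockStore:
    """Toy stand-in for a remote store that names blocks by SHA-256."""

    def __init__(self):
        self._blocks = {}
        self.bytes_received = 0  # payload bytes that crossed the "network"

    def has(self, name: str) -> bool:
        """Existence query: only the short name is transmitted."""
        return name in self._blocks

    def put(self, name: str, data: bytes) -> None:
        if hashlib.sha256(data).hexdigest() != name:
            raise ValueError("name does not match block contents")
        self.bytes_received += len(data)
        self._blocks.setdefault(name, data)


def upload(store: BlockStore, data: bytes) -> bool:
    """Send the block only if the store lacks it; return True if it was sent."""
    name = hashlib.sha256(data).hexdigest()
    if store.has(name):
        return False
    store.put(name, data)
    return True


store = BlockStore()
block = b"x" * 8192
first_sent = upload(store, block)   # store is empty, so the block is sent
second_sent = upload(store, block)  # name already exists, so nothing is sent
```

The second upload costs only the existence query, which is why the speed of that query matters for the protocol as a whole.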
This is the problem of detecting set membership. One of the earliest and most important contributions to this subject came from Burton H. Bloom in “Space/Time Tradeoffs in Hash Coding with Allowable Errors,” Communications of the ACM, July 1970. He observed that the problem can be simplified by allowing a small rate of false positive answers, which then need to be resolved using some other mechanism. His hashing technique requires about $r \log_2 e$ bits of storage per element of the set represented, in order to have a false-positive rate of $2^{-r}$. Note that this storage requirement depends only on the number of elements in the set, and not on how big the elements are. Bloom's technique (now called a Bloom Filter) is widely used today. It does not, however, provide a mechanism for indexing the data and finding it, only for testing whether it exists.
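A minimal sketch of such a filter is given below, sized according to the storage figure quoted above: about r·log2(e) bits per element, with r probe positions per element, for a target false-positive rate of about 2^(-r). The class name `BloomFilter` and the derivation of probe positions from a single SHA-256 digest by double hashing are implementation assumptions, not details from Bloom's paper.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items: int, r: int):
        self.k = r  # number of probe positions per element
        # About r * log2(e) bits per element, as in the sizing rule above.
        self.m = max(1, math.ceil(expected_items * r * math.log2(math.e)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: bytes):
        # Assumption: derive k positions from one digest via double hashing.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))


bloom = BloomFilter(expected_items=1000, r=8)
members = [f"member-{i}".encode() for i in range(1000)]
for item in members:
    bloom.add(item)

non_members = [f"other-{i}".encode() for i in range(10000)]
false_positives = sum(1 for item in non_members if item in bloom)
```

With r = 8 the filter uses roughly 11.5 bits per element, and the expected false-positive rate is on the order of 1/256. A positive answer must still be confirmed by some other mechanism, and the filter gives no indication of where the element is stored.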
In the domain of text indexing and searching, the problem of efficiently storing indexes for large collections of text records has been studied. One technique used there is Inverted File Indexing, which is described for example in the book by Witten, Moffat and Bell, “Managing Gigabytes,” Morgan Kaufmann (1999). This technique involves sorting record numbers in the index and only representing differences in lists of record numbers. It would not, however, save a significant fraction of the space in an index involving a sparse space of record numbers, as is the case with long hash-based names, because the differences between successive sorted names remain nearly as wide as the names themselves.
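The limitation can be illustrated with a small experiment. The sketch below sorts a list of identifiers, takes successive differences, and counts the bits needed to write each difference, ignoring any length-prefix overhead. The helper names `gaps` and `gap_bits` are hypothetical, and the bit count is a rough lower bound rather than the size produced by any particular compression code from the cited book.

```python
import hashlib


def gaps(ids):
    """Differences between successive identifiers after sorting."""
    out, prev = [], 0
    for x in sorted(ids):
        out.append(x - prev)
        prev = x
    return out


def gap_bits(ids):
    """Total bits needed to write every gap in binary (no framing overhead)."""
    return sum(max(1, g.bit_length()) for g in gaps(ids))


# Dense record numbers, as in a text-retrieval index.
dense = list(range(1000, 2000))
# Sparse 256-bit identifiers, as with hash-based object names.
sparse = [int.from_bytes(hashlib.sha256(str(i).encode()).digest(), "big")
          for i in range(1000)]

dense_bits = gap_bits(dense)
sparse_bits = gap_bits(sparse)
sparse_raw_bits = 256 * len(sparse)
savings_fraction = 1 - sparse_bits / sparse_raw_bits
```

For the dense list nearly every gap is 1, so the encoding shrinks to about one bit per record. For the hash-based names, sorting n names removes only about log2(n) bits from each 256-bit name, so the saving for 1000 names is only a few percent.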
In addition to the problem of indexing randomly named objects, there is also the problem of organizing their storage on disk for efficient access and modification. The Venti storage system uses an append-log structure and makes no provision for ever changing, deleting or rearranging the stored items on disk. Although Venti was designed for archival storage, the lack of deletion capability is a significant drawback when archiving sensitive data that must, under law, be retained for some period of time but can then be deleted.