Conventional data storage systems, such as conventional file systems, organise and index pieces of data by name. These conventional systems make no attempt to identify and eliminate repeated pieces of data within the collection of files they store. Depending on the pattern of storage, a conventional file system might contain a thousand copies of the same megabyte of data in a thousand different files.
A reduced-redundancy storage system reduces the occurrence of duplicate copies of the same data by partitioning the data it stores into subblocks and then detecting and eliminating duplicate subblocks. A method for partitioning data into subblocks for the purpose of communication and storage is described in U.S. Pat. No. 5,990,810 by Ross Williams (also the inventor of the invention described here), which is incorporated by reference into this specification.
In a reduced-redundancy computer storage system, each BLOB (Binary Large Object: a finite sequence of zero or more bytes or bits) is represented as a sequence of subblocks drawn from a pool of subblocks.
FIG. 1 (prior art) shows a pool of subblocks 10 indexed by a subblock index. By maintaining an index of subblocks 12, a storage system can determine whether a new subblock is already present in the storage system and, if it is, determine its location. The storage system can then create a reference to the existing subblock rather than storing the same subblock again. FIG. 2 shows how the representations of two different BLOBs 20, 22 can both refer to the same subblocks in the pool 24, thereby saving space. This sharing enables the storage system to store the data in less space than is taken up by the original data.
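The arrangement of FIG. 1 and FIG. 2 can be illustrated with a minimal sketch. It is not the partitioning method of U.S. Pat. No. 5,990,810: for brevity it assumes fixed-size subblocks of four bytes, an in-memory list as the subblock pool, and a dictionary keyed by a 128-bit BLAKE2b hash as the subblock index. The names `SubblockStore`, `store_blob`, `read_blob` and `SUBBLOCK_SIZE` are hypothetical.

```python
import hashlib

# Assumption: tiny fixed-size subblocks, chosen only to keep the example short.
SUBBLOCK_SIZE = 4


class SubblockStore:
    """Toy reduced-redundancy store: a subblock pool plus a subblock index."""

    def __init__(self):
        self.pool = []    # subblock pool: each distinct subblock is stored once
        self.index = {}   # subblock index: 128-bit hash -> position in the pool

    def store_blob(self, blob: bytes) -> list:
        """Store a BLOB and return it as a list of references into the pool."""
        refs = []
        for i in range(0, len(blob), SUBBLOCK_SIZE):
            subblock = blob[i:i + SUBBLOCK_SIZE]
            key = hashlib.blake2b(subblock, digest_size=16).digest()
            pos = self.index.get(key)
            if pos is None:
                # New subblock: add it to the pool and record it in the index.
                pos = len(self.pool)
                self.pool.append(subblock)
                self.index[key] = pos
            # Known subblock: only a reference to the existing copy is kept.
            refs.append(pos)
        return refs

    def read_blob(self, refs) -> bytes:
        """Rebuild a BLOB from its list of subblock references."""
        return b"".join(self.pool[r] for r in refs)


store = SubblockStore()
blob_a = b"AAAABBBBCCCC"
blob_b = b"BBBBCCCCDDDD"
refs_a = store.store_blob(blob_a)
refs_b = store.store_blob(blob_b)
```

In this sketch the two BLOBs total 24 bytes, but the pool holds only four distinct subblocks (16 bytes), because the subblocks `BBBB` and `CCCC` are shared between the two representations.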
The subblock index 26 should contain an entry for each subblock. Each entry provides information that identifies the subblock (distinguishes it from all others) and information about the location of the subblock within the subblock pool. These entries can consume a significant amount of space. For example, if 128-bit (16-byte) hashes of subblocks were used as subblock identifiers, and 128-bit (16-byte) subblock storage addresses were used as addresses, then each entry would occupy 32 bytes. If the mean subblock length were 1024 bytes, the index would be about 3% of the size of the data actually stored (32/1024). A storage system containing one terabyte of data would therefore require a subblock index of about 30 gigabytes (3% of 1 TB).
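The arithmetic behind these figures can be checked with a short calculation. The sketch below uses the entry layout and mean subblock length given above; treating one terabyte as 2^40 bytes is an assumption made here, and the constant and function names are hypothetical.

```python
HASH_BYTES = 16             # 128-bit subblock hash used as the identifier
ADDRESS_BYTES = 16          # 128-bit subblock storage address
MEAN_SUBBLOCK_BYTES = 1024  # mean subblock length from the example

ENTRY_BYTES = HASH_BYTES + ADDRESS_BYTES

# Index size as a fraction of the data stored (the "about 3%" figure).
overhead = ENTRY_BYTES / MEAN_SUBBLOCK_BYTES


def index_bytes(store_bytes: int) -> int:
    """Approximate index size: one entry per mean-length subblock."""
    return (store_bytes // MEAN_SUBBLOCK_BYTES) * ENTRY_BYTES


# Assumption: one terabyte taken as 2**40 bytes.
ONE_TERABYTE = 1024 ** 4
index_for_one_terabyte = index_bytes(ONE_TERABYTE)
```

Under these assumptions the overhead is 3.125% and the index for a one-terabyte store is 32 × 2^30 bytes, which matches the rounded figures of about 3% and about 30 gigabytes.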
The requirement to maintain an index whose size is of the order of 3% of the size of the store would not matter much if the index could be kept on disk. However, in reduced-redundancy storage systems the index is consulted very frequently, because each new BLOB to be stored must be divided into subblocks, and many of the subblocks (or their hashes) must be looked up in the index. If the mean subblock length is 1024 bytes, then storing a twenty-megabyte block of data may require dividing the data into about 20,480 subblocks and performing an index lookup on each. If the index is on disk, this may involve at least 20,000 random-access seek operations, which are far slower than the same number of memory accesses. If the index is held in memory rather than on disk, the system will run much faster. However, memory (RAM) is far more expensive than disk space, and the requirement that the RAM/disk ratio be of the order of 3% can be onerous for large stores.
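The scale of the difference can be estimated with a rough calculation. The subblock count follows from the figures above; the per-operation latencies (10 ms per disk seek, 100 ns per memory access) are illustrative assumptions introduced here, not values from this specification, and the names are hypothetical.

```python
MEAN_SUBBLOCK_BYTES = 1024
BLOB_BYTES = 20 * 1024 * 1024   # a twenty-megabyte block of data

# One index lookup per subblock.
lookups = BLOB_BYTES // MEAN_SUBBLOCK_BYTES

# Assumed latencies, in nanoseconds (illustrative only).
DISK_SEEK_NS = 10_000_000       # assumption: 10 ms per random-access seek
RAM_ACCESS_NS = 100             # assumption: 100 ns per memory access


def total_lookup_ns(count: int, per_lookup_ns: int) -> int:
    """Total time for `count` lookups, ignoring caching and batching."""
    return count * per_lookup_ns


disk_total_ns = total_lookup_ns(lookups, DISK_SEEK_NS)
ram_total_ns = total_lookup_ns(lookups, RAM_ACCESS_NS)
```

With these assumed latencies, the 20,480 lookups would take roughly 205 seconds against an on-disk index and roughly 2 milliseconds against an in-memory index. The actual ratio depends on the hardware and on any caching.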
Aspects of the present invention provide an indexing method that consumes far less memory than the approach just described, in which the entire index is held in memory.