Organizations are accumulating large amounts of electronic data. To facilitate the storage of such data, data storage systems need to manage increasingly large numbers of data objects (e.g., files, documents, records, etc.) and associate attributes with these objects. Examples of attributes that may be associated with an object include properties of the object that are visible to a user of the system (e.g., access control information, last access time, etc.) and properties of the object that are used by the system to manage the object (e.g., location of the object in the system, checksum of the object, etc.).
For example, when an object is no longer in use, it is desirable to reclaim the resources held by the object so that those resources can be reused. To facilitate such reclamation, it is often necessary to associate with each object a count of the number of references to the object, or an indicator of whether the object is still in use (alive). In many cases, there is a level of indirection (or virtualization) such that an object is used by reference through another object. For clarity, the object that is referred to is called a physical data object, and the object that refers to it is called a logical object. In such cases, the physical data object is alive only if the system currently contains a logical object that refers to it. For example, in a file system, a chunk of data is alive only if it is associated with a file that currently exists in the file system.
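The liveness rule described above can be illustrated with a minimal sketch. It assumes a toy in-memory model in which files are the logical objects and chunks are the physical data objects; the class name `ChunkStore` and its methods are hypothetical and are not taken from any particular storage system.

```python
from collections import Counter


class ChunkStore:
    """Toy model: files (logical objects) refer to chunks (physical objects)."""

    def __init__(self):
        self.files = {}            # file name -> list of chunk ids it refers to
        self.refcount = Counter()  # chunk id -> number of references from files

    def write_file(self, name, chunk_ids):
        # Overwriting a file first drops the references held by the old version.
        self.delete_file(name)
        self.files[name] = list(chunk_ids)
        self.refcount.update(chunk_ids)

    def delete_file(self, name):
        for chunk_id in self.files.pop(name, []):
            self.refcount[chunk_id] -= 1
            if self.refcount[chunk_id] == 0:
                # No file refers to the chunk any more: it is dead and
                # its resources could be reclaimed at this point.
                del self.refcount[chunk_id]

    def is_alive(self, chunk_id):
        # A chunk is alive only if at least one existing file refers to it.
        return self.refcount[chunk_id] > 0


store = ChunkStore()
store.write_file("a.txt", ["c1", "c2"])
store.write_file("b.txt", ["c2", "c3"])
store.delete_file("a.txt")
print(store.is_alive("c1"))  # False: only a.txt referred to c1
print(store.is_alive("c2"))  # True: b.txt still refers to c2
```

In this sketch the reference count is itself an attribute that must be kept for every chunk, which is why the size of the structure holding such attributes matters as the number of chunks grows.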
Some form of index structure is needed to associate attributes with objects. As the number of objects in a system increases, the index structure becomes very large, and it becomes difficult and expensive to use the index structure to look up object attributes quickly. In deduplicating storage systems such as those provided by Data Domain Inc. of Santa Clara, Calif., there could be millions of files and billions of chunks (also referred to as segments) of data shared among multiple files and within each file, so that associating attributes with each file and/or segment requires a very large index.
To reduce the size of the index structure, one approach is to use probabilistic index structures that maintain the correct association between objects and attributes most of the time. For example, a Bloom filter may be used to indicate whether a segment is alive. The Bloom filter, however, is still relatively large when there are many physical objects, and it introduces false positives, so that a dead physical object may be deemed to be alive.
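Both drawbacks can be seen in a small sketch. The code below is an illustrative Bloom filter, not the structure used by any particular product; the class `BloomFilter`, the helper `optimal_bits`, the SHA-256 based hashing, and all parameter values are assumptions chosen for the example. The sizing helper uses the standard estimate m = -n ln(p) / (ln 2)^2 for the number of bits needed to hold n keys at false positive rate p.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter recording which segments are alive."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True means "possibly alive"; False means "definitely not added".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def optimal_bits(num_keys, false_positive_rate):
    """Standard estimate of the bits needed for a target false positive rate."""
    return math.ceil(-num_keys * math.log(false_positive_rate) / math.log(2) ** 2)


# A deliberately undersized filter makes false positives easy to observe.
bf = BloomFilter(num_bits=64, num_hashes=3)
live = [f"seg-{i}" for i in range(50)]
for segment in live:
    bf.add(segment)

dead = [f"dead-{i}" for i in range(1000)]
false_positives = sum(bf.might_contain(s) for s in dead)
print("dead segments reported alive:", false_positives)

# Size needed for one billion segments at a 1% false positive rate.
print("bytes for 1e9 segments at 1%:", optimal_bits(10**9, 0.01) // 8)
```

Under these assumptions, a filter sized for one billion segments at a 1% false positive rate needs on the order of 1.2 GB, and every false positive corresponds to a dead segment whose resources are not reclaimed. The filter never reports a live segment as dead, since bits are only ever set.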