A perfect hash function is a hash function that maps the distinct elements of a set S to a set of integers with no collisions, where a collision occurs when two different inputs produce the same hash value or fingerprint. A perfect hash vector (PHVEC) built on a perfect hash function is a very memory-efficient data structure that maps each key in a known set to a unique vector position. In certain garbage collection processes, perfect hash vectors are used to track the liveness of data segments. Perfect hash vectors are more space-efficient than comparable structures. For example, a simple PHVEC that provides two states (live or dead) per key requires only 2.8 bits per key, whereas a Bloom filter performing the same function requires at least six bits per key and still produces collisions on the order of 5 to 10%.
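The mapping from a known key set to unique vector positions can be illustrated with a minimal sketch. The sketch assumes a simple hash-and-displace construction, which is one common way to build a perfect hash function and is not necessarily the construction used in any particular file system. The class name `ToyPerfectHashVector`, its methods, the bucket sizing, and the amount of slack in the vector are all hypothetical choices made for illustration. Because the sketch stores its seeds as ordinary Python integers, it does not approach the 2.8 bits per key cited above.

```python
import hashlib


def _hash(key: bytes, seed: int, modulus: int) -> int:
    """Seeded 64-bit hash of `key`, reduced to the range [0, modulus)."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(key, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little") % modulus


class ToyPerfectHashVector:
    """Illustrative PHVEC: one live/dead bit per key of a known key set."""

    def __init__(self, keys):
        keys = list(dict.fromkeys(keys))  # drop duplicates, keep order
        # Assumption: ~25% slack slots so that construction finishes quickly.
        self.size = len(keys) + len(keys) // 4 + 1
        self.num_buckets = len(keys) // 4 + 1

        # Step 1: a first-level hash (seed 0) splits the keys into buckets.
        buckets = [[] for _ in range(self.num_buckets)]
        for key in keys:
            buckets[_hash(key, 0, self.num_buckets)].append(key)

        # Step 2: for each bucket, largest first, search for a seed that
        # sends all of its keys to distinct, still-unoccupied positions.
        self.seeds = [0] * self.num_buckets
        occupied = [False] * self.size
        for bucket in sorted(buckets, key=len, reverse=True):
            if not bucket:
                break  # only empty buckets remain
            seed = 1
            while True:
                slots = [_hash(key, seed, self.size) for key in bucket]
                distinct = len(set(slots)) == len(slots)
                if distinct and not any(occupied[s] for s in slots):
                    break
                seed += 1
            self.seeds[_hash(bucket[0], 0, self.num_buckets)] = seed
            for s in slots:
                occupied[s] = True

        # The vector itself: one bit per position, all initially dead.
        self.bits = bytearray((self.size + 7) // 8)

    def position(self, key: bytes) -> int:
        """Unique position for a known key. The keys are not stored, so an
        unknown key silently maps to some arbitrary position."""
        seed = self.seeds[_hash(key, 0, self.num_buckets)]
        return _hash(key, seed, self.size)

    def set_live(self, key: bytes) -> None:
        pos = self.position(key)
        self.bits[pos >> 3] |= 1 << (pos & 7)

    def clear(self, key: bytes) -> None:
        pos = self.position(key)
        self.bits[pos >> 3] &= ~(1 << (pos & 7)) & 0xFF

    def is_live(self, key: bytes) -> bool:
        pos = self.position(key)
        return bool(self.bits[pos >> 3] & (1 << (pos & 7)))


# Hypothetical usage: 200 made-up segment fingerprints, one marked live.
fingerprints = [f"segment-{i}".encode() for i in range(200)]
vec = ToyPerfectHashVector(fingerprints)
vec.set_live(b"segment-7")
```

In this sketch every key must be available while the vector is being built, and each operation reads a seed entry before it touches the bit vector, which hints at why a PHVEC operation can require more memory accesses than a single Bloom filter probe.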
Despite having a small memory footprint, perfect hash vectors do have certain costs. First, in a known backup file system (e.g., the EMC Data Domain file system), each PHVEC operation may take up to four random memory accesses, as compared to just one access in the case of a block Bloom filter implementation. Second, it can take several hours to enumerate all the keys in the file system in order to create the PHVEC. Third, all of the keys must be known and kept in memory for the PHVEC creation. In physical garbage collection system implementations, a PHVEC may be used to track the liveness of the LP (metadata) segments only. In newer garbage collection systems, such as perfect physical garbage collection (PPGC) systems, a PHVEC may be used to implement both the LP vector and the live vector, which can be hundreds or thousands of times larger than the LP vector. New techniques are thus needed to optimize both the creation of the PHVEC and all of the PHVEC operations, including insertion, lookup, and deletion.
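The trade-off between footprint and access cost can be made concrete with simple arithmetic. The sketch below uses only the per-key and per-operation figures quoted above; the key count `NUM_KEYS` and the helper names are hypothetical and chosen purely for illustration, and one operation per key is an assumed workload.

```python
import math
from fractions import Fraction

# Figures quoted in the text above.
PHVEC_BITS_PER_KEY = Fraction("2.8")
BLOOM_BITS_PER_KEY = Fraction(6)   # "at least six bits per key"
PHVEC_ACCESSES_PER_OP = 4          # "up to four", so this is the worst case
BLOOM_ACCESSES_PER_OP = 1

# Hypothetical key count, not taken from any real system.
NUM_KEYS = 1_000_000_000


def footprint_bytes(num_keys: int, bits_per_key) -> int:
    """Bytes needed to hold `num_keys` entries at `bits_per_key` bits each."""
    return math.ceil(num_keys * Fraction(bits_per_key) / 8)


def total_accesses(num_ops: int, accesses_per_op: int) -> int:
    """Random memory accesses incurred by `num_ops` operations."""
    return num_ops * accesses_per_op


phvec_bytes = footprint_bytes(NUM_KEYS, PHVEC_BITS_PER_KEY)   # 350,000,000
bloom_bytes = footprint_bytes(NUM_KEYS, BLOOM_BITS_PER_KEY)   # 750,000,000

# Assumed workload: one operation per key.
phvec_accesses = total_accesses(NUM_KEYS, PHVEC_ACCESSES_PER_OP)
bloom_accesses = total_accesses(NUM_KEYS, BLOOM_ACCESSES_PER_OP)
```

Under these assumptions the PHVEC occupies less than half the memory of the Bloom filter but may incur up to four times as many random memory accesses, which is the cost the optimizations are meant to reduce.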
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. EMC, Data Domain, Data Domain Restorer, and Data Domain Boost are trademarks of Dell EMC Corporation.