Computer storage devices comprise a variety of storage technologies that can be divided into two groups: block-addressable “disks,” such as solid-state drives (SSD) based on NAND or other non-volatile memory (NVM) technology, hard-disk drives (HDD), compact discs (CD), and the like; and byte-addressable “memories,” such as dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM), resistive random access memory (RRAM), 3D cross point, and the like. Generally, data is moved from disk to memory before it is used by a processor of a computing system. For data stored in a filesystem, the filesystem or an operating system often manages this movement, resulting in a filesystem cache in memory that reflects portions of the data stored on disk.
Probabilistic filters are commonly used in data storage systems to efficiently determine whether a data item is stored in a data structure without, for example, having to load the entire data structure from disk. For example, in a key-value data storage system, a probabilistic filter can be used to determine the possible existence of a key in a key-store without having to load and search the key-store. Probabilistic filters are generally high-speed and space-efficient data structures that support set-membership tests with a one-sided error. These filters can establish that a given entry is definitely not represented in the set of entries. If the filter does not establish that the entry is definitely not in the set, the entry may or may not be in the set. To restate, negative responses (e.g., not in set) are conclusive, whereas positive responses (e.g., may be in set) incur a false positive probability (FPP). Generally, the trade-off for this one-sided error is space-efficiency. For example, some probabilistic filters, such as Cuckoo filters and Bloom filters, use approximately seven bits per entry to provide a three percent FPP, regardless of the size of the entries.
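The figure of roughly seven bits per entry at a three percent FPP can be checked against the standard sizing formula for a classic Bloom filter with an optimal number of hash functions, where bits per entry equal -ln(p) / (ln 2)^2 for a target FPP p. The following is a minimal sketch of that arithmetic only; the function names are hypothetical, and the formula applies to the basic Bloom filter, not to Cuckoo filters, whose sizing depends on fingerprint length and bucket occupancy.

```python
import math


def bloom_bits_per_entry(fpp: float) -> float:
    """Bits per entry for a classic Bloom filter using the optimal hash count.

    Standard result: m / n = -ln(p) / (ln 2)^2 for a target FPP p.
    """
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return -math.log(fpp) / (math.log(2) ** 2)


def optimal_hash_count(bits_per_entry: float) -> int:
    """Number of hash functions that minimizes FPP: k = (m / n) * ln 2, rounded."""
    return max(1, round(bits_per_entry * math.log(2)))


if __name__ == "__main__":
    bits = bloom_bits_per_entry(0.03)
    print(f"bits per entry at 3% FPP: {bits:.2f}")
    print(f"hash functions: {optimal_hash_count(bits)}")
```

Note that the entry size does not appear anywhere in the calculation, which is why the space cost is independent of how large the stored entries are.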
There are a variety of probabilistic filters, including Cuckoo filters and Bloom filters, the operation of which is provided here for illustrative purposes. Cuckoo filters operate by inserting an f-bit fingerprint of a key into one of two buckets. The first bucket is selected by a hash of the key and the second bucket is derived by hashing the fingerprint. If both buckets are full, an existing fingerprint is removed to make space, and then that fingerprint is moved to its own alternate bucket. Locating a key involves inspecting the two candidate buckets for that key to determine whether its fingerprint is present. The basic Bloom filter comprises an array (e.g., Bloom filter array) of M bits (initialized to an empty value, such as zero) and k different hash functions that each map a set element to one of the M bits, resulting in a k-bit representation of the set element in the Bloom filter. When an element is added to the filter, each of the bits in the array corresponding to the hash functions is set to one. To determine the presence of the element (e.g., performing a Bloom filter query or a Bloom query), the same hash functions are applied to determine the corresponding locations in the array for the queried element. If every location has a value of one, as opposed to zero, then the element may be in the set. If any location has a value of zero, then the element is not in the set.