1. Technical Field
This disclosure relates generally to data processing, and more specifically, to methods and systems for generating and managing a cryptographic hash database.
2. Description of Related Art
The approaches described in this section could be pursued but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
A trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike in a binary search tree, no node in a trie stores the key associated with that node; instead, the node's position in the tree defines the key with which it is associated. All the descendants of a node share a common prefix, namely the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and with those inner nodes that correspond to keys of interest. Tries are very fast tree-based data structures for managing strings in memory, but they are space-intensive.
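By way of illustration only, the trie described above can be sketched as follows. This is a minimal sketch and not part of any referenced work; the names TrieNode, Trie, insert, get, and keys_with_prefix are hypothetical, and a per-node dictionary is assumed for the child links.

```python
class TrieNode:
    """One node of the trie; its key is implied by its position, not stored."""

    def __init__(self):
        self.children = {}      # maps a single character to a child node
        self.value = None       # payload, meaningful only if has_value is True
        self.has_value = False  # True when this node corresponds to a stored key


class Trie:
    def __init__(self):
        self.root = TrieNode()  # the root is associated with the empty string

    def insert(self, key, value):
        node = self.root
        for ch in key:
            # Create the branch for this character if it does not exist yet.
            node = node.children.setdefault(ch, TrieNode())
        node.value = value
        node.has_value = True

    def get(self, key, default=None):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return default
        return node.value if node.has_value else default

    def keys_with_prefix(self, prefix):
        """Return, in sorted order, every stored key that starts with prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            current, key = stack.pop()
            if current.has_value:
                found.append(key)
            for ch, child in current.children.items():
                stack.append((child, key + ch))
        return sorted(found)
```

In this sketch, inserting the keys "tea", "ten", and "to" creates a shared node for the prefix "te" that holds no value of its own, which also shows why a trie is space-intensive: one node is allocated per character of every distinct prefix.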
A burst trie is a trie that uses buckets to store key-value pairs before creating branches of the trie. When a bucket is full, it “bursts” and is turned into branches. A burst trie is almost as fast as a standard trie but reduces space by collapsing trie chains into buckets. Another benefit is that a data structure that is more efficient for small sets of key-value pairs can be used within each bucket, which can make lookups faster than in a conventional trie. Searching a burst trie involves using a prefix of the query string to identify a particular bucket and then using the remainder of the query string to find a record in that bucket. Initially, a burst trie consists of a single bucket. When a bucket is deemed to be inefficient, it is burst: it is replaced by a trie node and a set of child buckets that partition the original bucket's strings. Although fast, the burst trie is not cache-conscious. Like many in-memory data structures, it is efficient in a setting where all memory accesses are of equal cost. In practice, however, a single random access to memory typically incurs many hundreds of clock cycles.
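The bursting behavior described above can be sketched as follows, for illustration only. This sketch makes several assumptions that are not taken from any referenced work: the names Bucket, BurstNode, BurstTrie, and BUCKET_LIMIT are hypothetical; a bucket is modeled as a dictionary from key suffix to value rather than as a linked list; and a bucket is deemed inefficient simply when it holds more than a fixed number of entries.

```python
BUCKET_LIMIT = 4  # hypothetical threshold; real burst heuristics may differ


class Bucket:
    """Holds key suffixes (with the trie-path prefix removed) and their values."""

    def __init__(self):
        self.items = {}  # assumption: a dict stands in for the bucket structure


class BurstNode:
    """Trie node created when a bucket bursts."""

    def __init__(self):
        self.children = {}      # first character -> BurstNode or Bucket
        self.value = None       # value for a key that ends exactly at this node
        self.has_value = False


class BurstTrie:
    def __init__(self, limit=BUCKET_LIMIT):
        self.limit = limit
        self.root = Bucket()    # initially the structure is a single bucket

    def insert(self, key, value):
        self.root = self._insert(self.root, key, value)

    def _insert(self, slot, suffix, value):
        if isinstance(slot, Bucket):
            slot.items[suffix] = value
            # Burst the bucket once it exceeds the size threshold.
            return self._burst(slot) if len(slot.items) > self.limit else slot
        if suffix == "":
            slot.value, slot.has_value = value, True
            return slot
        first, rest = suffix[0], suffix[1:]
        child = slot.children.get(first)
        if child is None:
            child = Bucket()
        slot.children[first] = self._insert(child, rest, value)
        return slot

    def _burst(self, bucket):
        # Replace the bucket with a trie node whose child buckets partition
        # the original suffixes by their first character.
        node = BurstNode()
        for suffix, value in bucket.items.items():
            self._insert(node, suffix, value)
        return node

    def get(self, key, default=None):
        slot, suffix = self.root, key
        # Consume the prefix along trie nodes until a bucket is reached.
        while isinstance(slot, BurstNode):
            if suffix == "":
                return slot.value if slot.has_value else default
            slot = slot.children.get(suffix[0])
            if slot is None:
                return default
            suffix = suffix[1:]
        # The remainder of the query string locates the record in the bucket.
        return slot.items.get(suffix, default)
```

With a limit of two, for example, inserting "car" and "cat" leaves the structure as a single bucket, and inserting "dog" bursts it into a trie node with one child bucket for "c" (holding the suffixes "ar" and "at") and one for "d" (holding "og").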
Although space-intensive, tries can be cache-conscious. Trie nodes are small, which improves the probability that frequently accessed trie paths reside within cache. The burst trie, however, represents buckets as linked lists, which are known for their cache inefficiency. When traversing a linked list, the address of a child cannot be known until the parent is processed. Known as the pointer-chasing problem, this hinders the effectiveness of hardware prefetchers, which attempt to reduce cache misses by anticipating and loading data into cache ahead of the running program.
“HAT-trie: A Cache-conscious Trie-based Data Structure for Strings” is a publication by Nikolas Askitis and Ranjan Sinha, which is incorporated herein by reference in its entirety. The publication describes burst-trie algorithms for variable-length strings but does not describe how these variable-length strings are handled. Additionally, the publication describes algorithms and data structures that are cache-conscious but does not provide for improved efficiency of burst-trie algorithms.
Furthermore, none of the existing data structures allow for handling datasets exceeding the size of the available RAM.