Field of the Disclosure
This disclosure relates generally to computing devices that implement lookup tables, and more particularly to systems and methods for implementing low latency lookup tables using hardware circuitry to compute hash functions that perform multiplication with sparse bit matrices.
Description of the Related Art
Computer networking devices such as routers, switches, and network interface cards commonly rely on lookup tables in hardware circuitry to quickly access information associated with incoming data packets for purposes such as routing, filtering, or load-balancing. Lookup tables for network applications allow for the fast retrieval of data values associated with a key, where the key is a bit string that can be found in or computed based on data received in incoming packets. Lookup tables can map a set of such keys to a set of addresses in a memory holding data associated with the keys.
Many existing hardware approaches focus on lookup tables that solve the longest-prefix match problem, specifically for IP routing applications. Such approaches typically assume fixed key sizes and a static/fixed set of tables with fixed-size entries, and they typically emphasize high lookup rates over low latency for individual lookups. For example, some traditional hardware implementations of lookup tables include content-addressable memories (CAMs) or, more specifically, ternary content-addressable memories (TCAMs). CAMs are dedicated hardware circuits combining memory locations for key entries with comparator logic such that a given input key can be quickly compared to all key entries stored in the memory in parallel. If an input key is found, the CAM either directly returns data associated with the key or the index of the memory location the matching key is stored in. This index can then be used to access data associated with the key, for example, by using the index as an address into a separate static random access memory (SRAM) or a dynamic random access memory (DRAM).
TCAMs allow key entries to not only use bit values of 0 and 1, but a third, “don't care” value, X. A value of X specifies that the corresponding bit position is not to be compared to the input key, but is to be considered a match. Some applications require that, in case of multiple matches, the entry with the longest sequence of matching, non-X bits starting from the most significant bit, commonly known as the longest-prefix match, to be the entry that is returned. While TCAMs offer low access latencies, their memory capacity is generally lower than the capacities offered by standard SRAMs of equal chip size. This is largely due to the added comparator logic per memory location. Furthermore, the power consumption of TCAMs tends to be high, and the hardware design dictates a maximum key size.
Some more recent hardware implementations of large lookup tables targeted at solving the longest-prefix match problem leverage standard SRAM technology. These implementations often utilize tree-based data structures such as TRIEs (which are also known as digital trees or prefix trees) stored in SRAMs. In some existing implementations, TRIEs map the digits of the keys to nodes in a tree structure such that the lookup of a key is done by traversing the tree from its root to its leaf nodes, such that at every node, the next digit in the input key determines the next-level node until a leaf node is reached. The traversal of the tree for key lookups may require multiple accesses to SRAM memory. For example, for m-bit keys, TRIEs require O(m) memory accesses in the worst case. By using multiple SRAMs and techniques such as pipelining, tree-based implementations can match or exceed the lookup rates offered by TCAMs. On the other hand, approaches that depend on multiple SRAM accesses commonly lead to significantly higher latencies for individual key lookups.
Existing software approaches, including software algorithms for evaluating perfect hash tables, provide more flexibility than existing hardware approaches in terms of the number of tables, key sizes, and data entry sizes. However, these algorithms are typically designed for sequential processing (e.g. as a sequence of processor instructions), and do not lead to efficient, parallel circuit implementations. For example, software techniques for fast lookups commonly include data structures such as hash tables or, more specifically, perfect hash tables. However, existing software algorithms for key lookups typically do not yield practical hardware implementations of lookup tables, as they often require long sequences of steps, sequential integer arithmetic, and/or conditional processing, i.e., properties that do not allow for efficient parallel or pipelined processing in hardware.
An existing FPGA-based lookup circuit applies the techniques of Cuckoo Hashing to look up keys in a table pattern. This circuit uses a two-level table to accommodate variable-length patterns. One form of “universal hashing” that has been described computes a hash function of a bit-string by multiplying the bit string, regarded as a bit vector by a matrix of bits in order to compute a linear transformation of the bit vector. One class of hash functions that has been described relies on combining the results of two or more primary hash functions, with the primary hash functions being regarded as mapping a set of keys into a graph or hypergraph.