A multi-bit trie (mtrie) is a tree data structure that is predominantly used for longest prefix match of a given key (e.g. IP address) to obtain the associated value (e.g. route or next-hop). At its simplest, each node in the mtrie is one of two types: (1) a leaf or (2) an mtrie node. A leaf node, as the name suggests, stores the value and terminates the search. Often, mtrie nodes store a stride size S indicating the number of bits from the remaining portion of the key to process in order to determine which branch to take. The number of possible branches is 2^S, and the S key bits provide the index of the child node.
Instead of storing all 2^S pointers (to the child branches), the mtrie nodes are often optimized for space. This is done by storing the children of a given mtrie node in a contiguous array (mtrie block), with each mtrie node storing a base pointer to the start of the mtrie block that holds its children. Given the base address (BA) of the mtrie block, the size of each mtrie node (SZ) in the mtrie block, and the index of the child node (I), one can easily compute the memory address (AD) of the child node using the formula: AD(I)=BA+SZ*I.
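The address computation can be illustrated with a minimal sketch. The function name, the base address 0x100, and the node size of 4 bytes are assumptions chosen for illustration only; none of them is specified by the description above.

```python
def child_address(base_address: int, node_size: int, index: int) -> int:
    """Return AD(I) = BA + SZ * I for a child stored in a contiguous mtrie block."""
    return base_address + node_size * index


# Hypothetical values: a block at 0x100 holding 4-byte nodes, stride S = 3.
stride = 3
block_entries = 2 ** stride               # 2^S children in the mtrie block
address = child_address(0x100, 4, 0b100)  # child selected by key bits 100
```

With these assumed values the block has 8 entries and the child at index 100 (binary) resides at 0x110, so a parent needs only the one base pointer rather than 8 child pointers.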
FIG. 1 is a block diagram illustrating a 3 level mtrie according to the prior art. Level 0 100 is the root node and has a stride of 3 bits and a base pointer Ptr1 pointing to its child mtrie block, level 1A 110. Notice that the children of the root at level 1A 110 are stored in a contiguous mtrie block as an array of 2^3 entries (0 to 7). The entry at index 001 is shown as having a stride of 3 and a base pointer of Ptr2 pointing to the base address of level 2A 120. The entry at index 011 of level 1A 110 is shown as having a stride of 2 and a base pointer of Ptr3 pointing to the base address of level 2B 121. The entry at index 110 of level 1A 110 is shown as having a stride of 1 and a base pointer of Ptr4 pointing to the base address of level 2C 122.
A lookup in the mtrie starts at the root with the supplied key. At each intermediate mtrie node, a portion of the key (as specified by the stride) is consumed to determine the next node, and so on. Finally, once a leaf node is reached, the lookup terminates with the value stored in the leaf. FIG. 1 illustrates a lookup of a key in the mtrie using four steps, circled as 1-4. Step 1 shows that a lookup of the key 001100 is being performed. The stride in level 0 100 is three, so the lookup uses the three most significant bits to index the child mtrie block, level 1A 110, at index 001. Step 2 indexes level 1A 110, accessing an mnode at index 001 pointing to level 2A 120 and showing a stride of three. Step 3 uses the stride of three and Ptr2 to index level 2A 120 at index 100. Step 4 accesses a leaf node at index 100 which holds the value associated with key 001100.
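The four steps of FIG. 1 can be sketched as a short walk over an in-memory model. This is an illustrative sketch only: the class names, the use of Python lists in place of base pointers, the placeholder leaf value, and the empty (None) entries for branches that FIG. 1 does not describe are all assumptions, and the prefix expansion needed for full longest prefix match is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Leaf:
    value: str  # e.g. a route or next-hop


@dataclass
class MNode:
    stride: int                                   # key bits consumed at this node
    block: List[Optional[Union["MNode", Leaf]]]   # child mtrie block, 2^stride entries


def lookup(root: MNode, key_bits: str) -> Optional[str]:
    """Walk from the root, consuming `stride` key bits per mtrie node."""
    node: Optional[Union[MNode, Leaf]] = root
    pos = 0
    while isinstance(node, MNode):
        chunk = key_bits[pos:pos + node.stride]
        if len(chunk) < node.stride:
            return None                  # key exhausted before reaching a leaf
        pos += node.stride
        node = node.block[int(chunk, 2)]  # the key bits are the child index
    return node.value if isinstance(node, Leaf) else None


# Only the path traced in FIG. 1 is populated; the leaf value is a placeholder.
level_2a = [None] * 8
level_2a[0b100] = Leaf("value for key 001100")
level_1a = [None] * 8
level_1a[0b001] = MNode(stride=3, block=level_2a)
root = MNode(stride=3, block=level_1a)

result = lookup(root, "001100")
```

Here the lookup consumes bits 001 at the root, then bits 100 at the level 1A entry, and returns the placeholder value held by the leaf at index 100 of level 2A.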
Memory technology is such that each memory device is organized into a set of banks (e.g. 4 or 8). A subset of bits from the address is chosen as the bank selector when the device is initially configured. In general, the bank selector bits are chosen such that the first chunk (e.g. first 8, first 16, or first 32 bytes) is assigned to the first bank, the next chunk is assigned to the next bank, and so on. The term striping is also used to describe the size of the chunks and how the addresses are distributed across the different banks. Apart from the memory technology, the number of banks and the striping size (chunk size) are also a function of the memory controller that manages the memory.
FIG. 2 is a block diagram illustrating a 4 level mtrie stored in 4 banks of memory according to the prior art. FIG. 2 shows four banks of memory, banks 0-3, in four columns. The memory addresses start at 0x000 at the first memory chunk in bank 0 and increment from left to right such that the first memory chunk of bank 1 is at address 0x010, the first memory chunk of bank 2 is at address 0x020, and the first memory chunk of bank 3 is at address 0x030. The second memory chunk of bank 0 is at address 0x040. This addressing scheme continues to the last, seventh, memory chunk of bank 0 being at address 0x180, the last chunk of bank 1 being at address 0x190, the last chunk of bank 2 being at address 0x1A0, and the last chunk of bank 3 being at address 0x1B0. Thus, an array allocated through contiguous addresses spanning more than 16 bytes (the width of one memory chunk in FIG. 2) would span at least two memory banks.
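The striping of FIG. 2 can be expressed as simple integer arithmetic on the address. The sketch below assumes the 16-byte chunks and 4 banks of FIG. 2; the function names are hypothetical, and a real controller would typically select the corresponding address bits in hardware.

```python
CHUNK_BYTES = 16  # striping size in FIG. 2
NUM_BANKS = 4     # banks 0-3 in FIG. 2


def bank_of(address: int) -> int:
    """Bank holding the given byte address under chunk-by-chunk striping."""
    return (address // CHUNK_BYTES) % NUM_BANKS


def chunk_in_bank(address: int) -> int:
    """1-based position of the chunk within its bank (first, second, ...)."""
    return address // (CHUNK_BYTES * NUM_BANKS) + 1


# Addresses named in the description of FIG. 2.
first_chunks = [bank_of(a) for a in (0x000, 0x010, 0x020, 0x030)]
last_chunks = [(bank_of(a), chunk_in_bank(a)) for a in (0x180, 0x190, 0x1A0, 0x1B0)]
```

Under these assumptions, byte 0x00F falls in bank 0 while byte 0x010 already falls in bank 1, which is why any contiguous array longer than one chunk touches at least two banks.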
In FIG. 2, a four level mtrie begins with a root node level 0 200 stored in the first chunk of bank 0 at address 0x000. The root node points to level 1 210, which starts at the second memory chunk of bank 1, address 0x050, and spans to the second memory chunk of bank 3, ending at address 0x07F. An mtrie node in level 1 210 points to level 2A 220, stored at the fifth chunk of bank 0 and spanning to the fifth chunk of bank 1, at addresses 0x100 through 0x11F. Another mtrie node in level 1 210 points to level 2B 221, stored at the fifth chunk of bank 2 and spanning to the fifth chunk of bank 3, at addresses 0x120 through 0x13F. An mtrie node in level 2A 220 points to level 3A 230 in the seventh chunk of bank 0 at address 0x180. An mtrie node in level 2B 221 points to level 3B 231 in the seventh chunk of bank 2 and spanning to the seventh chunk of bank 3, at addresses 0x1A0 through 0x1BF.
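The bank placement of each mtrie block in FIG. 2 can be checked with a short sketch. The address ranges are taken from the description above, except that the end addresses of level 0 200 and level 3A 230 are not stated and are assumed here to lie within a single 16-byte chunk; the helper name is hypothetical.

```python
CHUNK_BYTES = 16
NUM_BANKS = 4

# (first byte, last byte) of each mtrie block in FIG. 2.
LEVELS = {
    "level 0 200": (0x000, 0x00F),   # end address assumed (one chunk)
    "level 1 210": (0x050, 0x07F),
    "level 2A 220": (0x100, 0x11F),
    "level 2B 221": (0x120, 0x13F),
    "level 3A 230": (0x180, 0x18F),  # end address assumed (one chunk)
    "level 3B 231": (0x1A0, 0x1BF),
}


def banks_touched(first: int, last: int) -> list:
    """Sorted list of banks covered by the inclusive byte range [first, last]."""
    chunks = range(first // CHUNK_BYTES, last // CHUNK_BYTES + 1)
    return sorted({chunk % NUM_BANKS for chunk in chunks})


placement = {name: banks_touched(*span) for name, span in LEVELS.items()}
```

With these ranges, level 1 210 covers banks 1-3, level 2A 220 covers banks 0-1, and levels 2B 221 and 3B 231 each cover banks 2-3, so which bank a given lookup step hits depends on the child index selected by the key.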
Another aspect that defines a memory device (and the controller) is the maximum transaction rate. The maximum transaction rate is the number of accesses (reads and writes) which can be performed per second. A memory device has an aggregate maximum transaction rate and a per bank maximum transaction rate. The following table provides an exemplary comparison of two on-chip memory technologies, static random access memory (SRAM) and embedded dynamic random access memory (eDRAM), and two off-chip memory technologies, reduced latency dynamic random access memory (RL-DRAM) and double data rate synchronous dynamic random access memory (DDR-SDRAM).
Memory      Number     Striping   Total   Throughput Per Bank         Throughput Aggregate
Type        of Banks   Size       Size    (transactions per second)   (transactions per second)
SRAM        1          16 bytes   128KB   750                         750
eDRAM       8          16 bytes   1MB     150                         600
RL-DRAM     8          16 bytes   64MB    60                          480
DDR-SDRAM   8          32 bytes   1GB     20                          160
Even though a memory device (with b banks) is rated for an aggregate throughput of M transactions per second, the effective throughput achieved depends on how evenly the accesses are distributed across the b banks. For example, the eDRAM device is shown to have an aggregate throughput of 600 tps for the device as a whole and a per bank throughput of 150 tps. This means that, to realize the maximum throughput provided, the accesses to this eDRAM need to be spread over at least 4 of the 8 banks.
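The eDRAM example can be reproduced with a small calculation. The sketch below uses the per bank and aggregate figures from the table; the throughput model (the busiest bank saturates first, and the device never exceeds its aggregate rating) and the function names are simplifying assumptions made for illustration, not a characterization of any particular device.

```python
import math

# (number of banks, per bank tps, aggregate tps) from the table above.
DEVICES = {
    "SRAM": (1, 750, 750),
    "eDRAM": (8, 150, 600),
    "RL-DRAM": (8, 60, 480),
    "DDR-SDRAM": (8, 20, 160),
}


def min_banks_for_aggregate(per_bank: int, aggregate: int) -> int:
    """Fewest banks that must share the load to reach the aggregate rating."""
    return math.ceil(aggregate / per_bank)


def effective_throughput(per_bank: int, aggregate: int, shares: list) -> float:
    """Assumed model: the bank receiving the largest share of accesses is the
    bottleneck, and the total is capped by the aggregate rating."""
    return min(aggregate, per_bank / max(shares))


banks, per_bank, aggregate = DEVICES["eDRAM"]
needed = min_banks_for_aggregate(per_bank, aggregate)
spread_over_four = effective_throughput(per_bank, aggregate, [0.25] * 4 + [0.0] * 4)
single_bank = effective_throughput(per_bank, aggregate, [1.0] + [0.0] * 7)
```

Under this model the eDRAM device reaches its 600 tps rating once accesses are spread evenly over 4 banks, and is limited to the rate of one bank when every access targets the same bank.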
In the worst case, if all the accesses target a single bank, then the effective throughput will be that of a single bank (M/b). The term bank collision is used to indicate that the accesses to a memory device are unequally distributed across the banks. Since bank collisions pull down the performance of the memory device, they are undesirable.
Existing schemes for avoiding bank collisions rely on randomness and statistical distribution to provide an even distribution of accesses across the banks. The randomization can be performed at memory allocation time so that there is no regularity in which addresses are assigned to which nodes. If multiple data structures are mapped to the same memory device, randomizing the allocation for each of them reduces the probability of an uneven distribution across banks. Hashing is also used to further scramble how the banks are picked given the memory addresses themselves.
In the context of a network processing unit (NPU) used in a packet forwarding application, and specifically an mtrie used for internet protocol (IP) address lookups, the lookups are generally keyed off some packet attributes (e.g., source IP address). Each of the lookups in turn translates into a sequence of memory accesses depending on the key used for the lookup. Hence, once the data structure has been set up, the access pattern depends on the traffic mix. This is where additional assumptions on randomness and statistical distribution of accesses come into play.
However, the above techniques do not completely eliminate the possibility of bank collisions. Therefore, while the reliance on randomness and statistical distribution is good enough for most real world applications, it may not be appropriate for all scenarios (when one or more of these assumptions are invalid).
In the context of an NPU some data structures are placed on chip for raw performance reasons. On-chip memory tends to be small in size. So given the size limitation, such memories can only accommodate one or two data structures or in some cases only a portion of a larger data structure. Hence on-chip memory tends not to get the benefit of randomization in memory allocation to avoid bank collisions. Also, solutions that depend on randomness are not suitable in some hardware implementations where there is a need to strictly budget for and guarantee performance.