The present invention relates to a method and device for searching an ordered database containing key entries, and, more particularly, to a method and device for searching a monotonically-ordered database using transformed keys.
It is known that a large storage capacity is required for data packet classification and forwarding, in which large amounts of information must be stored in the information base. Storage space limitations affect all state-of-the-art Associative Search Engines (ASEs), including Content Addressable Memories (CAMs) such as Binary CAMs and Ternary CAMs. Storage space limitation is also a key issue in the search engine technologies of HyWire Ltd.
Searching techniques typically require repeated accesses or probes into the memory storage in order to perform key comparisons. In large storage and retrieval systems, such searching, even if augmented by efficient search algorithms such as a binary search or higher-order B-tree searches or prefix B-tree searches, often requires an excessive amount of time (clock cycles).
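By way of illustration only, the repeated probes mentioned above can be counted in a minimal Python sketch of a binary search. The function name binary_search_probes and the choice to count one probe per key comparison are assumptions made for this example.

```python
def binary_search_probes(sorted_keys, key):
    """Return (index, probes); index is -1 when the key is absent."""
    lo, hi = 0, len(sorted_keys) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1  # one memory access plus one key comparison
        if sorted_keys[mid] == key:
            return mid, probes
        if sorted_keys[mid] < key:
            lo = mid + 1   # continue in the upper half
        else:
            hi = mid - 1   # continue in the lower half
    return -1, probes


keys = list(range(0, 32, 2))            # 16 sorted keys: 0, 2, ..., 30
print(binary_search_probes(keys, 14))   # found on the first probe
print(binary_search_probes(keys, 0))    # needs several probes
```

The number of probes grows with the logarithm of the number of entries, so each search of a large database costs several sequential memory accesses.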
Another well-known and generally faster method for storing and retrieving information from computer storage involves the use of so-called “hashing” techniques. In a system using hashing, the key is operated upon by an operator to produce a storage address in the storage space. The operator is called a hashing function, and the storage space is called a hash table. The storage address is then used to access the desired storage location directly with fewer storage accesses or probes than sequential or binary searches. Hashing techniques are described in the classic text by D. Knuth entitled The Art of Computer Programming, Volume 3, in “Sorting and Searching”, pp. 506-549, Addison-Wesley, Reading, Mass. (1973), and more recently, in the contemporary classic text of R. Sedgewick entitled Algorithms in C++, pp. 231-243, Addison-Wesley, Reading, Mass. (1992).
Hashing functions are designed to translate the universe of keys into addresses uniformly distributed throughout the hash table. Typical hashing operations include truncation, folding, transposition and modulo arithmetic. A disadvantage of hashing techniques is that more than one key can translate into the same storage address, causing “collisions” in storage or retrieval operations. Some form of collision-resolution strategy must therefore be provided. For example, the simple strategy of searching forward from the initial storage address to the first empty storage location will resolve the collision. This technique is called linear probing. If the hash table is considered to be circular so that addresses beyond the end of the table map back to the beginning of the table, then the linear probing is done with “open addressing,” i.e., with the entire hash table as overflow space in the event that a collision occurs.
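Linear probing with open addressing, as described above, may be sketched as follows. This is a minimal Python illustration, not a definitive implementation: the names lp_insert and lp_lookup, the use of integer keys, and the modulo hashing function are assumptions made for the example.

```python
EMPTY = None  # marker for an unused storage location


def lp_insert(table, key):
    """Insert key by linear probing; return the number of probes used."""
    m = len(table)
    start = key % m                   # hashing by modulo arithmetic
    for i in range(m):
        slot = (start + i) % m        # the table is treated as circular
        if table[slot] is EMPTY or table[slot] == key:
            table[slot] = key
            return i + 1
    raise RuntimeError("hash table is full")


def lp_lookup(table, key):
    """Return (slot, probes); slot is -1 when the key is absent."""
    m = len(table)
    start = key % m
    for i in range(m):
        slot = (start + i) % m
        if table[slot] is EMPTY:      # an empty location ends the search
            return -1, i + 1
        if table[slot] == key:
            return slot, i + 1
    return -1, m


table = [EMPTY] * 7
for k in (10, 17, 24):                # all three keys hash to address 3
    print(k, "stored after", lp_insert(table, k), "probe(s)")
```

Each additional key that collides at the same address requires one more probe than the previous key, which illustrates how the number of memory accesses depends on the contents of the table.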
An alternative to linear probing is a technique commonly referred to as “double hashing” or “multiple hashing”. When more than one key translates into the same storage address using the first hash function, the collision can be resolved by selecting a different hash function and “rehashing” those keys (that had returned identical results using the first hash function) in order to differentiate between them. Of course, there is a finite probability that more than one key will translate into the same storage address using the second hash function, in which case the new collision can be resolved by selecting a (different) third hash function and “rehashing” those keys once again in order to differentiate between them. This process can be repeated until all collisions have been resolved. According to Sedgewick, double hashing uses fewer probes, on the average, than linear probing. Sedgewick cites several examples of improved hashing methods, but cautions against ‘premature use of advanced methods except by experts with serious searching applications, because separate chaining and double hashing are simple, efficient, and quite acceptable for most applications.’
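For illustration, the following Python sketch shows the common textbook form of double hashing, in which a second hash function determines the probe step after a collision. This form is an assumption made for the example and is somewhat narrower than the repeated rehashing described above; the names dh_insert and dh_lookup and both hash functions are hypothetical.

```python
def dh_insert(table, key):
    """Insert key by double hashing; return the number of probes used."""
    m = len(table)                    # assumed prime, so any step visits all slots
    start = key % m                   # first hash function
    step = 1 + (key % (m - 1))        # second hash function; never zero
    for i in range(m):
        slot = (start + i * step) % m
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return i + 1
    raise RuntimeError("hash table is full")


def dh_lookup(table, key):
    """Return (slot, probes); slot is -1 when the key is absent."""
    m = len(table)
    start = key % m
    step = 1 + (key % (m - 1))
    for i in range(m):
        slot = (start + i * step) % m
        if table[slot] is None:
            return -1, i + 1
        if table[slot] == key:
            return slot, i + 1
    return -1, m


table = [None] * 7
for k in (10, 17, 24):                # all three keys collide at address 3
    print(k, "stored after", dh_insert(table, k), "probe(s)")
```

Because keys that collide under the first hash function generally receive different probe steps, they tend to separate after one additional probe rather than queuing behind one another. The number of probes nevertheless remains dependent on the data.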
One area in which multiple hashing is less effective or even problematic is network applications. Although the average speed is an important parameter in such applications, a more important and often overriding requirement is a highly predictable, deterministic operation. For example, voice and video recordings can be transmitted as data via the Internet using a digital data channel. The Internet network utilizes routers to direct the data from the sending address to the destination address. Routers using multiple hashing routines to locate the destination address and deliver these data packets will have a characteristically high variance in the time required to locate the address. In most cases, typically about 70%-80% of the time, the multiple hashing technique will locate the destination address in the first memory access. However, about 20%-30% of the time, a second memory access is required. Often, a third, fourth or fifth memory access is required in order to locate the address. In the case of voice transmission, a high variance of this kind results in a broken up, non-uniform sound message. These disturbances are often referred to as “jitter”.
U.S. Pat. No. 6,434,662 to Greene, et al., discloses a system and method for searching an associative memory using input key values and first and second hashing functions. After a first hash function, the hash-based associative system allows for the selection of a second hash function that has been pre-computed at table build time to be perfect with respect to a small set of colliding key values, provides a deterministic search time independent of the number of table entries or width of the search key, and allows for pipelining to achieve highest search throughput.
Although the deterministic search time is of advantage, the pre-computing to identify the second hash function is laborious. Moreover, the pre-computing must be redone, inter alia, each time that an entry is added to or removed from the database.
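The pre-computation burden noted above may be appreciated from the following Python sketch, which searches by trial and error for a second hash function that is perfect (collision-free) for a small set of colliding keys. The sketch is not the method of U.S. Pat. No. 6,434,662; the hash family ((a*key + b) mod P) mod m, the constant P, and the names second_hash and find_perfect_params are assumptions introduced solely for illustration.

```python
import random

P = 2_147_483_647  # the Mersenne prime 2**31 - 1; keys are assumed to be smaller


def second_hash(key, a, b, m):
    """Hypothetical second hash function: ((a*key + b) mod P) mod m."""
    return ((a * key + b) % P) % m


def find_perfect_params(colliding_keys, m, max_tries=10_000, seed=0):
    """Try random (a, b) until every colliding key maps to its own slot."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        a, b = rng.randrange(1, P), rng.randrange(0, P)
        slots = {second_hash(k, a, b, m) for k in colliding_keys}
        if len(slots) == len(colliding_keys):   # no two keys share a slot
            return a, b
    return None  # no perfect function found within the trial budget


colliding = [12, 29, 46, 63]           # all four keys are congruent to 12 modulo 17
params = find_perfect_params(colliding, m=8)
print("perfect second hash parameters:", params)
```

In such a scheme, the search would have to be repeated whenever a key is added to or removed from the colliding set, since parameters that are perfect for one set of keys need not remain perfect for another.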
Moreover, while hashing methods are suitable for exact search applications, hashing methods are inherently inappropriate for range search applications.
Also known in the art are Prefix B-trees, in which each node is searched in the same manner as a B-tree, but each key Ki in a Prefix B-tree is not a full key but is a prefix to a full key. The keys Ki of each node in any subtree of a Prefix B-tree all have a common prefix, which is stored in the root node of the subtree, and each key Ki of a node is the common prefix of all nodes in the subtree depending from the corresponding branch of the node. In a binary variant of the Prefix B-Tree, referred to as a Prefix Binary Tree, each node contains only one branch key and two branches, so that there are only two (“binary”) branches from any node. The Prefix Binary Tree is searched in the same manner as a Binary Tree, that is, branching left or right depending on whether the search key is less than or greater than the node key. There are also Bit Tree variants of the Prefix Binary Tree wherein distinction bits rather than prefixes are stored in the nodes. In particular, the values stored are the numbers of the bits in the keys that are different between two prefixes, thus indicating the key bits to be tested to determine whether to take the right or left branches.
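The distinction bits used in the Bit Tree variants may be illustrated by the following minimal Python sketch. It assumes fixed-width integer keys with bit 0 taken as the most significant bit; the names distinction_bit and branch_right are hypothetical and are introduced only for this example.

```python
def distinction_bit(key_a, key_b, width):
    """Number of the first bit (0 = most significant) at which two keys differ.

    Returns None when the keys are identical.
    """
    diff = key_a ^ key_b              # set bits mark the differing positions
    if diff == 0:
        return None
    return width - diff.bit_length()


def branch_right(search_key, bit_index, width):
    """Test the single key bit named by a stored distinction-bit number."""
    return ((search_key >> (width - 1 - bit_index)) & 1) == 1


bit = distinction_bit(0b1010, 0b1001, width=4)
print("distinction bit:", bit)
print("0b1010 branches right:", branch_right(0b1010, bit, width=4))
print("0b1001 branches right:", branch_right(0b1001, bit, width=4))
```

Only the bit number is stored in the node, so the branch decision requires testing one bit of the search key rather than comparing the search key against a full key or prefix.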
It may thus be summarized that in the various types of Prefix Trees, a compression-like scheme is used to reduce the size of the entries stored in the tree. The key-compression approach has the benefit that the entire key value can be constructed without accessing data records or de-referencing pointers.
However, as noted by Bohannon, et al., in “Main-Memory Index Structures with Fixed-Size Partial Keys” (Mar. 28, 2001): “typical compression schemes such as employed in prefix B-trees have the disadvantage that the compressed keys are variable-sized, leading to undesirable space management overheads in a small, main-memory index node. Further, depending on the distribution of key values, prefix-compressed keys may still be fairly long resulting in low branching factors and deeper trees.”
Bohannon, et al., go on to propose a partial-key approach that uses fixed-size parts of keys and information about key differences to minimize the number of cache misses and the cost of performing compares during a tree traversal, while keeping a simple node structure and incurring minimal space overhead: “A key is represented in a partial-key tree by a pointer to the data record containing the key value for the key, and a partial key. For a given key in the index, which we refer to as the index key for the purposes of discussion, the partial key consists of (1) the offset of the first bit at which the index key differs from its base key, and (2) l bits of the index key value following that offset (l is an input parameter). Intuitively, the base key for a given index key is the most recent key encountered during the search prior to comparing with the index key.”
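The construction of a partial key from the two components quoted above may be sketched in Python as follows. The sketch assumes fixed-width integer keys with offset 0 denoting the most significant bit, and it interprets “following that offset” as the bits immediately after the differing bit; both are assumptions made for illustration, as is the function name partial_key.

```python
def partial_key(index_key, base_key, width, l):
    """Return (offset, bits) for index_key relative to base_key.

    offset: position (0 = most significant) of the first differing bit.
    bits:   up to l bits of index_key immediately after that position.
    """
    diff = index_key ^ base_key
    if diff == 0:
        return width, 0               # identical keys: no differing bit
    offset = width - diff.bit_length()
    remaining = width - offset - 1    # bits available after the differing bit
    take = min(l, remaining)
    bits = (index_key >> (remaining - take)) & ((1 << take) - 1)
    return offset, bits


# The index key and base key first differ at bit 3; the next two bits of the index key are 01
print(partial_key(0b10110110, 0b10100001, width=8, l=2))
```

The result has a fixed size regardless of the key length, which is the property that distinguishes this approach from prefix compression.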
Bohannon, et al., state that “of the indexing schemes studied, partial-key trees minimize cache misses for all key sizes”. Bohannon, et al., further state that “the partial-key approach relies on being able to resolve most comparisons between the search key and an index key using the partial-key information for the index key. If the comparison cannot be resolved, the pointer to the data record is de-referenced to obtain the full index key value.” Thus, in the partial-key method taught by Bohannon, et al., cache misses during the search operation, however reduced with respect to other indexing schemes, still occur with a finite statistical probability that must be contended with. This partial-key method is thus inherently non-deterministic. Moreover, the possibility of such a cache miss renders pipelining using hardware solutions impractical, if not impossible.
There is therefore a recognized need for, and it would be highly advantageous to have, a high throughput, fully deterministic method of searching a database, a method that is efficient with regard to memory space, requires a low bandwidth, enables quick and facile maintenance of the database, and is inherently suitable for a pipelined hardware architecture.