A: Background of the Parent Invention
Data matching, and in particular prefix matching, is well known and is applied in various applications. In general, a database search is performed for data strings that are associated with a given input string or key. The association between the input string and the data strings, which constitutes the search criterion, depends on the particular application. A particular search may require locating the longest, shortest or all data strings which are a prefix of a query string. The applications in which such matching is useful are numerous and include, in particular, layer 3 and layer 4 switching in TCP/IP protocols, directory lookup in a telephone context, on-line dictionaries and spell checkers, to name just a few.
The prefix matching problem constitutes an essential part of several applications in the computer realm and related areas. The assumption in the prior art relating to these applications is that there are strings over an alphabet Σ which are ordered. The strings can have different lengths and can be prefixes of one another. The data strings are stored in a database along with other associated data.
A user may want to find the longest, shortest or all strings which are a prefix of a query string. In other applications, a user may be interested in finding all the data strings of which a given input string is a prefix. It is very important to respond to any such query in a reasonable amount of time and in as efficient a manner as possible. Each application may have its own alphabet, and the number of characters in that alphabet determines the complexity of handling these queries.
The number of hosts on the Internet grows rapidly every day. New data-intensive applications such as multimedia, hypertext data, video conferencing, remote imaging, etc., cause the data traffic to explode. These applications demand higher bandwidth on the communication line and faster and more efficient computer networks. To keep up with these demands and the traffic, the speed of communication lines has been increased to several gigabits per second in the last few years. As a result, routers must forward IP packets more efficiently. Routers search the Internet Protocol (IP) routing tables to find the address of the next hop to which a packet is to be forwarded on the path towards its final destination. Each router has its own routing table consisting of pairs of network address prefixes and their corresponding hops. The router usually must determine the longest network prefix matching a packet's destination address and take the corresponding hop. Finding the next hop for each packet becomes harder and harder because the increasing number of hosts on the Internet expands the global network and increases the number of hops to go through. Therefore, the size of the routing table grows accordingly. Increasing the speed of data links helps to shorten the time to send a packet. Advances in semiconductor technology improve the processing capability of CPU chips and can help reduce the time of the table lookup. However, because the link speed grows faster than the processing speed, and the size of the data is growing as well, the IP lookup problem has become a serious bottleneck on the information superhighway. The alphabet in this application is very limited (only {0,1}); however, the problem is very challenging.
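As a simple illustration of the longest matching prefix operation described above, the following sketch performs the lookup by linear scan over a hypothetical routing table of {0,1} prefixes; a real router would use an indexed structure rather than scanning every entry, and the table contents here are invented for illustration:

```python
def longest_prefix_match(dest_bits, routing_table):
    """Return the next hop whose network prefix is the longest
    prefix of the destination address bits."""
    best_prefix, best_hop = None, None
    for prefix, hop in routing_table.items():
        if dest_bits.startswith(prefix):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix, best_hop = prefix, hop
    return best_hop

# Hypothetical routing table: network prefix (bit string) -> next hop.
table = {"": "default", "10": "A", "1010": "B", "11": "C"}

print(longest_prefix_match("101100", table))  # "10" is the longest match -> A
print(longest_prefix_match("101011", table))  # "1010" beats "10" -> B
```

Note that both "10" and "1010" match the second address; the longest-match rule selects the more specific prefix, exactly as the router must.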
The IP lookup or layer 3 switching is not the only application of prefix matching over the {0,1} alphabet in routers. Internet Service Providers (ISPs) like to provide different services to different customers. Some organizations filter packets from the outside world by installing firewalls in order to deny access to unauthorized sources. Supporting this functionality requires packet filtering or packet classification mechanisms in layer 4 of TCP/IP protocols. Forwarding engines must be able to identify the context of packets and classify them based on their source and destination addresses, protocols, etc., or on all of this information. This classification must be performed at wire speed. Routers attempt to handle this by keeping a set of rules, each of which applies to a range of network addresses. Therefore, we again encounter the prefix matching problem, here in two-dimensional space, i.e., for the source and destination addresses of a packet.
Applications of prefix matching are not restricted to layer 3 and layer 4 switching. Some other useful applications include directory lookup in a telephone context, on-line dictionaries, spell checkers and looking up social security numbers. U.S. Pat. No. 5,758,024 discloses the prefix matching problem relating to computer speech recognition and proposes a compact encoding of a pronunciation prefix tree. A method that uses prefix matching to improve the parsing of source code is also disclosed in U.S. Pat. No. 5,812,853. The approach in that disclosure identifies previously-parsed prefixes of a source, creates parsers in the parser states corresponding to the identified prefixes, and parses the remaining portion of the translation unit. Finally, U.S. Pat. No. 4,464,650 discloses an apparatus and method using prefix matching in data compression. Data compression is crucial in database applications as well as in data communication. That patent includes parsing the input stream of data symbols into prefix and data segments, and using the previously longest matching prefixes to compress the data.
Traditionally, the prefix matching search has been performed using the trie structure. A trie is based on the “thumb-index” of a large dictionary, in which a word can be located by checking consecutive letters of a string from the beginning to the end. A trie is essentially an m-way tree in which each branch of a node corresponds to a letter or character of the alphabet Σ. A string is represented by a path from the root to a leaf node. The trie structure may be modified and applied to all of the applications discussed above. In some applications, for example in the longest prefix matching IP lookup context, researchers have been able to handle the problem in more subtle ways than the trie structure, due in part to the limited number of characters in the alphabet. These methods do not have the generality or broad applicability of the trie structure. The main problems with trie structures are their inflexibility, i.e., the number of branches corresponds to the number of characters, and the need for additional blank nodes as place holders. Furthermore, in general, the search time is proportional to the length of the input strings.
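A minimal sketch of the trie just described may be helpful; the class names are illustrative and not taken from any cited reference. Each node holds one branch per character, intermediate nodes serve as the blank place holders noted above, and a lookup visits one node per character, so search time is proportional to the length of the input string:

```python
class TrieNode:
    def __init__(self):
        self.children = {}       # one branch per character of the alphabet
        self.is_string = False   # marks the end of a stored string

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, s):
        node = self.root
        for ch in s:
            # Intermediate nodes act as blank place-holder nodes.
            node = node.children.setdefault(ch, TrieNode())
        node.is_string = True

    def contains(self, s):
        # One node per character: search time is proportional to len(s).
        node = self.root
        for ch in s:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_string

t = Trie()
for word in ["car", "cart", "cat"]:
    t.insert(word)
print(t.contains("cart"))  # True
print(t.contains("ca"))    # False: "ca" is only a place-holder path
```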
The Patricia trie modified the binary trie by eliminating most of the unnecessary nodes, and that modification is the basis of several new methods that have been proposed in the last several years. These approaches attempt to check several characters, or several bits, at each step, instead of checking only one character. Because checking several characters may degrade memory usage and leave much memory space unused, all of these approaches try to minimize the wasted memory. V. Srinivasan and G. Varghese, in “Fast Address Lookups Using Controlled Prefix Expansion”, Proceedings of ACM Sigmetrics, September 1998, proposed to expand the original prefixes (strings) into an equivalent set of prefixes with fewer distinct lengths, and then to apply a dynamic programming technique to the overall index structure in order to optimize memory usage. Other methods propose a specific case wherein local optimization of memory usage is applied at each step. This is the case in S. Nilsson and G. Karlsson's “Fast Address Look-Up for Internet Routers”, Proceedings of IEEE Broadband Communications 98, April 1998. Finally, a new scheme from Lulea University of Technology attempts to reduce the size of the data set (routing table) so that it fits in the cache of a system. See Mikael Degermark, Andrej Brodnik, Svante Carlsson and Stephen Pink's “Small Forwarding Tables for Fast Routing Lookups”, Proceedings of ACM SIGCOMM, 1997.
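The prefix expansion idea attributed above to Srinivasan and Varghese can be illustrated roughly as follows. This is a simplified sketch of expansion to a single target length, not the published algorithm; after expansion, every prefix has the same length, so a lookup becomes a single table index. Shorter prefixes are expanded first so that longer, more specific prefixes overwrite them, preserving longest-match semantics:

```python
from itertools import product

def expand(prefixes, target_len):
    """Expand every {0,1} prefix in `prefixes` (prefix -> value)
    to length target_len, longer originals taking priority."""
    expanded = {}
    # Process shortest prefixes first so longer ones overwrite them.
    for p in sorted(prefixes, key=len):
        for tail in product("01", repeat=target_len - len(p)):
            expanded[p + "".join(tail)] = prefixes[p]
    return expanded

routes = {"0": "A", "01": "B"}
print(expand(routes, 2))  # {'00': 'A', '01': 'B'}
```

Here "0" first expands to "00" and "01", and the more specific prefix "01" then overwrites its own expansion, so a lookup of "01" still returns "B".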
All of these multi-bit trie schemes are designed for the IP lookup problem and may work well with the existing size of data, the number of prefixes in the lookup table and the current IP address length, which is 32 bits. Nonetheless, these schemes generally do not scale well to larger data sets or data of longer string length, for example, the next generation of IP (IPv6) with its 128-bit addresses.
A barrier to applying well-known data structures, such as the binary search tree, to the prefix matching problem is the lack of a mechanism to sort and compare strings of different lengths when the strings are prefixes of each other. Therefore, what has been needed is a new comparison, indexing and searching method and apparatus for performing prefix matching that functions independently of the lengths of the data or input strings and is general enough in structure to apply to most, if not all, applications. Thus, a method and apparatus was needed that is generic and independent of any alphabet or character structure, while efficient in memory usage and search time.
In particular, efficient prefix trees for quickly accessing data were needed in applications which involve matching strings of different lengths over a generic alphabet Σ. In addition to exact-match queries, the tree must also allow for the following queries: (1) finding the longest string which is a prefix of a given query string; (2) finding the smallest string which is a prefix of a given query string; (3) listing all the strings which are prefixes of a given query string; and (4) finding all the strings of which a given query string is a prefix.
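Assuming the stored strings are kept in some collection, the four queries enumerated above can be sketched with naive linear scans; an actual prefix tree would answer them without touching every string, and the data set below is invented for illustration:

```python
def prefix_queries(data, q):
    """The four prefix queries over a stored collection `data`,
    answered here by linear scan for clarity."""
    prefixes = [s for s in data if q.startswith(s)]    # (3) all prefixes of q
    extensions = [s for s in data if s.startswith(q)]  # (4) strings q is a prefix of
    longest = max(prefixes, key=len, default=None)     # (1) longest prefix of q
    smallest = min(prefixes, key=len, default=None)    # (2) smallest prefix of q
    return longest, smallest, prefixes, extensions

data = ["0", "01", "0110", "0111", "10"]
longest, smallest, prefixes, extensions = prefix_queries(data, "011")
print(longest, smallest, prefixes, extensions)
```

For the query "011", the stored strings "0" and "01" are its prefixes, while "0110" and "0111" are its extensions.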
B: Background of the Present Invention
The present invention discloses two methods for multidimensional indexing. Multidimensional indexing is a crucial part of a wide range of applications, including geographical databases, image databases, spatial databases, time series, packet classification, etc. Some other applications, such as feature-based indexing or similarity matching, can be transformed into the multidimensional indexing scheme by specifying each data object with attributes. In general, when the data objects cannot be uniquely identified by a single attribute or key, we have to index them based on different keys in order to efficiently update the stored data and process queries. In traditional relational database management systems, whenever one primary key cannot uniquely identify a row, we have to use index tables based on the different keys. This is an old issue in the database community, and many data structures, such as the K-D-B-tree or the R-tree, have been proposed. However, with new and emerging multimedia and image processing applications, more efficient data access methods are needed than those provided by the traditional structures.
The size of the Internet grows continually and the data traffic on it explodes. Everybody wants to join this environment. Applications like e-commerce and on-line sales have affected our daily lives very deeply. Also, Internet users and companies have become more concerned about their privacy and security. For instance, some companies want to limit outside access to their internal resources. They may deny FTP (File Transfer Protocol) access to their computer systems. Therefore, firewalls have to recognize all FTP packets originating from outside the company. Parents may not consider the content of some sites on the World Wide Web appropriate for their children and may want to deny access to them. These protections imply filtering, and consequently packet classification. How to provide this filtering is one of the main problems for the Internet community. Filtering or packet classification is performed using rules. Each rule consists of headers identifying the packet flow, like the source and destination addresses, the source and destination ports, the protocol, etc., and the action or policy which has to be applied to the packet flow. Each packet is compared with each rule, and if the content of the packet matches the rule, the action or policy in the rule is applied to the packet.
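The rule matching just described can be sketched as a naive scan over a hypothetical rule table; the field layout, rule values and first-match-wins policy below are illustrative assumptions, not any particular firewall's format:

```python
# Hypothetical rules: each field is a concrete value, an address
# prefix, or None (wildcard). First matching rule wins.
RULES = [
    # (src_prefix, dst_prefix, protocol, action)
    ("10.0.",  None,    "ftp", "deny"),        # block outside FTP
    (None,     "10.1.", None,  "rate-limit"),  # throttle one subnet
    (None,     None,    None,  "permit"),      # default rule
]

def classify(src, dst, proto):
    """Compare the packet header with each rule in order."""
    for src_p, dst_p, p, action in RULES:
        if ((src_p is None or src.startswith(src_p)) and
                (dst_p is None or dst.startswith(dst_p)) and
                (p is None or p == proto)):
            return action

print(classify("10.0.3.7", "192.168.1.1", "ftp"))  # deny
print(classify("172.16.0.1", "10.1.2.3", "http"))  # rate-limit
```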
Linearly comparing the packet header with every rule in a database is very slow. In high speed routers, this filtering function is a bottleneck for the whole communication system. A better packet classification technique is needed in order to efficiently locate rules matching a data packet.
Also, some applications, like video on demand and multimedia, require some type of Quality of Service (QoS). Internet Service Providers (ISPs) like to provide different kinds of services for their customers. For instance, they may want to have different billing policies for different types of data flows, or they may want to reserve some bandwidth for a special company. Thus, their forwarding engines have to categorize packets based on the TCP/IP header in order to apply the company's policy or generate billing lists, because it is impossible to identify packet flows based on only one header value. Therefore, any system seeking to provide QoS has to classify packet streams based on different header values. Again, this precipitates the familiar n-dimensional indexing problem.
A few difficulties have made the packet classification problem more challenging than regular multidimensional indexing. First, as the communication line speed increases, there is a reduction in the available time to process each packet. For instance, considering the minimum Ethernet packet length, in a system with a 10 Gb/s (gigabits per second) line speed, the system is left with about 50 nanoseconds to classify and decide the fate of each packet. This small amount of time dictates that the classification search engine must be very efficient. Second, difficulties arise when different types of matching are needed (whether exact matching, prefix matching, or range matching). Unfortunately, none of the previously-existing multidimensional indexing methods can handle all of these types of matching at the same time.
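The roughly 50-nanosecond figure follows from the minimum Ethernet frame size, as a quick back-of-the-envelope calculation shows:

```python
MIN_FRAME_BITS = 64 * 8   # minimum Ethernet frame: 64 bytes = 512 bits
LINE_RATE_BPS = 10e9      # 10 Gb/s line speed

# Time on the wire per minimum-size packet, in nanoseconds.
budget_ns = MIN_FRAME_BITS / LINE_RATE_BPS * 1e9
print(budget_ns)  # 51.2 -- roughly the "about 50 ns" budget cited above
```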
To address these problems and provide an efficient packet classification system, two indexing approaches are disclosed. Both approaches are efficient enough to deal with high-speed data rates while handling all three types of matching. Subsets of the disclosed methods can be applied to regular multidimensional problems in the database realm, such as feature-based indexing or spatial databases. The idea behind the first approach is to divide the dataspace into three parts instead of two. Then, in order to avoid a high-dimensionality problem, the method always divides the dataspace based on one dimension first. If this is not possible, the scheme changes the split dimension. Therefore, the method always keeps the split values of a single dimension in the split nodes. The second approach disclosed herein uses the same technique, with the exception that it keeps a bit in each split node to indicate whether the split dimension needs to be changed. The use of these equal bits gives the dataspace the functionality of the “divide by three” method while in actuality dividing the dataspace only into two parts. This elimination of a subspace improves memory allocation.
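A rough sketch of the “divide by three” idea may help; the structure and names below are illustrative assumptions, not the disclosed apparatus itself. Each split node keeps a pivot value from a single dimension and partitions its items into less-than, equal-to and greater-than parts, changing the split dimension when a split on the current one is impossible:

```python
def build(items, dim=0, num_dims=2, tried=0):
    """Three-way split of points on one dimension at a time.
    `items` is a list of num_dims-tuples; leaves are plain lists."""
    if len(items) <= 1 or tried == num_dims:
        return items  # leaf: the remaining items
    pivot = sorted(it[dim] for it in items)[len(items) // 2]
    less = [it for it in items if it[dim] < pivot]
    equal = [it for it in items if it[dim] == pivot]
    greater = [it for it in items if it[dim] > pivot]
    if not less and not greater:
        # Every item shares this value: change the split dimension.
        return build(items, (dim + 1) % num_dims, num_dims, tried + 1)
    return {"dim": dim, "pivot": pivot,
            "lt": build(less, dim, num_dims),
            "eq": build(equal, (dim + 1) % num_dims, num_dims),
            "gt": build(greater, dim, num_dims)}

tree = build([(1, 5), (2, 3), (2, 7), (4, 1)])
print(tree["dim"], tree["pivot"])  # splits on dimension 0 at value 2
```

Points that match the pivot exactly go into the middle (“equal”) part, which is what lets a single structure serve exact, prefix and range matching; the second disclosed approach replaces this third subspace with an equal bit in the split node.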