The present invention relates to comparing and sorting data strings, and in particular, to comparing and sorting data strings of different lengths, such that the data strings can be queried using tree structures. Specifically, the invention relates to performing this process on data strings of different lengths that may be prefixes of each other.
Data matching, and in particular prefix matching, is known and applied in various applications. In general, a database search is performed for data strings which are associated with a given input string or key. The association between the input string and the data strings, which is the search criterion, depends on the particular application. The particular search may require locating the longest, shortest or all data strings which are a prefix of a query string. The applications in which such matching is useful are numerous and, in particular, include layer 3 and layer 4 switching in TCP/IP protocols, directory lookup in a telephone context, on-line dictionaries and spell checkers, to name just a few.
The prefix matching problem constitutes the essential part of some applications in the computer realm and related areas. The assumption in the prior art relating to these applications is that there are strings of an alphabet Σ which are ordered. The strings can have different lengths and can be prefixes of each other. The data strings are stored in a database along with other associated data.
A user may want to find the longest, smallest or all strings which are a prefix of a query string. In other applications, a user may be interested in finding all the data strings, such that a given input string is a prefix of them. It is very important to respond to any such query in a reasonable amount of time and in as efficient a manner as possible. Each application may have its own alphabet set and the number of characters in the alphabet handling these queries determines the complexity of the search.
The number of hosts on the Internet grows rapidly every day. New data intensive applications such as multimedia, hypertext data, video conferencing, remote imaging, etc., cause the data traffic to explode. These applications demand higher bandwidth on the communication line and faster and more efficient computer networks. To keep up with these demands and the traffic, the speed of communication lines has been increased to several gigabits per second in the last few years. As a result, routers must forward IP packets more efficiently. Routers search the Internet Protocol (IP) routing tables to find the address of the next hops (or hubs) to which the packet is to be forwarded on the path towards the final destination. Each router has its own routing table consisting of pairs of prefixes of network addresses and their corresponding hops. The routers usually must determine the longest network prefix matching a packet destination address and take the corresponding hop. Finding the next hop for each packet becomes harder and harder because the increasing number of hosts on the Internet expands the global network and increases the number of hops to go through. Therefore, the size of the routing table grows accordingly. Increasing the speed of data links helps to shorten the time to send a packet. Advances in semiconductor technology improve the processing capability of CPU chips and can help reduce the time of the table lookup. However, because the link speed grows faster than the processing speed, and the size of the data is also growing, the IP lookup problem is becoming a serious bottleneck on the information superhighway. The alphabet in this application is very limited (only {0,1}); however, the problem is very challenging.
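The longest-prefix-match lookup described above can be illustrated with a minimal Python sketch. It uses a plain linear scan purely to make the query concrete; it is not the method of the invention. The table contents, the hop names and the function name `longest_prefix_match` are hypothetical, and addresses are written as strings over {0,1}.

```python
def longest_prefix_match(table, address):
    """Return (prefix, hop) for the longest table prefix of `address`, or None.

    `table` maps bit-string prefixes to hop names (both hypothetical here).
    This is a linear scan over every entry, shown only to define the query.
    """
    best = None
    for prefix, hop in table.items():
        if address.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, hop)
    return best


# Hypothetical routing table: prefix of a network address -> next hop.
ROUTES = {"10": "hop_A", "1011": "hop_B", "0": "hop_C"}

match = longest_prefix_match(ROUTES, "10110001")
```

In this example both "10" and "1011" are prefixes of the destination address, and the longer one determines the hop. The cost of the scan grows with the size of the table, which is the inefficiency that motivates indexed structures.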
IP lookup, or layer 3 switching, is not the only application of prefix matching over the {0,1} alphabet in routers. Internet Service Providers (ISPs) like to provide different services to different customers. Some organizations filter packets from the outside world by installing firewalls in order to deny access to unauthorized sources. Supporting this functionality requires packet filtering or packet classification mechanisms in layer 4 of TCP/IP protocols. Forwarding engines must be able to identify the context of packets and classify them based on their source and destination address, protocols, etc., or on all of this information. This classification must be performed at wire speed. Routers attempt to handle this by keeping a set of rules which applies to a range of network addresses. Therefore, we again encounter the prefix matching problem, this time in two-dimensional space; i.e., for the source and destination addresses of a packet.
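The two-dimensional form of the problem can be sketched as follows in Python. The rule layout (a source prefix, a destination prefix and an action), the rule values and the function name `matching_rules` are assumptions made for illustration only; the text does not specify a rule format or how competing rules are prioritized, so the sketch simply returns every rule whose two prefixes both match.

```python
def matching_rules(rules, src, dst):
    """Return all rules whose source AND destination prefixes match the packet.

    An empty prefix "" is treated as a wildcard, since every string starts
    with the empty string. Rule priority is deliberately left out.
    """
    return [
        rule for rule in rules
        if src.startswith(rule["src"]) and dst.startswith(rule["dst"])
    ]


# Hypothetical rule set over the {0,1} alphabet.
RULES = [
    {"src": "10", "dst": "0", "action": "deny"},
    {"src": "1", "dst": "", "action": "allow"},
    {"src": "0", "dst": "11", "action": "allow"},
]

hits = matching_rules(RULES, "1011", "0110")
```

A real classifier would also need a tie-breaking policy among the matching rules, which is outside the scope of this sketch.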
Applications of prefix matching are not restricted to layer 3 and 4 switching. Some other useful applications include directory lookup in a telephone context, on-line dictionaries, spell checkers and looking up social security numbers. U.S. Pat. No. 5,758,024 discloses the prefix matching problem relating to computer speech recognition and proposes a compact encoding pronunciation prefix tree scheme. A method to improve the parsing process of source codes which use prefix matching is also disclosed in U.S. Pat. No. 5,812,853. The approach in this disclosure identifies the previously-parsed prefixes of a source, creates parsers in the parser states corresponding to the identified prefix and parses the remaining portion of the translation unit. Finally, U.S. Pat. No. 4,464,650 discloses an apparatus and method using prefix matching in data compression. Data compression is crucial in database applications as well as in data communication. The patent includes parsing the input stream of data symbols into the prefix and data segments, and using the previously longest matching prefixes to compress the data.
Traditionally, prefix matching searches have been performed using the trie structure. A trie is based on the "thumb-index" of a large dictionary, in which a word can be located by checking consecutive letters of a string from the beginning to the end. A trie is essentially an m_way tree in which each branch of a node corresponds to a letter or character of the alphabet Σ. A string is represented by a path from the root to a leaf node. The trie structure may be modified and applied to all of the applications discussed above. In some applications, for example in the longest prefix matching IP lookup context, researchers have been able to handle the problem in more subtle ways than the trie structure, due in part to the limited number of characters in the alphabet. These methods do not have the generality or broad applicability of the trie structure. The main problems with the trie structure are its inflexibility, i.e., the number of branches corresponds to the number of characters, and its need for additional blank nodes as place holders. Furthermore, in general, the search time is proportional to the length of the input strings.
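For reference, a conventional trie of the kind described above can be sketched in Python as follows. The class and method names (`TrieNode`, `Trie`, `insert`, `longest_prefix`) are illustrative choices, not taken from any cited work, and children are kept in a dictionary rather than a fixed array of m branches.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # one branch per character of the alphabet
        self.terminal = False  # True if a stored string ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, s):
        node = self.root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True

    def longest_prefix(self, query):
        """Return the longest stored string that is a prefix of `query`."""
        node = self.root
        best = "" if node.terminal else None
        for i, ch in enumerate(query):
            node = node.children.get(ch)
            if node is None:
                break
            if node.terminal:
                best = query[: i + 1]
        return best


trie = Trie()
for s in ["10", "1011", "0"]:
    trie.insert(s)
```

The lookup walks one character at a time, so its cost is proportional to the length of the query string. Storing "1011" also creates a node for "101" that holds no data, which is an example of the blank place-holder nodes mentioned above.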
The Patricia trie modified the binary trie by eliminating most of the unnecessary nodes, and this modification is the basis of several new methods that have been proposed in the last several years. These approaches attempt to check several characters, or several bits, at each step, instead of checking only one character. Because checking several characters may degrade memory usage and leave many memory spaces unused, all of these approaches try to minimize the memory waste. V. Srinivasan and G. Varghese, in "Fast Address Lookups using Controlled prefix", Proceedings of ACM Sigmetrics, Sep. 1998, proposed to expand the original prefixes (strings) into an equivalent set of prefixes with fewer distinct lengths, and then apply a dynamic programming technique to the overall index structure in order to optimize memory usage. Other methods proposed a specific case wherein local optimization of memory usage was applied in each step. This is the case in S. Nilsson and G. Karlsson's "Fast Address Look-Up for Internet Routers", Proceedings of IEEE Broadband Communications 98, April 1998. Finally, a new scheme from Lulea University of Technology attempts to reduce the size of the data set (routing table) so that it fits in the cache of a system. See Mikael Degermark, Andrej Brodnik, Svante Carlsson and Stephen Pink's "Small Forwarding Tables for Fast Routing Lookups", Proceedings of SIGCOMM, 1997.
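The idea of expanding prefixes into an equivalent set with fewer distinct lengths can be illustrated with the following Python sketch. It shows only the expansion step under simplifying assumptions: the alphabet is {0,1}, the set of allowed lengths is supplied by the caller, and the dynamic programming step that chooses those lengths is omitted. The function name `expand_table` and the sample table are hypothetical.

```python
from itertools import product


def expand_table(table, allowed_lengths, alphabet="01"):
    """Expand each prefix to the nearest allowed length that is >= its own.

    Shorter original prefixes are processed first, so that an entry derived
    from a longer (more specific) original prefix overwrites a colliding
    entry derived from a shorter one. Raises ValueError if a prefix is
    longer than every allowed length.
    """
    expanded = {}
    for prefix in sorted(table, key=len):
        target = min(n for n in allowed_lengths if n >= len(prefix))
        for tail in product(alphabet, repeat=target - len(prefix)):
            expanded[prefix + "".join(tail)] = table[prefix]
    return expanded


# Hypothetical table with prefix lengths 1, 2 and 3, expanded to lengths 2 and 4.
expanded = expand_table({"1": "A", "10": "B", "111": "C"}, [2, 4])
```

Here the prefix "1" expands to "10" and "11", but "10" already exists as a more specific original entry and therefore keeps its own value. The number of entries grows with each expansion, which is the memory cost these schemes try to control.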
All of these multi-bit trie schemes are designed for the IP lookup problem and may work well with the existing size of data, the number of prefixes in the lookup table and the current IP address length, which is 32 bits. Nonetheless, these schemes generally do not scale well to larger data sets or to data of longer string length, for example, the next generation of IP (IPv6) with 128-bit addresses.
A barrier to applying well-known data structures, such as the binary search tree, to the prefix matching problem is the lack of a mechanism to sort and compare strings of different lengths when the strings are prefixes of each other. Therefore, what has been needed is a new comparison, indexing and searching method and apparatus for performing prefix matching that functions independently of the lengths of the data or input strings, and is general enough in structure to apply to most, if not all, applications. Thus, a method and apparatus was needed that was generic and independent of any alphabet or character structure, while efficient in memory usage and search time.
In particular, efficient prefix trees for quickly accessing data were needed in applications which involve matching strings of different lengths of a generic alphabet Σ. In addition to exact match queries, the tree must also allow for the following queries: (1) finding the longest string which is a prefix of a given query string; (2) finding the smallest prefix of a given query string; (3) listing all the strings which are prefixes of a given query string; and (4) finding all the strings such that a given query string is a prefix of them.
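The four query types listed above can be stated precisely with a short Python sketch. The functions below define the expected answers by brute force over an unsorted list; they are not the tree-based method, and the function names and sample data are hypothetical. "Smallest prefix" is interpreted here as the shortest one.

```python
def prefixes_of(strings, query):
    """Query (3): all stored strings that are prefixes of `query`."""
    return [s for s in strings if query.startswith(s)]


def longest_prefix(strings, query):
    """Query (1): the longest stored prefix of `query`, or None."""
    return max(prefixes_of(strings, query), key=len, default=None)


def shortest_prefix(strings, query):
    """Query (2): the shortest stored prefix of `query`, or None."""
    return min(prefixes_of(strings, query), key=len, default=None)


def extensions_of(strings, query):
    """Query (4): all stored strings of which `query` is a prefix."""
    return [s for s in strings if s.startswith(query)]


# Hypothetical data set over the {0,1} alphabet.
DATA = ["10", "1011", "100", "0", "11"]
```

Note that `extensions_of` includes the query string itself when it is stored, since every string is a prefix of itself. Each function scans the whole data set, so a search structure is needed to answer the same queries efficiently.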
The present invention provides a method and apparatus for matching data strings of different lengths, wherein one data string may be the prefix of another data string. The method and apparatus include comparing and sorting data strings of different lengths and utilizing data tree structures to search for matching data strings, as well as prefixes of a given string. The invention applies to data strings comprised of letters or characters from any alphabet or database.
A method is provided for matching strings of different lengths, wherein the strings can be prefixes of each other and can be from any alphabet Σ. Applications of this invention are numerous. When the alphabet is alphanumeric, the possible applications include on-line dictionaries, spell checkers, telephone directory lookup, computer speech recognition, data compression, source code compiling, as well as others. However, the most crucial applications of (prefix) string matching of different lengths are in layer 3 and 4 switching in the {0,1} alphabet set, and in particular, when routers try to forward IP packets on the Internet or classify packets for providing different types of services for different customers.
The method and apparatus of this invention provide for comparing data strings of different lengths, sorting the data strings of different lengths based on this comparison and building tree structures for searching strings, as well as prefixes, within a large data set. A binary prefix tree is provided that efficiently utilizes machine memory space and gives a search performance comparable to the typical binary search tree. A static m_way prefix tree is also provided to get better search performance. Finally, a dynamic m_way prefix tree is provided, which performs well in data environments with high levels of transactions. The proposed method and apparatus, including the data structures, are simple to implement in hardware and software, scalable to accommodate large data sizes, independent from the data string lengths, flexible enough to handle higher dimension data and applicable to any character alphabet.
Specifically, according to one aspect of the present invention a method is provided for comparing a data set comprised of at least two data strings of indeterminate length in a common character set, with the method comprising comparing said data strings to identify the existence, or non-existence, of a common prefix portion. If a common prefix portion exists, then setting a specific check point character such that the probability of a character in the character set being greater than the check point character is about equal to the probability of a character in the character set being less than the check point character. If the common prefix portion comprises the entirety of one of said data strings, then comparing a first additional character in a longer length data string to the check point character to determine if the first additional character is less than or equal to the value of the check point character, with the longer length data string having a lesser value if the value of the first additional character is less than or equal to the value of the check point character and the longer length data string having a greater value if the first additional character is greater than the value of the check point character.
If the common prefix portion comprises less than the entirety of said data strings, then comparing a first discriminant character in each of the data strings to determine if one discriminant character is less than or greater than another discriminant character. If the value of the first discriminant character of one of the data strings is less than the first discriminant character of another data string, the data string has a lesser value than the other data string; if it is greater, the data string has a greater value than the other data string. Finally, if the value of the first discriminant character of each data string is equal, comparing the next character in each data string.
If no common prefix portion exists, then the method compares the first character in one data string to the first character of another data string to determine if the first character is less than or greater than the value of the first character of the other data string. If the value of the first character is less than the first character of the other data string, then the data string has a lesser value. If the value of the first character is greater than the first character of the other data string, then the data string has a greater value. Finally, if the value of the first character is equal to the first character of the other data string, comparing the next character in each data string.
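The comparison rules set out above can be sketched in Python as a three-way comparator. This is one reading of the text rather than a definitive implementation: the function name `compare` is hypothetical, characters are ordered by their code points, the check point character is passed in by the caller, and two identical strings are reported as equal.

```python
def compare(a, b, checkpoint):
    """Return -1, 0 or 1 as `a` is less than, equal to, or greater than `b`.

    `checkpoint` is the check point character, assumed to be chosen so that
    a character of the alphabet is about equally likely to fall above or
    below it (for example "0" for the {0,1} alphabet, as an assumption).
    """
    # Walk the shared positions; the first differing (discriminant)
    # character decides the order.
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    if len(a) == len(b):
        return 0
    # One string is a prefix of the other: compare the first additional
    # character of the longer string to the check point character.
    if len(a) > len(b):
        return -1 if a[len(b)] <= checkpoint else 1
    return 1 if b[len(a)] <= checkpoint else -1
```

For example, with "0" as the check point, "100" is less than its prefix "10" because its first additional character is "0", while "1011" is greater than "10" because its first additional character is "1".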
The method may further provide the step of sorting the data strings based on the data string value and may include first placing data strings having a common prefix portion into the sorting bag of the common prefix. Further, the method may first sort the data strings having no common prefix portion and then sort the data strings in the sorting bag.
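Sorting by data string value can be sketched in Python with a general-purpose comparison sort. The sketch does not reproduce the sorting-bag procedure described above; it only shows the order that results from the comparison rules. The helper name `compare_strings`, the choice of "0" as the check point character and the sample data are assumptions.

```python
from functools import cmp_to_key


def compare_strings(a, b, checkpoint="0"):
    """Three-way comparison of strings that may be prefixes of each other."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    if len(a) == len(b):
        return 0
    longer_is_a = len(a) > len(b)
    extra = a[len(b)] if longer_is_a else b[len(a)]
    # The longer string is the lesser one if its first additional
    # character does not exceed the check point character.
    longer_is_less = extra <= checkpoint
    return -1 if longer_is_a == longer_is_less else 1


# Hypothetical data set over the {0,1} alphabet.
DATA = ["10", "1011", "100", "0", "11"]
ordered = sorted(DATA, key=cmp_to_key(compare_strings))
```

In the resulting order the prefix "10" sits between its extensions "100" and "1011", so strings sharing a prefix are split around that prefix instead of all following it, as they would under ordinary dictionary order.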
The method may further comprise the step of recursively dividing the sorted data strings into two data spaces to build a binary search tree or recursively dividing the sorted data strings into more than two data spaces to build a static m_way tree. The tree may be divided such that the method first determines the data string having the shortest character length before recursively dividing the data strings into two data sets, with the data strings of lower value than the data string having the shortest character length and the data strings of higher value than the data string having the shortest character length divided into different sub-trees based on the shortest length data string.
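The recursive division around the shortest data string can be sketched in Python as follows. The sketch assumes its input has already been sorted by data string value, represents nodes as dictionaries, and breaks ties between equally short strings by taking the one nearest the middle of the list; the tie-breaking rule, the node layout and the function names are assumptions, not details taken from the text.

```python
def build_prefix_tree(sorted_strings):
    """Build a binary tree from strings already sorted by data string value.

    The shortest string becomes the node key; strings of lower value go to
    the left sub-tree and strings of higher value to the right sub-tree.
    """
    if not sorted_strings:
        return None
    shortest = min(len(s) for s in sorted_strings)
    mid = len(sorted_strings) // 2
    candidates = [i for i, s in enumerate(sorted_strings) if len(s) == shortest]
    pivot = min(candidates, key=lambda i: abs(i - mid))  # assumed tie-break
    return {
        "key": sorted_strings[pivot],
        "left": build_prefix_tree(sorted_strings[:pivot]),
        "right": build_prefix_tree(sorted_strings[pivot + 1:]),
    }


def in_order(node):
    """Return the keys of the tree in sorted order."""
    if node is None:
        return []
    return in_order(node["left"]) + [node["key"]] + in_order(node["right"])


# Hypothetical input, already ordered with "0" as the check point character.
SORTED = ["0", "100", "10", "1011", "11"]
tree = build_prefix_tree(SORTED)
```

Because the shortest string is always chosen as the node key, a string is never placed above one of its own prefixes: in this example "10" is an ancestor of both "100" and "1011". The price is that the tree may be less balanced than one split purely at the median.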
Additional data strings may be dynamically inserted into the tree to build a dynamic m_way tree. The m_way tree may be divided into two sub-trees if the number of elements at a node exceeds a pre-determined value or if the data strings at a node include a common prefix portion of the node data element. The m_way tree may further be divided into two sub-trees at a median point if the data strings at the node do not include any common prefix portion of the node data element.
The method further comprises dynamically inserting additional data strings by replacing a data element with an inserted data element if the inserted data element is a common prefix portion of the replaced element, then sorting all other data elements in the sub-tree of the replaced data element in respect to the inserted element.
The data strings may be alphanumeric prefixes of other alphanumeric data strings and the step of searching may provide for searching using an alphanumeric argument of the prefix. The method may further provide for data strings that are prefixes of network addresses in TCP/IP protocols along with a hops name and associated address in a router and further comprise the step of searching the data strings using a packet destination address to find a longest prefix match. The method may transmit data to the hop associated with the longest matching network address.
In the method, the data strings may be prefixes of network addresses in TCP/IP protocols along with port numbers, protocol name and address associated with the network address in a router, with the method further comprising using host addresses contained in the TCP/IP packet for searching and classifying packets based on the source and destination address. The host address may be contained in a TCP/IP packet with the method further comprising switching packets in layer 3 or layer 4 of the TCP/IP protocol.
The method, when transmitting packet information, may further comprise transmitting or filtering packet information using packet classification information and provide differentiated service or data protection based on the packet classification information.
The method may comprise determining the longest prefix string of a query string based on the sorted data elements or determining the shortest prefix string of a query string based on the sorted data elements. The method may also comprise determining all prefix data strings of a query string based on the sorted data elements or determining all data strings of which the query string is a prefix based on the sorted data elements.
In another aspect of the present invention a method is provided for comparing, sorting and searching a data set comprised of at least two data strings of indeterminate length in a common character set. The method comprises comparing said data strings to identify the existence, or non-existence, of a common prefix portion. If a common prefix portion exists, then setting a specific check point character such that the probability of a character in the character set being greater than the check point character is about equal to the probability of a character in the character set being less than the check point character.
If the prefix portion comprises the entirety of one of said data strings, then comparing a first additional character in a longer length data string to the check point character to determine if the first additional character is less than or equal to the value of the check point character, with the longer length data string having a lesser value if the value of the first additional character is less than or equal to the value of the check point character and the longer length data string having a greater value if the first additional character is greater than the value of the check point character.
If the common prefix portion comprises less than the entirety of said data strings, then comparing a first discriminant character in each of the data strings to determine if one discriminant character is less than or greater than another discriminant character. If the value of the first discriminant character of one of the data strings is less than the first discriminant character of another data string, the data string having a lesser value than another data string. If the value of the first discriminant character of one of the data strings is greater than the first discriminant character of another data string, the data string having a greater value than another data string. Finally, if the value of the first discriminant character of each data string is equal, comparing the next character in each data string.
If no common prefix portion exists, the method compares the first character in one data string to the first character of another data string to determine if the first character is less than or greater than the value of the first character of the other data string. If the value of the first character is less than the first character of the other data string, then the data string has a lesser value. If the value of the first character is greater than the first character of the other data string, then the data string has a greater value. Finally, if the value of the first character is equal to the first character of the other data string, comparing the next character in each data string.
The method further provides for sorting the data strings based on the data string value, building a search tree and searching the data strings using the search tree. The method may comprise the step of first placing data strings having a common prefix portion into a sorting bag.
The data strings may be prefixes of network addresses in TCP/IP protocols along with a hops name and associated address in a router, with the method further comprising the step of searching the data strings using a host address of a computer network to find a longest prefix match. Further, the method may provide for transmitting packet information associated with the network address to a device associated with the longest matching network address. The host address may be contained in a TCP/IP packet, with the method further comprising switching packets in layer 3 or layer 4 of the TCP/IP protocol.
The data strings may be prefixes of network addresses in TCP/IP protocols along with port numbers, protocol name and address associated with the network address in a router, with the method further comprising using host addresses contained in the TCP/IP packet for searching and classifying packets based on the source and destination address.
In yet another aspect of the present invention a router for forwarding data packets is provided, wherein the router finds the next hop for each packet by finding the longest data prefix matching a packet destination address and a corresponding hop from a prefix lookup table, wherein data strings are of indeterminate length in a common character set. The router comprises a comparator for comparing said data strings to identify the existence, or non-existence, of a common prefix portion, and wherein if a common prefix portion exists, the comparator sets a specific check point character such that the probability of a character in the character set being greater than the check point character is about equal to the probability of a character in the character set being less than the check point character. If the prefix portion comprises the entirety of one of said data strings, then the comparator compares a first additional character in a longer length data string to the check point character to determine if the first additional character is less than or equal to the value of the check point character, with the longer length data string having a lesser value if the value of the first additional character is less than or equal to the value of the check point character and having a greater value if the first additional character is greater than the value of the check point character.
If the common prefix portion comprises less than the entirety of said data strings, then the comparator compares a first discriminant character in each of the data strings to determine if one discriminant character is less than or greater than another discriminant character. If the value of the first discriminant character of one of the data strings is less than the first discriminant character of another data string, then the data string has a lesser value than another data string. If the value of the first discriminant character of one of the data strings is greater than the first discriminant character of another data string, then the data string has a greater value than another data string. If the value of the first discriminant character of each data string is equal, the comparator compares the next character in each data string.
If no common prefix portion exists, then the comparator compares the first character in one data string to the first character of another data string to determine if the first character is less than or greater than the value of the first character of the other data string. If the value of the first character is less than the first character of the other data string, the data string has a lesser value. If the value of the first character is greater than the first character of the other data string, the data string has a greater value. If the value of the first character is equal to the first character of the other data string, the comparator compares the next character in each data string.
The router also includes a sorter for sorting the data strings based on the data string value and a database builder for building a search tree. The router may also comprise a search engine for finding the longest matching data string to a data packet. Additionally, the router may comprise a transmitting unit for transmitting the hop associated with the longest matching network address.
The host addresses contained in the TCP/IP packet may be used by the router to search and classify packets based on the source and destination address. With the host address contained in a TCP/IP packet, the router switches packets in layer 3 and layer 4 of the TCP/IP protocol.
The router may further comprise a transmitting unit providing differentiated service or data protection based on the packet classification information.
These and other features and advantages of the present method and apparatus will be in part apparent, and in part pointed out hereinafter.