In communications networks, such as Internet Protocol (“IP”) based wired and wireless networks, there is an increasing number of network devices whose functionality includes examining Layer 7 information in the Open Systems Interconnection (“OSI”) set of protocols. In the OSI model, there are seven layers, each reflecting a different function that has to be performed in order for program-to-program communication to take place between devices. The seven OSI layers are as follows: Layer 1 (physical); Layer 2 (data link); Layer 3 (network); Layer 4 (transport); Layer 5 (session); Layer 6 (presentation); and Layer 7 (application). Layer 7 supports application and end-user processes. At Layer 7, communication partners are identified, quality of service is identified, user authentication and privacy are considered, and any constraints on data syntax are identified. Everything at this layer is application-specific. Layer 7 provides application services for file transfers, e-mail, and other network software services. Telnet (i.e., terminal emulation) and the File Transfer Protocol (“FTP”), for example, are applications that exist entirely in the application layer.
Of particular interest to many network devices is Uniform Resource Locator (“URL”) character string information. A URL is the unique address for a file that is accessible on the Internet. A common way to navigate to a Web site is to enter the URL of the site's home page file in a Web browser's address line. However, any file within that Web site can also be specified with a URL. Such a file might be a Web page other than the home page, an image file, or a program such as a common gateway interface application or Java™ applet. The URL contains the name of the protocol to be used to access the file resource, a domain name that identifies a specific computer on the Internet, and a pathname (i.e., a hierarchical description that specifies the location of a file in that computer). On the Web, which uses the Hypertext Transfer Protocol (“HTTP”), an example of a URL is “http://www.somewhere.com/files/file.txt”, which specifies the use of an HTTP (Web browser) application, a unique computer named “www.somewhere.com”, and the location of a text file or page to be accessed on that computer whose pathname is “/files/file.txt”. An example of a URL for a particular image on a Web site is “http://www.somewhere.com/pages/page.gif”. A URL for a file meant to be downloaded using FTP would require that the FTP protocol be specified, as in the following example: “ftp://www.somewhere.com/programs/program.ps”.
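The three URL components described above can be illustrated with a short sketch. It uses Python's standard urllib.parse module; the helper name url_parts is hypothetical and chosen only for this illustration.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Split a URL into (protocol, domain name, pathname)."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path


# The example URLs are the ones used in the text above.
for example in (
    "http://www.somewhere.com/files/file.txt",
    "http://www.somewhere.com/pages/page.gif",
    "ftp://www.somewhere.com/programs/program.ps",
):
    print(url_parts(example))
```

A network device would typically extract these fields from packet payloads rather than from complete strings, so this is only a model of the URL structure, not of a device's parser.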
An example of a network device that may make use of URL information is a Gateway GPRS Support Node (“GGSN”). A GGSN device is an interface between the General Packet Radio Service (“GPRS”) wireless data network and other networks such as the Internet or private networks. The use of a GGSN device by a service provider (“SP”), for example, allows billing behaviour to depend upon the Web site being accessed. For example, a GGSN provides the opportunity for SPs to charge their subscribers according to the type of content they wish to access and/or the way they prefer to access this content (e.g., multimedia messaging service (“MMS”), ring-tone downloads, Internet browsing, corporate services, mobile video, etc.).
Another example of a network device that may make use of URL information is a Web switch, also known as a Layer 7 switch, URL switch, Web content switch, or content switch. A Web switch is a load balancing network device that routes traffic to the appropriate Web servers based on the URL of the request.
Typically, a GGSN or Web switch must search for and/or compare URL character string information to perform its function. As network speeds increase, one problem that faces GGSN and Web switch providers is the need to perform URL searches at ever increasing speeds. Characteristics of URL searching include an easily recognized start point, variable length (potentially very long), and case sensitivity. Furthermore, it can be valuable to find the longest matching URL. For example, the device may choose a first action if the URL starts with “xyz*” and a second action if the URL starts with “xyz.abc*”. Existing URL search methods include the use of ternary content addressable memory (“TCAM”) devices, hashing, and discrete finite automata (“DFA”).
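The idea of choosing an action by the longest matching URL prefix can be sketched as follows. This is a plain linear scan intended only to show the selection rule, not a fast search method; the prefixes, action names, and the function longest_prefix_match are all hypothetical.

```python
def longest_prefix_match(url, actions):
    """Return the action of the longest prefix in `actions` that `url` starts with."""
    best = None
    for prefix, action in actions.items():
        # str.startswith is case-sensitive, matching the case sensitivity
        # of URL searching noted above.
        if url.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, action)
    return best[1] if best else None


# Hypothetical prefix-to-action configuration.
ACTIONS = {
    "www.somewhere.com": "first-action",
    "www.somewhere.com/files": "second-action",
}

print(longest_prefix_match("www.somewhere.com/files/file.txt", ACTIONS))
print(longest_prefix_match("www.somewhere.com/pages/page.gif", ACTIONS))
```

Both prefixes match the first URL, and the longer one determines the action; only the shorter prefix matches the second URL.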
A TCAM is a specialized memory device that allows for the simultaneous comparison of input data (or an input pattern) of a given width (common widths include 72, 144, 288 or 576 bits) against a number of entries of the same width. TCAMs typically allow for the configuration of the width of the searches to be performed (e.g., allowing n entries of width w or n/2 entries of width 2w) and for the partitioning of the search entries to allow different searches to use the same device. The different searches may each have different widths, which provides the advantage of only activating part of the TCAM and hence reducing the power consumed by the device. Each bit of each entry in the TCAM may be programmed to match a bit value of 0, 1, or either (i.e., “don't care”) of the input data. These three states give the TCAM its name (i.e., “ternary”). The “don't care” value allows matching to values shorter than the full width of the TCAM.
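A software model may help make the ternary comparison concrete. In the sketch below, each entry is a (value, care mask) pair, and bits outside the mask are "don't care". A real TCAM compares all entries simultaneously in hardware; the loop here is sequential, and returning the lowest matching index is an assumed priority rule. All names are hypothetical.

```python
WIDTH = 72  # bits per entry; one of the common widths mentioned above


def make_entry(pattern: bytes):
    """Build a (value, care_mask) pair matching inputs that start with `pattern`.

    Bits beyond the end of the pattern are left as "don't care".
    """
    nbits = len(pattern) * 8
    if nbits > WIDTH:
        raise ValueError("pattern is wider than a TCAM entry")
    shift = WIDTH - nbits
    value = int.from_bytes(pattern, "big") << shift
    mask = ((1 << nbits) - 1) << shift
    return value, mask


def to_key(data: bytes) -> int:
    """Left-align the first WIDTH bits of `data`, zero-padding short inputs."""
    nbytes = WIDTH // 8
    return int.from_bytes(data[:nbytes].ljust(nbytes, b"\0"), "big")


def tcam_search(entries, data: bytes):
    """Return the index of the first matching entry, or None if none match."""
    key = to_key(data)
    for index, (value, mask) in enumerate(entries):
        # An entry matches when every "care" bit equals the input bit.
        if (key ^ value) & mask == 0:
            return index
    return None


# Longer patterns are placed first so that they take priority.
entries = [make_entry(b"www.abc"), make_entry(b"www.")]
print(tcam_search(entries, b"www.abc.com/x"))
print(tcam_search(entries, b"www.xyz.com"))
```

Note that only the first 72 bits (9 bytes) of the input take part in the comparison, which illustrates the fixed-width nature of the search.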
Hashing algorithmically processes the input data to be compared to give a smaller value which can then be used to directly access a memory mapped location, which will then indicate if a match has occurred. One commonly used hashing algorithm is cyclic redundancy checking (“CRC”). The strength of this method is that it uses common memory which is relatively fast, low power, dense, and low cost. For reference, cyclic redundancy checking is a method of checking for errors in data that has been transmitted on a communications link. A sending device applies a 16- or 32-bit polynomial to a block of data that is to be transmitted and appends the resulting CRC code to the block. The receiving end applies the same polynomial to the data and compares its result with the result appended by the sender. If they agree, the data has been received successfully. If not, the sender can be notified to resend the block of data. This application of a polynomial may also be used to match data to known patterns. In this case, the compressed value (or a portion of it) is used to directly look up whether the data matches a pattern, that is, it is used as an address to a stored table. The table entries are configured by compressing the pattern(s) that are being searched for and the corresponding entry is set to indicate the pattern that has been matched. The act of compression (e.g., through the use of a CRC polynomial) means that multiple data patterns may generate the same CRC code. As a result, when a search finds a “hit” it is necessary to confirm that the applied data matches the uncompressed pattern indicated by the hit location. Another possibility is that two patterns that are to be searched for share the same compressed pattern. A further mechanism is required to resolve this situation. 
As more bits are used in the compressed value, the chance of these collisions occurring decreases. However, the size of the table to be accessed also increases, with each additional bit causing the table size to double.
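The table lookup and confirmation steps described above can be sketched as follows. The sketch uses zlib.crc32 from the Python standard library as the compression function and keeps a list per table slot to hold patterns that share a compressed value; the per-slot list is only one assumed way of resolving such collisions, and INDEX_BITS and the function names are hypothetical.

```python
import zlib

INDEX_BITS = 8  # the table has 2**INDEX_BITS entries; each extra bit doubles it


def index_of(data: bytes) -> int:
    """Compress `data` to a table address using the low bits of its CRC-32."""
    return zlib.crc32(data) & ((1 << INDEX_BITS) - 1)


def build_table(patterns):
    """Configure the table by compressing each pattern being searched for."""
    table = [[] for _ in range(1 << INDEX_BITS)]
    for pattern in patterns:
        # Patterns that compress to the same address share a slot.
        table[index_of(pattern)].append(pattern)
    return table


def lookup(table, data: bytes) -> bool:
    """Report a match only after confirming the uncompressed pattern."""
    return data in table[index_of(data)]


table = build_table([b"www.somewhere.com", b"www.elsewhere.com"])
print(lookup(table, b"www.somewhere.com"))
print(lookup(table, b"www.nowhere.com"))
```

Because the whole input is compressed in one step, its length must be known before the lookup, and no part of it can be treated as a wildcard.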
Discrete finite automata (“DFA”) is a pattern matching mechanism based on a finite state machine. It steps through the data that is being compared one portion at a time (e.g., one byte at a time). The value of that portion of the data (e.g., a byte) is used to access a table that indicates the next state of the machine. There is a different table for each state of the machine. A pattern is represented by a series of states, one for each byte of the pattern. If the data causes the machine to follow the states, the pattern is matched. Wildcards, including variable length wildcards, are possible. Different partitioning of the data is possible (e.g., 16 bit units); however, using bytes corresponds to the most common encodings of characters. Larger partitions reduce the rate at which tables need to be accessed (e.g., 16 bit units would halve the rate); however, they greatly increase the size of the tables required (e.g., tables for 16 bit units would have 256 times more entries than tables for single byte units).
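A minimal byte-at-a-time state machine along these lines might look like the sketch below. It handles literal patterns only (no wildcards), shares states between patterns with a common prefix, and reports the first pattern whose final state is reached. The DEAD marker and function names are hypothetical.

```python
DEAD = -1  # hypothetical marker meaning "no pattern can match any more"


def build_dfa(patterns):
    """Build one 256-entry next-state table per state.

    Patterns with a common prefix share the states for that prefix.
    """
    tables = [[DEAD] * 256]  # state 0 is the start state
    accepting = {}           # final state -> pattern that ends there
    for pattern in patterns:
        state = 0
        for byte in pattern:
            if tables[state][byte] == DEAD:
                tables.append([DEAD] * 256)
                tables[state][byte] = len(tables) - 1
            state = tables[state][byte]
        accepting[state] = pattern
    return tables, accepting


def dfa_match(tables, accepting, data: bytes):
    """Step through `data` one byte at a time; return the first pattern matched."""
    state = 0
    for byte in data:
        state = tables[state][byte]  # one table access per input byte
        if state == DEAD:
            return None
        if state in accepting:
            return accepting[state]
    return None


tables, accepting = build_dfa([b"www.abc", b"www.xyz"])
print(dfa_match(tables, accepting, b"www.abc.com"))
print(len(tables))  # start state + 4 shared states + 3 + 3 unshared states
```

Each step depends on the result of the previous table access, which is why the accesses for a single input cannot be overlapped.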
However, these existing methods have several problems as follows.
One problem with the use of TCAM devices is that these devices are relatively expensive and consume significant amounts of power. As they compare the input data to every programmed entry, increased power consumption is unavoidable. In addition, TCAM searches use fixed width entries. If the input data or pattern to be matched is less than the width of the table in the device, the unused bits are wasted. On the other hand, if the pattern to be matched is wider than what the device supports, multiple searches are required along with a method of associating these searches. Each portion of the pattern will consume one entry in the TCAM (using capacity) and problems may arise when multiple matches occur during a search.
One problem with the use of hashing is that this method requires that the length of the input data or pattern to be matched be known and, in general, fixed. Also, given that the data is being compressed, different values may give the same result, which is why most hashing methods have mechanisms to verify that the correct pattern was matched and to handle the case of two (or more) patterns having the same hashed value. The smaller the hashed value, the higher the rate of collisions. However, the larger the hashed value (i.e., the less the compression), the larger the required table, with each bit doubling the size of the table. Also, hashing does not provide for any form of wildcarding, except if data manipulation can take place before the hashing, which implies that the same wildcards are applied to all patterns.
One problem with the use of DFA is that this method is limited by the need to access a table for each unit (e.g., byte) of the input data. The rate of processing is limited by the time required to access the table in memory, as the result needs to be returned before the next access can be made. Multiple searches on different input data (e.g., URLs from different packets) may occur in parallel up to the limit of the memory's bandwidth. Another problem is the amount of memory required. If the DFA operates in units of one byte (for example), each byte of the pattern requires a table of 256 words (the word width is implementation dependent, but is at minimum 16 bits and more likely 32 bits). Of course, different patterns may share state tables. Furthermore, the addition of a pattern requires the programming of a table for each byte (i.e., a long pattern requires a lot of writes). If the new pattern shares any state with an existing pattern, it is necessary to recognize this and to program the tables accordingly. Similarly, if a pattern is removed, it is necessary to check for shared states and only remove those no longer used.
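The memory requirement can be estimated with simple arithmetic. The sketch below computes the worst case for a single pattern that shares no states with any other pattern; the function name, its parameters, and the 100-byte pattern length are assumptions made for illustration.

```python
def dfa_table_bytes(pattern_len, unit_bits=8, word_bits=32):
    """Worst-case table memory, in bytes, for one pattern with no shared states.

    pattern_len is in bytes; one state table is needed per unit of the pattern.
    """
    states = (pattern_len * 8 + unit_bits - 1) // unit_bits  # round up
    entries_per_table = 1 << unit_bits  # 256 for byte units
    return states * entries_per_table * word_bits // 8


# A hypothetical 100-byte pattern, byte units, 32-bit words.
print(dfa_table_bytes(100))                # 100 * 256 * 4 = 102400
# The same pattern with 16-bit words.
print(dfa_table_bytes(100, word_bits=16))  # 100 * 256 * 2 = 51200
# 16-bit units halve the number of states but need 65536 entries per table.
print(dfa_table_bytes(100, unit_bits=16))  # 50 * 65536 * 4 = 13107200
```

Under these assumptions, moving from byte units to 16-bit units halves the number of table accesses but multiplies the total memory for the pattern by 128.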
A need therefore exists for an improved method and system for character string searching. Accordingly, a solution that addresses, at least in part, the above and other shortcomings is desired.