1. Field of the Invention
The present invention relates generally to the data compression field; and, more particularly, to a method and apparatus for conducting a longest match search in a lossless dictionary compression procedure.
2. Description of the Prior Art
The compression of large ASCII (American Standard Code for Information Interchange) files having many repeated signal strings is frequently carried out using a dictionary compression procedure such as Lempel-Ziv compression (see Salomon, D., Data Compression The Complete Reference, Springer-Verlag, New York, Inc., 1998). Dictionary compression procedures compress files by replacing strings of consecutive symbols in the file by reference to a separate dictionary file or index, or to previous occurrences in the file that is being compressed. Dictionary compression techniques, however, are inefficient when compressing small data files or messages unless a large dictionary or look-up table is available. In general, a large dictionary yields more efficient compression than a smaller dictionary.
Much of the signaling traffic over the Internet is generated by ASCII-based protocols such as, for example, SIP (see Handley, M., Schulzrinne, H., Schooler, E. and Rosenberg, J., SIP: Session Initiation Protocol, IETF RFC 2543, 2000); SDP (see Handley, M. and Jacobson, V., SDP: Session Description Protocol, IETF RFC 2327, 2000); HTTP (see Fielding, R. et al., Hypertext Transfer Protocol-HTTP/1.1, IETF RFC 2616, 2000) and RTSP (see Schulzrinne, H., Rao, A. and Lanphier, R., Real Time Streaming Protocol (RTSP), IETF RFC 2326, 2000). The message size for each of these protocols is relatively small; and, consequently, a large dictionary must be used in order to obtain efficient compression. This can be achieved either by using a large static dictionary or by continuously updating a dynamic dictionary using messages that have already been sent during a session. Static and dynamic dictionaries can also be combined and used, for example, in the compression framework Roger (see R. Price, H. Hannu, et al., Signaling Compression, draft-ietf-rohc-sigcomp-06.txt) Internet Draft (work in progress), May 2002, and H. Hannu, J. Christoffersson, et al., SigComp-Extended Operations, (draft-ietf-rohc-sigcomp-extended-03.txt) Internet Draft (work in progress), May 2002).
Of considerable importance to achieving efficiency in terms of memory and processor usage in a dictionary compression procedure is the manner in which the compression is carried out, e.g., how the search algorithm used for longest match searches is designed. The standard longest match search methods used in Lempel-Ziv or other dictionary compression schemes are typically either of linear type (LZ77) or involve searching in binary trees (LZSS), (see, for example, the Salomon publication referred to above).
Linear searches are slow but can be carried out rather easily. Changing the dictionary, for example, updating the dictionary by appending or deleting text, is also easy to accomplish. Searching in binary trees is a much faster method; however, with binary search trees, it is a relatively slow operation to append or delete text from the dictionary and to correspondingly update the binary search tree.
There is, accordingly, a need for a dictionary compression procedure that includes a longest match search method and apparatus that is fast and that also utilizes a data structure that is easy to update when the dictionary is changed from time to time.
The present invention provides a longest match search method and apparatus for a dictionary compression procedure that is very fast and that utilizes a data structure that is easy to update when text is appended to or deleted from the dictionary.
According to the present invention, a method for conducting a search to match a string of symbols of a sequence of symbols with a stored string of symbols in a dictionary to provide a matched string of symbols comprises the steps of matching a string of symbols of the sequence of symbols, beginning with a selected symbol of the sequence of symbols, with a first stored string of symbols, beginning with a first position in the dictionary in which the selected symbol is stored, to provide a known matched string of symbols; identifying another position in the dictionary in which the selected symbol is stored; choosing a symbol of the sequence of symbols that follows the known matched string of symbols in the sequence of symbols; and comparing the chosen symbol of the sequence of symbols with a symbol stored in a corresponding position in the dictionary, beginning with the another position, to determine whether the symbol stored in the corresponding position matches the chosen symbol, wherein if the comparing step results in a match, determining if a string of symbols under examination between the selected symbol and the chosen symbol matches a stored string of symbols in the dictionary beginning with the symbol stored in the another location, to determine if the string of symbols under examination is longer than the known matched string of symbols.
The present invention recognizes that the number of comparisons that must be conducted in order to find a second matched string of symbols that is longer than a known matched string of symbols can often be reduced by first choosing a symbol in the sequence of symbols that follows the first matched string of symbols to define a string of symbols under examination, and then comparing the chosen symbol with a symbol stored in a corresponding position in the dictionary beginning with another position in which the selected symbol is stored. In particular, if this comparison does not result in a match, it may be unnecessary to compare intermediate symbols in the string of symbols under investigation. By reducing the number of comparisons that must be made, the overall time required to conduct the match search can be reduced.
According to a presently preferred embodiment of the invention, the chosen symbol comprises the first symbol in the sequence of symbols that follows the known matched string of symbols. If this chosen symbol does not match the symbol stored in the corresponding position in the dictionary, it is known with certainty that the string of symbols under investigation cannot be longer than the known matched string of symbols. Accordingly, it is unnecessary to compare other symbols of the string of symbols under investigation with symbols stored in corresponding positions in the dictionary. Particularly when the known matched string of symbols comprises many symbols, the method according to the present invention can eliminate a substantial number of comparisons and thus significantly shorten the time required to complete the match search method.
According to a further embodiment of the invention, the match search method comprises a longest match search method, and the identifying step comprises identifying all positions in the dictionary in which the selected symbol is stored. A string of symbols under investigation is then checked with respect to a string of symbols stored in the dictionary beginning with each position in the dictionary in which the selected symbol occurs to identify the longest possible matched string of symbols.
According to a further embodiment of the invention, the selected symbol comprises the first symbol of the sequence of symbols, and the first position in the dictionary comprises the most recent position in the dictionary in which the first symbol is stored. The another position in the dictionary comprises the second most recent position in which the first symbol is stored. The method is continued through the dictionary until all positions in which the first symbol is stored have been identified and the comparison made so as to identify the longest possible match.
According to yet a further embodiment of the invention, an apparatus for conducting a longest match search to find a longest possible match between a string of symbols of a sequence of symbols and a stored string of symbols in a dictionary comprises a dictionary and a processor. The dictionary includes first and second arrays and a table. The first array comprises each symbol stored in the dictionary. The second array comprises pointers to a previous position of each of the symbols stored in the first array, and the table comprises pointers to positions of the last occurrences in the first array of each symbol stored in the first array. The processor conducts a longest match search to identify a longest possible string of matched symbols utilizing the dictionary comprising the first and second arrays and the table. The apparatus of the invention not only permits a longest match search to be conducted more efficiently, but also permits the dictionary to be easily updated with new symbols or strings of symbols when required.
The match search method and apparatus according to the present invention can readily be incorporated into a lossless dictionary compression procedure such as Lempel-Ziv compression to facilitate the overall compression process. These capabilities provided by the present invention are important in many applications including applications that use the compression framework ROGER.
Yet further objects, features and advantages of the present invention will become apparent hereinafter in conjunction with the following detailed description of presently preferred embodiments thereof.