The present invention relates generally to compression of text-based data or protocol data and, more specifically, to compression of data to be transmitted over networks having bandwidth limited communication links.
In general, data compression converts data defined in a given format to another format containing fewer data bits than the original format. When the original data is needed, the compressed data is decompressed to restore the data to the original format. Data compression can be classified as lossy and lossless. As the term suggests, data is preserved during compression and decompression in a lossless method. Unlike lossless compression, lossy compression refers to methods in which the decompressed data is not exactly the same as the original data. Lossless data compression algorithms can be classified into dictionary coding and statistical coding types. The present invention is related to the dictionary coding type. The most widely used dictionary coding algorithms are the Lempel-Ziv algorithms and their variants. In particular, the LZ77 algorithm refers to the compression method as disclosed in Ziv et al. (xe2x80x9cA Universal Algorithm for Sequential Data Compressionxe2x80x9d,IEEE Transactions on Information Theory, Vol. IT-23, No.3, May 1977, pp.337-343) and the LZ78 refers to the method as disclosed in Ziv. et al. (xe2x80x9cCompression of Individual Sequences via Variable Rate Codingxe2x80x9d,IEEE Transactions on Information Theory, Vol. IT-24, No.5, September 1978, pp.530-535). LZ77 is based on the principle of replacing a repeated sequence of characters by a reference to an earlier occurrence of the sequences by a pointer. LZ78 parses a stream of input data characters into coded values based on an adaptively growing reference source, such as a look-up table or dictionary, for string matching.
In many text-based applications and application protocols, some or all of the text data are case insensitive. In case-insensitive text data, the semantics of the text data are the same regardless of the texts being represented in lowercase or uppercase. For example, most HTTP (Hypertext Transfer Protocol) message header fields, including HTTP keywords, are case-insensitive. Thus, the URI (Uniform Resource Identifier) http://Nokia.com is equivalent to the URI HTTP://NOKIA.COM or http://nokia.com. In addition, if an HTTP message body contains an HTML (HyperText Markup Language) document, all element names and attribute names in the HTML documents are case insensitive. The Lempel-Ziv algorithms treat all data as pure bytes in their string match logic and, therefore, do not address the issue relating to case sensitivity. As a consequence, a string of texts in the input data will not be compressed even though the dictionary already contains the same string with a different case configuration. This not only limits the compression ratio, but also causes waste in the use of the dictionary and its memory storage. In particular, the problem affects compression performance in two scenarios: 1) a pre-populated dictionary with protocol or application specific data, such as keywords; and 2) application data generated by one source as dictionary to compress same application data generated by another source.
It is thus advantageous and desirable to provide a method and device for improving the compression ratio in text-based data compression.
It is a primary objective of the present invention to provide a method and device for efficiently compressing text-based data or protocol data.
According to the first aspect of the present invention, there is provided a method of coding communication data in a form of data segments for providing encoded data. The method is characterized by
finding a match for a data segment in a reference source; and
compressing the data segment for providing a compressed data segment in the encoded data if the match is found, wherein said finding is carried out in a manner based on case sensitivity of the data segment.
A data segment comprises at least one data unit and is either case sensitive or case insensitive. The data segment is case insensitive if said at least one data unit is case insensitive.
Advantageously, the method is further characterized by
modifying the data segment if the data segment is case insensitive for providing a modified data segment so as to allow said finding to be based on a match for the modified data segment in the reference source.
When the data segment comprises a text string and each of said at least one data unit comprises a character, the character is either in a first case configuration or in a second case configuration, said method characterized in that said finding is carried out in a manner as if the data segment is case insensitive, and the method is further characterized by
providing in the encoded data information indicative of the case configuration of the character in each of said at least one data unit so as to allow the compressed data segment to be decompressed based on the provided information.
When the text string comprises a number of characters, and the provided information comprises a number of data bits, each corresponding to one of said number of characters, each data bit is assigned a value indicative of whether the corresponding character is in the first case configuration or in the second case configuration.
Alternatively, the provided information comprises a code having a value indicative of the character in each of said at least one data unit being in a first case configuration. The first case configuration is lowercase and the second case configuration is uppercase. It is also possible that the first case configuration is uppercase and the second case configuration is lowercase.
Advantageously, the text string comprises a plurality of characters including a leading character and at least one following character, and the provided information comprises a code indicative of only the leading character in the text string being in a first case configuration.
According to the second aspect of the present invention, there is provided a compressor for encoding communication data in a form of data segments for providing encoded data. The compressor is characterized by
a reference source;
a comparison means for finding a match of a data segment in the reference source; and
an encoding module for compressing the data segment if the match is found for providing a compressed data segment in the encoded data, wherein the comparison means has a matching algorithm for finding the match based on case sensitivity of the data segment.
The compressor is further characterized by
a parser, responsive to the data segment, for determining the case sensitivity of the data segment and for providing information indicative of the case sensitivity of the data segment to the comparison means, and the provided information is contained in a data flag conveyed to the comparison means.
Advantageously, the data segment comprises at least one character and each of said at least one character is either in a first case configuration or in a second case configuration. The compressor is further characterized by
a transformer module, responsive to the data segment, for converting each of said at least one character in the data segment to a first case configuration for providing a case-transformed data segment to the comparison means so as to allow the comparison means to find the match in the reference source based on the case-transformed data segment.
Alternatively, when the data segment comprises at least one character and each of said at least one character is either in a first case configuration or in a second case configuration, the compressor is characterized in that
the compressor finds the match of the data segment in the reference source as if the data segment is case insensitive, and that
information indicative of the case configuration of each of said at least one character is provided in the encoded data so as to allow the compressed data segment from the encoded data to be decompressed based on the provided information.
According to the third aspect of the present invention, there is provided a communication component in a telecommunications network, adapted to receiving and transmitting communication data in a form of data segment. The component is characterized by
a compressor for encoding the received communication data for providing encoded data for transmission, the compressor comprising:
a reference source;
an algorithm for finding a match of a data segment in the reference source; and
an encoding module for compressing the data segment if the match is found so as to provide a compressed data segment in the encoded data, wherein the algorithm finds the match based on case sensitivity of the data segment.
The communication component can be a mobile terminal or a support node in an infrastructure of the telecommunications network.
According to the fourth aspect of the present invention, there is provided data coding system for coding communication data in form of data segments. The system is characterized by:
a compressor for encoding the communication data, and
a decompressor for decoding encoded data, wherein the compressor comprises:
a reference source;
an algorithm for finding a match of a data segment in the reference source; and
an encoding module for compressing the data segment if the match is found so as to provide a compressed data segment, wherein the algorithm finds the match based on case sensitivity of the data segment, and wherein the decompressor comprises:
a further reference source, and
a further algorithm for finding the data segment from the further reference source.