This specification relates to compression of data communication, in particular, the compression and decompression of electronic mail and other electronic messages that are in message threads.
An electronic message thread (also referred to as an “electronic conversation”) is a series of electronic mail or other electronic messages. After an initial electronic message has been composed and distributed to one or more recipients, subsequent follow-up electronic messages in the thread are composed in response and generally concern related subject matter. The follow-up electronic messages can be, e.g., reply messages or forwarded messages. The electronic messages in electronic message threads have designated recipients. For example, the electronic messages may be addressed, e.g., to the composer of the initial message, to another recipient of the initial message, to a new recipient, or to combinations of such recipients. Electronic mail (i.e., e-mail) messages, chat messages, instant messaging messages, posts to asymmetric social networks, and other electronic messages with designated recipients can all form electronic message threads.
Data compression reduces the number of bits or other information-bearing units that are needed to encode information. Data compression generally relies on a defined coding scheme, the logic of which is used both to compress a source and to decompress the compressed version. A coding scheme can be used to encode symbols or groups of symbols in a source (e.g., letters or other characters in an electronic message) as relatively shorter code words that are later decoded to return the source symbols or groups of symbols.
In some instances, the correspondence between the symbols or groups of symbols and the code words is determined based on how often symbols or groups of symbols appear or are estimated to appear in the source that is to be compressed. Estimates of how often the symbols or groups of symbols appear can be obtained from “seed dictionaries” such as, e.g., an English language dictionary for an English language message, a German dictionary for a German language message, news articles in a corresponding language, or the like. Such seed dictionaries thus provide an estimate of the frequency of symbols and groups of symbols in the source document to which a coding scheme is applied.
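As an illustration of how a seed dictionary can supply frequency estimates, the following minimal sketch counts single characters in a seed text and converts the counts to relative frequencies. The function name `estimate_frequencies` and the sample seed text are hypothetical and chosen only for this example; a practical seed dictionary would be far larger and might count groups of symbols rather than single characters.

```python
from collections import Counter


def estimate_frequencies(seed_text):
    """Return the relative frequency of each character in seed_text.

    Hypothetical helper: treats every character as one symbol.
    """
    counts = Counter(seed_text)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {symbol: n / total for symbol, n in counts.items()}


# Illustrative seed text standing in for a real seed dictionary.
seed = "the quick brown fox jumps over the lazy dog"
freqs = estimate_frequencies(seed)

# List the five most frequent symbols in the seed text.
for symbol, freq in sorted(freqs.items(), key=lambda kv: -kv[1])[:5]:
    print(repr(symbol), round(freq, 3))
```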
If a particular electronic message includes a single symbol or a series of symbols that occurs relatively often, representing those symbol(s) by a relatively shorter code word reduces the length of that particular message. However, the correspondence between code words and symbols or groups of symbols will generally be different when a different message is compressed. This is akin to the use of the word “refrain” in musical lyrics, where the characters and words that form the refrain are different in different songs. In other words, the characters and words represented by the word “refrain” are chosen in light of each individual song's content so that the lyrics can be compressed. With more common source symbols or groups of symbols in each particular message coded using shorter code words and less common source symbols or groups of symbols coded using longer code words, the extent of compression of each particular message is increased.
Huffman coding is a coding scheme that uses a specific logic for choosing the correspondence between code words and the symbols or groups of symbols they represent based on how often symbols or groups of symbols appear in a seed dictionary. The code words chosen by Huffman coding have different lengths and are assigned to represent symbols or groups of symbols based on the probability that those symbols or groups of symbols occur or are estimated to occur in the seed dictionary. Huffman coding is described, e.g., in the article entitled “A Method for the Construction of Minimum-Redundancy Codes,” by David A. Huffman, Proceedings of the I.R.E., September 1952, pp. 1098-1102. Other coding schemes, such as arithmetic coding and LZW coding, also determine the correspondence between the seed dictionary symbols or groups of symbols and the code words based on how often symbols or groups of symbols appear or are estimated to appear in a seed dictionary.
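The general idea of Huffman coding can be sketched as follows: the two least frequent entries are merged repeatedly, so that less frequent symbols end up with longer code words. This is a minimal illustration that operates on single characters of a seed text; the function name `huffman_code` and the representation of code words as strings of "0" and "1" are assumptions made for readability, not details taken from this specification.

```python
import heapq
from collections import Counter


def huffman_code(seed_text):
    """Build a {symbol: code word} table from character counts in seed_text."""
    counts = Counter(seed_text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs a one-bit code word.
        return {next(iter(counts)): "0"}

    # Heap entries are (weight, tie_breaker, {symbol: partial code word}).
    # The unique tie_breaker keeps the dictionaries from ever being compared.
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)

    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # least frequent entry
        w2, _, right = heapq.heappop(heap)  # second least frequent entry
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1

    return heap[0][2]


code = huffman_code("aaaabbc")
print(code)
```

In this example the most frequent character, "a", receives a one-bit code word, while "b" and "c" each receive two-bit code words.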
Huffman coding and some other coding schemes yield a prefix code (also known as a “prefix-free code”). A prefix code is a code in which no valid code word is a prefix of any other valid code word. A message encoded using a prefix code can be transmitted as a sequence of concatenated code words without markers to frame the words in the message. A recipient can decode the message by repeatedly finding and removing the prefix that forms a valid code word.
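The decoding procedure described above can be sketched as follows. The code table `CODE` is a hypothetical three-symbol prefix code chosen only for illustration, and code words are represented as strings of "0" and "1" for readability. Because no code word is a prefix of another, the decoder can emit a symbol as soon as the bits read so far match a code word.

```python
# Hypothetical prefix code: no code word is a prefix of another.
CODE = {"a": "0", "b": "10", "c": "11"}


def encode(message, code):
    """Concatenate code words with no framing markers between them."""
    return "".join(code[symbol] for symbol in message)


def decode(bits, code):
    """Repeatedly strip the leading code word from the bit string."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded = []
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:
            # The bits read so far form a complete code word.
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a valid code word")
    return "".join(decoded)


bits = encode("abca", CODE)
print(bits)                # 010110
print(decode(bits, CODE))  # abca
```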