Many data transmission applications, such as communication between an aircraft and systems or devices on the ground, cellular communications (e.g., via cellular telephones), wireless communications, and communications between computers or computer networks, utilize data compression techniques to reduce the size of the data in order to save storage space or transmission time. Data compression may be required for a low-bandwidth network, such as a wireless network or a wireless vehicle network linking an aircraft and the ground. In addition, data compression may be used to optimize the bulk transfer of large documents, HTML, e-mail, or any other large amounts of data.
There are numerous known techniques for data compression such as LZ7, WinZip, PKZip and so on. Most of these compression processes require fixed-length files as inputs in order to begin the compression process and to quantify the character sequence redundancies needed to provide lossless file reduction. These compression methods do not work efficiently with the real-time streaming text broadcasts that are required by many wireless applications (e.g., aircraft-to-ground communications, Internet applications, mobile ground-vehicle uses, etc.). For example, many known compression processes require a finite block or page of information as an input to determine the “local” redundancies. The “local” redundancies are used to establish an alias transmission code set, and the code set is typically discarded after each file is processed. This type of compression requires batch processing of files at a network proxy server and results in a complex and costly system. Furthermore, compression processes that require a file to be fully read in first introduce latencies that may have undesirable effects on some network client applications. In addition, the high variance in textual language features can limit the performance of existing compression routines (e.g., to a two-to-one compression ratio).
Several data compression techniques use fixed-length codes to represent characters of text. Fixed-length codes, however, may not provide the most efficient representation of characters in text. Alternatively, several data compression techniques have been developed that use variable-length codes to represent characters in a text, such as Huffman encoding (or Huffman compression). Huffman encoding is an algorithm for compressing files based on the frequency of occurrence of each symbol or character in the text being compressed. Huffman encoding assigns shorter codes to more frequently used characters and longer codes to less frequently used characters. The result is a smaller number of bits in the compressed text.
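By way of illustration only, character-level Huffman encoding as described above may be sketched as follows. The function names `huffman_code` and `encode` and the sample string are assumptions introduced for this sketch; it is a textbook construction, not the implementation of any particular known compression product.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return a dict mapping each character of `text` to a bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {char: code_so_far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, codes):
    """Concatenate the variable-length codes for each character."""
    return "".join(codes[ch] for ch in text)


sample = "this is an example of a huffman tree"  # hypothetical input
codes = huffman_code(sample)
encoded = encode(sample, codes)
fixed_bits = 7 * len(sample)  # size with a fixed-length 7-bit code
print(len(encoded), "bits with Huffman codes versus", fixed_bits, "fixed")
```

In this sketch the space character, which occurs most often in the sample, receives a code no longer than that of a character occurring only once, and the encoded length is smaller than the fixed-length 7-bit representation. Note that the code set is derived from the complete input, which is the batch-style dependency on a finite input discussed above.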
Characters of words and phrases of text in a computer may be represented in a script or text code such as ASCII (American Standard Code for Information Interchange) or UNICODE (Unicode Worldwide Character Standard) to transfer data from one computer to another. ASCII is a format for text files in computers and on the Internet in which each character is represented by a fixed-length 7-bit binary number. UNICODE is a system for representing and processing texts in a plurality of languages (e.g., by setting fixed-length binary codes for text or script characters). It may be desirable to identify and compress words and phrases in text rather than encoding individual characters in text. It may also be desirable to provide a compression technique that may be used to compress words and phrases of text in multiple languages. The compression of words and phrases in text, however, may result in millions of codes when representing one or more languages. A fixed code length for representing text words for multiple languages could be at least twenty (20) bits, which may be an inefficient code length for representing shorter text words. While the use of a variable-length code may reduce the number of bits needed to compress words and phrases from multiple languages, it presents a problem: a device receiving compressed text with variable code lengths must identify or discern where one code (i.e., the run of bits representing a word or phrase in the text) ends and the next code begins (i.e., it must identify the changes in code length in the compressed text).
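The two points above, the width of a fixed-length word code and the boundary problem of a variable-length code, may be illustrated by the following minimal sketch. The vocabulary size of 1,000,000 entries, the three-word `codebook`, and the helper names `fixed_width_bits` and `all_parses` are hypothetical values chosen for illustration and are not taken from any actual code set.

```python
def fixed_width_bits(n_codes):
    """Smallest fixed code width (in bits) that can index n_codes entries."""
    return max(1, (n_codes - 1).bit_length())


def all_parses(bits, codebook):
    """Return every way to split `bits` into codes from `codebook`."""
    if not bits:
        return [[]]
    parses = []
    for word, code in codebook.items():
        if bits.startswith(code):
            for rest in all_parses(bits[len(code):], codebook):
                parses.append([word] + rest)
    return parses


# Hypothetical vocabulary of one million words and phrases.
print(fixed_width_bits(1_000_000))  # prints 20

# Hypothetical variable-length codebook with no boundary markers.
codebook = {"the": "0", "cat": "01", "sat": "1"}
parses = all_parses("01", codebook)
print(parses)  # prints [['the', 'sat'], ['cat']]
```

Because the received bits "01" can be read either as the single code for "cat" or as the codes for "the" followed by "sat", the receiver in this sketch cannot recover the original text without some additional means of locating code boundaries.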
Accordingly, there is a need for a system and method for compressing words and phrases in text for multiple languages. There is also a need for a system and method for compressing real-time broadcast text streams that does not require a fixed-length input file. In addition, there is a need for a system and method for identifying and tagging code length changes in variable-length codes. It would be advantageous to provide a system and method for compressing text that utilizes a “global” variable-length code set for a plurality of languages.