A deterministic, lossless, and efficiently invertible data compression function F is a transformation that accepts an input data segment A of length NA bits and outputs a data segment B of length NB bits. The data segment B can be applied as the input to an inverse function F−1 that outputs the aforementioned data segment A. The operations of F and F−1 are illustrated in FIG. 1. The utility of the compression function is the fact that NB may be less than NA. Thus, if A were to be transmitted over a network or stored to data storage media, F could be employed to reduce the bandwidth consumed during the transmission of A or to reduce the data storage media space required to maintain A, respectively. If F is effective and A has low entropy, it is likely that NB is less than NA. However, if F is poorly designed or A has high entropy, NB may be greater than or equal to NA; that is, in some cases, F can actually inflate the data segment.
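The behavior of F and F−1 described above can be sketched in a few lines. The sketch below is illustrative only: it assumes zlib (DEFLATE) as a stand-in for F, and the helper names `compress`, `decompress`, and `pseudo_random_bytes` are hypothetical rather than part of any described system. It checks the lossless round trip and prints the two bit lengths for a low-entropy and a high-entropy segment.

```python
import hashlib
import zlib


def compress(a: bytes) -> bytes:
    """F: map segment A to segment B (zlib is an assumed stand-in)."""
    return zlib.compress(a)


def decompress(b: bytes) -> bytes:
    """F^-1: recover segment A exactly from segment B."""
    return zlib.decompress(b)


def pseudo_random_bytes(n: int) -> bytes:
    """Deterministic high-entropy test data built from SHA-256 digests."""
    blocks = (
        hashlib.sha256(i.to_bytes(4, "big")).digest()
        for i in range(n // 32 + 1)
    )
    return b"".join(blocks)[:n]


low_entropy = b"A" * 4096                 # highly redundant segment
high_entropy = pseudo_random_bytes(4096)  # little redundancy to exploit

for name, a in (("low entropy", low_entropy), ("high entropy", high_entropy)):
    b = compress(a)
    assert decompress(b) == a  # lossless: F^-1(F(A)) == A
    print(f"{name}: N_A = {8 * len(a)} bits, N_B = {8 * len(b)} bits")
```

With this stand-in, the redundant segment shrinks considerably, while the high-entropy segment comes out slightly larger than its input because the encoding adds framing overhead that the compressor cannot recover.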
It is well known that compression functions can improve the performance of data management operations. In particular, compression can improve the performance of secure data storage and transmission systems. Such security systems often depend on cryptographic operations to achieve critical security goals. Examples of common cryptographic operations include (but are not limited to) symmetric-key encryption algorithms, asymmetric-key encryption algorithms, keyed hash functions, and digital signatures. The latencies of most conventional cryptographic operations depend on the size of the data (in bits) applied as inputs to the operations. That is, as the size of the input increases, the latency of a cryptographic operation also increases. Thus, compressing the input data prior to cryptographic processing can decrease the latency of the cryptographic operation and thereby increase the performance of the secure system.
To illustrate this concept using a practical example, consider a network transaction in which the confidentiality of transmitted data is ensured to some degree by a secure symmetric-key encryption algorithm such as the Advanced Encryption Standard (AES) or the Data Encryption Standard (DES). When used with a suitable mode of operation, these two publicly known encryption algorithms accept an arbitrarily-sized plaintext P (which is the data to be encrypted) and a fixed-length secret key K as inputs, and the algorithms output an encrypted ciphertext C that is approximately the same size as the input plaintext. Furthermore, for each of these encryption algorithms, there is a corresponding decryption algorithm that accepts a ciphertext C and a secret key K as inputs and that outputs a plaintext P. The process of encryption and decryption is illustrated in FIG. 2. If the encryption algorithm is secure, then given the ciphertext C without knowledge of the secret key K, a passive adversary cannot feasibly compute P.
In the example network transaction, a sender and a receiver share a secret key K that was securely established and securely distributed to the two parties prior to the current transaction. The sender can employ the secret key to encrypt a message P and obtain the encrypted result C. The sender then transmits the ciphertext C to the receiver, and the receiver may use its copy of the secret key K with the appropriate public decryption algorithm to decrypt C and obtain P.
Compression can be used to improve the performance of such a system by reducing the amount of data to be encrypted and decrypted and by reducing the amount of data to be transmitted. A secure network transaction that employs compression is illustrated in FIG. 3. By compressing the plaintext P using the compression function F prior to encryption to obtain P′, where the bit length of P′ is smaller than that of P, the amount of computation required to encrypt the plaintext can be reduced significantly. Thus, if the computation required to compress P plus the computation required to encrypt P′ (which is P in compressed form) is less than the computation required to encrypt P in uncompressed form, then performance speedups can be realized by the sender. Similarly, if the computation required to decrypt the received message C′ plus the computation required to decompress the decrypted result P′ using the inverse function F−1 is less than the computation required to decrypt the ciphertext C of the uncompressed plaintext, then performance speedups can be realized by the receiver.
For the encryption algorithms cited above, the output ciphertext C is of size (i.e., bit length) approximately equal to that of input plaintext P. Thus, encrypting the compressed plaintext P′ yields an output ciphertext C′ that is likely smaller than the ciphertext C. Therefore, by compressing the plaintext prior to encryption, the network transmission requirements involved in sending the ciphertext can be reduced, which can lead to improved network performance and to improved system performance.
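The compress-then-encrypt transaction can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation of any particular system: zlib stands in for F, and `toy_stream_cipher` is a hypothetical, length-preserving placeholder for the encryption and decryption algorithms. It is not AES or DES and offers no security; it is used only because its output is the same size as its input, which mirrors the size property described above. The names `send`, `receive`, and the key value are likewise hypothetical.

```python
import hashlib
import zlib


def toy_stream_cipher(key: bytes, data: bytes) -> bytes:
    """Length-preserving stand-in for E and D. NOT AES/DES and NOT secure.

    XORs the data with a SHA-256 counter keystream, so applying it twice
    with the same key returns the original data.
    """
    keystream = bytearray()
    counter = 0
    while len(keystream) < len(data):
        keystream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(x ^ k for x, k in zip(data, keystream))


def send(key: bytes, plaintext: bytes) -> bytes:
    """Sender: P -> P' = F(P) -> C' = E_K(P')."""
    return toy_stream_cipher(key, zlib.compress(plaintext))


def receive(key: bytes, ciphertext: bytes) -> bytes:
    """Receiver: C' -> P' = D_K(C') -> P = F^-1(P')."""
    return zlib.decompress(toy_stream_cipher(key, ciphertext))


key = b"hypothetical pre-shared key K"
p = b"GET /index.html HTTP/1.1\r\n" * 200  # repetitive, low-entropy message

c = toy_stream_cipher(key, p)  # encrypt without compression
c_prime = send(key, p)         # compress, then encrypt

assert receive(key, c_prime) == p
print(f"|P| = {len(p)}, |C| = {len(c)}, |C'| = {len(c_prime)} bytes")
```

Because the stand-in cipher preserves length, C is exactly as large as P, while C′ is only as large as the compressed plaintext P′. Both the cipher and the network then handle fewer bytes.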
Many existing methods, such as the IP Payload Compression Protocol (IPComp) and common file encryption utilities, employ compression algorithms to reduce the length of arbitrarily-sized data segments prior to the application of a data management operation such as encryption or a data comparison. A variety of lossless compression algorithms can be used in these methods; some algorithms are designed for high compression and/or decompression speed, and some algorithms are designed to achieve high compression ratios. The compression ratio of a data segment is the size of the uncompressed data segment divided by the size of the data segment in compressed form, so that a larger ratio indicates a greater reduction in size.
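The compression ratio defined above, and the trade-off between speed-oriented and ratio-oriented settings, can be illustrated with a short sketch. It assumes zlib as the compressor, whose level 1 favors speed and level 9 favors output size; the function name `compression_ratio` and the sample data are hypothetical.

```python
import zlib


def compression_ratio(segment: bytes, level: int) -> float:
    """Uncompressed size divided by compressed size."""
    return len(segment) / len(zlib.compress(segment, level))


segment = b"the quick brown fox jumps over the lazy dog\n" * 500

fast = compression_ratio(segment, 1)   # level 1: favors speed
small = compression_ratio(segment, 9)  # level 9: favors compression ratio
print(f"level 1 ratio: {fast:.1f}, level 9 ratio: {small:.1f}")
```

A ratio greater than 1 indicates that the segment shrank; a ratio below 1 indicates that compression inflated it.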
Existing compression methods effectively accelerate data management operations for many applications, but in certain situations, these methods can lead to performance degradation. Consider a scenario involving random access to some quantity of stored data, such as a file. In many applications, it is desirable to specify a particular segment of data to be accessed within a file beginning at a particular logical address within that file. For this reason, most file systems are designed to facilitate fast access to a range of data given a starting address within an uncompressed file. If the file is stored in compressed form, however, a significant portion (if not all) of the file may need to be read and decompressed before the desired data segment within the file can be accessed. Thus, for typical files, this decompression can lead to serious performance loss.
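The random access cost described above can be made concrete with a small sketch. It assumes the file is stored as a single zlib stream, which has no internal entry points, so decoding must begin at the first byte. The function `read_range` and the sample file contents are hypothetical.

```python
import zlib


def read_range(compressed: bytes, offset: int, length: int) -> tuple[bytes, int]:
    """Read `length` bytes at logical `offset` from a single zlib stream.

    Returns (data, bytes_decompressed). Everything before `offset` must be
    decoded first, because the stream can only be decoded from its start.
    """
    if length <= 0:
        return b"", 0
    # max_length bounds the output, so decoding stops once the range is covered.
    prefix = zlib.decompressobj().decompress(compressed, offset + length)
    return prefix[offset:offset + length], len(prefix)


# Hypothetical file: 100,000 fixed-width records of 14 bytes each.
original = b"".join(f"record {i:06d}\n".encode() for i in range(100_000))
stored = zlib.compress(original)

data, decoded = read_range(stored, 1_000_000, 16)
assert data == original[1_000_000:1_000_016]
print(f"requested 16 bytes, decompressed {decoded} bytes to reach them")
```

In this sketch a 16-byte read near the end of the file forces roughly a megabyte of decompression, whereas the same read against the uncompressed file would touch only the requested range.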
More sophisticated compression schemes can alleviate this problem to some degree by providing efficient compressed data indexing mechanisms, but other performance problems still remain. Specifically, due to the characteristics of compression algorithms and compression encodings, when new data is written to an arbitrary location within a file that is stored in compressed form, much of the file may need to be decompressed and then re-compressed in order to correctly encode the modified file containing the newly written data. If file write operations are frequent, the performance loss resulting from this issue in conventional systems can be significant.
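The write penalty can be sketched in the same way. The example assumes the file is held as a single zlib stream, where later encoded data may refer back to earlier bytes, so a small in-place change cannot generally be patched into the compressed form. The function `write_at` and the sample data are hypothetical.

```python
import zlib


def write_at(compressed: bytes, offset: int, new_data: bytes) -> tuple[bytes, int]:
    """Overwrite bytes at a logical offset of a file held as one zlib stream.

    Returns (new_compressed, bytes_recompressed).
    """
    plain = bytearray(zlib.decompress(compressed))   # decode the whole file
    plain[offset:offset + len(new_data)] = new_data  # apply the small write
    return zlib.compress(bytes(plain)), len(plain)   # re-encode the whole file


original = b"".join(f"record {i:06d}\n".encode() for i in range(100_000))
stored = zlib.compress(original)

updated, recompressed = write_at(stored, 14, b"RECORD")
assert zlib.decompress(updated)[14:20] == b"RECORD"
print(f"wrote 6 bytes, re-compressed {recompressed} bytes")
```

Here a 6-byte write causes the entire file to be decompressed and compressed again, which is the cost that becomes significant when writes are frequent.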