Non-applicable
The invention relates to dictionary-based lossless data compression and encryption, particularly with respect to the manner in which the statistical properties of the input data are treated.
The term "data compression" refers to the process of transforming a set of data into a smaller compressed representational form, so that it occupies less space on a storage medium or can be transmitted in less time over a communications channel. Data decompression complements data compression: it restores the original set of data exactly if the process is lossless, or to some approximation if the process is lossy.
Many techniques have been used over the years to compress digital data. However, they are all based on the same few basic principles: statistical coding, dictionary coding, or decorrelation (see: Storer J. A., Data Compression: Method and Theory, Computer Science Press (1993), pp. 6, 7, 10, 11, 103-111, 114, 115, 126-128; Williams R. N., Adaptive Data Compression, Kluwer Academic Publishers (1990); Salomon D., Data Compression, Springer (1997), pp. 35-37, 62-64, 67, 126, 151-157).
A major example of statistical coding is Huffman encoding (see Salomon D., Data Compression, Springer (1997), pp. 62-67). In this method, it is assumed that certain bytes occur more frequently in the file than others. For example, in English text the letters have a characteristic frequency distribution, and the length of the code assigned to a specific letter is inversely related to the frequency of that letter in the file. These bit sequences are chosen to be uniquely decodable. Huffman encoding may greatly expand a file if the pre-assigned scheme assumes frequency statistics considerably different from those actually present in the file. In the general case of a binary file produced by a random source, the frequency distribution could be close to uniform, and Huffman compression will fail.
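The effect of the frequency distribution on Huffman coding can be illustrated with a minimal sketch. It is not taken from the cited references; the function names `huffman_code_lengths` and `encoded_bits` and the two test inputs are assumptions chosen for illustration. It builds a Huffman code from the byte frequencies of the input itself and reports the total encoded size, ignoring the cost of storing the code table.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data):
    """Return {byte value: code length in bits} for a Huffman code built from data."""
    freq = Counter(data)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**left, **right}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encoded_bits(data):
    """Total payload size in bits when data is coded with its own Huffman code."""
    lengths = huffman_code_lengths(data)
    return sum(lengths[b] for b in data)

# Skewed distribution: frequent symbols receive short codes.
skewed = b"a" * 90 + b"b" * 6 + b"c" * 3 + b"d"
# Uniform distribution over all 256 byte values: every code is 8 bits long.
uniform = bytes(range(256)) * 4

skewed_bits = encoded_bits(skewed)      # far fewer than 8 * len(skewed)
uniform_bits = encoded_bits(uniform)    # exactly 8 * len(uniform): no gain
```

With the skewed input the payload shrinks well below 8 bits per byte, whereas with the uniform input every symbol receives an 8-bit code and nothing is gained, which is the failure case described above.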
The dictionary algorithms are variations of the Lempel-Ziv technique of maintaining a "sliding window" of the most recently processed bytes of data and scanning the window for sequences of matching bytes. The input data character stream is compared character-by-character with character sequences stored in a dictionary to check for matches. Typically, the character-by-character comparison is continued until the longest match is determined. Based on the match, the compressed code is determined, and the dictionary is updated with one or more additional character sequences. If the sequence is found, the length of the matching sequence and its offset within the window are output; otherwise, a "raw" byte is output. One example of such a scheme is described in U.S. Pat. No. 6,075,470 (Little), entitled 'Block-wise Adaptive Statistical Data Compression', issued on Jun. 13, 2000 (pp. 9-30). A scheme with parallel compression using different machines is described in U.S. Pat. No. 6,417,789 (Har et al.), entitled 'Highly-Efficient Compression Data Format', issued on Jul. 9, 2002; it does not improve the rate of compression when the published statistical compression is not effective.
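The sliding-window scheme outlined above can be sketched as follows. This is a simplified greedy illustration, not the method of either cited patent; the token layout, the names `lz77_compress` and `lz77_decompress`, and the default window size and minimum match length are assumptions.

```python
def lz77_compress(data, window=255, min_match=3):
    """Greedy sliding-window coder: emits ("match", offset, length) or ("raw", byte)."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Scan the window of most recently processed bytes for the longest match.
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            tokens.append(("match", best_off, best_len))
            i += best_len
        else:
            tokens.append(("raw", data[i]))
            i += 1
    return tokens

def lz77_decompress(tokens):
    """Rebuild the original bytes by copying from the already decoded output."""
    out = bytearray()
    for token in tokens:
        if token[0] == "raw":
            out.append(token[1])
        else:
            _, offset, length = token
            for _ in range(length):
                out.append(out[-offset])
    return bytes(out)

repetitive = b"abcabcabcabcxyz"
tokens = lz77_compress(repetitive)           # three raw bytes, one long match, three raw bytes
no_repeats = lz77_compress(bytes(range(50)))  # nothing repeats: every token is a raw byte
```

The second call shows the limitation discussed in this section: when the input contains no exactly repeated sequences, every token is a raw byte and no compression is obtained.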
These dictionary algorithms require: a) a number of repetitions of the sequence included in the dictionary; b) inclusion of the dictionary sequence in the output, so that the matching rate must be high enough to actually achieve compression; c) an 'exact match' between sequences in an input window and a dictionary. For example, the letters 'b' and 'c' do not match, and the compression will fail, although in binary coding the difference between them is only one bit. Many techniques use a process of adaptation to the statistical description of the data. In general, type-specific compression techniques may provide higher compression performance than general-purpose algorithms on the files for which the techniques are optimized. However, they tend to have much lower compression performance if the file model is not correct.
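The 'b' and 'c' example can be checked with a short sketch, assuming ASCII coding of the letters; the helper name `hamming_distance` is introduced here only for illustration.

```python
def hamming_distance(a, b):
    """Number of bit positions in which two integer codes differ."""
    return bin(a ^ b).count("1")

b_code, c_code = ord("b"), ord("c")              # 0x62 and 0x63 in ASCII (assumed coding)
exact_match = b_code == c_code                   # False: an exact-match dictionary rejects the pair
bit_difference = hamming_distance(b_code, c_code)  # only the lowest bit differs
numeric_distance = abs(b_code - c_code)          # the codes are adjacent as numbers
```

An exact-match comparison treats the two letters as entirely different, while their bit patterns and their numerical values are as close as two distinct codes can be.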
The decorrelation technique is applied to highly correlated data, such as space or medical images, with wavelets or the Fast Fourier Transform serving as a set of basis functions for the expansion of an input image. These transformations are described in detail in: Rao K. R., Yip P. C., Eds., The Transform and Data Compression Handbook, CRC Press (2001), pp. 13-15, 35-37, 61-63, 73-75, 117-123, 161-167, 191. If the input sequence is highly correlated, the coefficients of this transformation decay rapidly, and the series can be truncated, providing compression with some loss of information. These losses may be acceptable for human perception of an image, but they are unacceptable for compression of text or executable files, which are not correlated and for which no losses are acceptable. They are also unacceptable for correlated diagnostic or intelligence images, for which the high-frequency component can have important informative value. A method of using transformations with digital signal processing is described in U.S. Pat. No. 6,333,705 (Amone), issued Dec. 25, 2001.
One example of the decorrelation technique is described in U.S. Pat. No. 6,141,445 (Castelli et al.), entitled 'Multiresolution Lossless/Lossy Compression and Storage of Data for Efficient Processing thereof,' issued on Oct. 31, 2000, which uses a lossy technique to produce lossless compression by applying an orthogonal expansion (which could be a wavelet expansion) to an input sequence (pp. 12-16). After an inverse transform, the residuals between the input data and the wavelet approximation are found, and the sequence of residuals can be compressed using statistical techniques. That patent applies this approach to the general case of random binary data, disregarding the fact that such data may not be correlated. However, the approach is not efficient in that case: the sequence of coefficients of these orthogonal transformations does not decay, and it cannot be truncated. As a result, the compressed file may be longer than the input file.
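The dependence of coefficient decay on correlation can be illustrated with a minimal sketch. It uses an unnormalized Haar wavelet as a stand-in for the orthogonal expansions mentioned above; the function names, the threshold of 1.0, and both test signals are assumptions, and the sketch is not the procedure of the cited patent.

```python
import random

def haar_details(signal):
    """All detail coefficients of a full Haar decomposition (length must be a power of two)."""
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.extend((a - b) / 2 for a, b in pairs)   # local differences
        approx = [(a + b) / 2 for a, b in pairs]        # local averages
    return details

def count_small(coeffs, threshold=1.0):
    """Number of coefficients that could be dropped with little loss."""
    return sum(1 for c in coeffs if abs(c) < threshold)

# Highly correlated input: a slowly rising staircase.
smooth = [100 + i // 4 for i in range(64)]
# Uncorrelated input: pseudo-random byte values.
rng = random.Random(0)
noisy = [rng.randrange(256) for _ in range(64)]

small_smooth = count_small(haar_details(smooth))
small_noisy = count_small(haar_details(noisy))
```

For the staircase, 56 of the 63 detail coefficients fall below the threshold and could be truncated; for the pseudo-random signal very few do, so truncation is not available and the transform provides no saving.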
The difference dictionary scheme is described in U.S. Pat. No. 5,977,889 (Cohen), entitled 'Optimization of Data Representation for Transmission of Storage Using Differences from References Data', issued on Nov. 2, 1999. This scheme uses the difference in the number of characters between the dictionary sequence and the input data, while the selected sub-string must exactly match the dictionary sub-string.
The data compression process removes redundancy from the data, and this procedure can be related to the process of data encryption. The term "data encryption" refers to the process of transforming a set of data into a new representational form that prevents unauthorized reading of this data from a storage medium or a communications channel. The data decryption process restores exactly the original set of data from the encrypted representation. U.S. Pat. No. 6,411,714 (Yoshiura et al.), entitled 'Data decompression/decryption method and system', issued on Jun. 25, 2002, uses a statistical distribution of data and interlocks the processes of compression and encryption, but the process of compression is not improved.
A random number generator (RNG) is a software program or hardware circuit that uses a recursive mathematical expression or shift operations in a register to produce a stream of random numbers. In the prior art, a random number generator is used only to encrypt the data, not to improve compression. See, for example, U.S. Pat. No. 6,122,379 (Barbir), entitled 'Method and Apparatus for Performing Simultaneous Data Compression and Encryption', issued on Sep. 19, 2000. Another example of using an RNG for data coding is U.S. Pat. No. 6,351,539 (Djakovic), entitled 'Cipher Mixer with Random Number Generator', issued on Feb. 26, 2002, which does not perform any compression, but only encrypts the data using an RNG. Because the RNG is actually a deterministic means and its sequence is used in a deterministic order, these procedures can be broken. Moreover, the process of compression is not successful for many data types, and any redundancy remaining in the original data could be used to decode the ciphered information.
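Both kinds of generator mentioned above can be sketched briefly. The linear congruential constants and the 16-bit shift register taps are widely published textbook values, chosen here as assumptions for illustration and not taken from the cited patents; the names `lcg`, `lfsr16`, and `take` are likewise hypothetical.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Recursive expression: x[n+1] = (a * x[n] + c) mod m."""
    state = seed
    while True:
        state = (a * state + c) % m
        yield state

def lfsr16(seed=0xACE1):
    """16-bit Fibonacci shift register with taps 16, 14, 13, 11; seed must be nonzero."""
    state = seed & 0xFFFF
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state

def take(gen, n):
    """First n values of a generator."""
    return [next(gen) for _ in range(n)]

# The same seed always reproduces the same stream: the generator is deterministic.
first_run = take(lcg(42), 5)
second_run = take(lcg(42), 5)
register_stream = take(lfsr16(), 5)
```

Because the stream is fully determined by the seed and the recursion, anyone who knows both can regenerate it, which is the weakness noted above when the sequence is consumed in its generated order.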
The hashing technique, which is actually a random transformation, is used mainly to perform an efficient sequence search. The compression procedure requires an exact match between an input string and a dictionary string (see, for example, U.S. Pat. No. 5,406,279 (Anderson et al.), issued on Apr. 11, 1995). The distribution of the output of the hashing procedure cannot be adjusted to the distribution of the input sequence. Furthermore, the hashing procedure produces collisions because it sends different numbers to the same place, and the process of transformation is degraded. The dictionary technique is described in U.S. Pat. No. 6,088,699 (Gampper), issued Dec. 25, 2001.
In view of the above deficiencies of known data compression methods, the need exists for more efficient data compression techniques. The new trends in the technology of data processing and communication require a new approach to the old problems of data compression:
a) While microprocessing devices can achieve high speed with parallel processing, large random-access memory, and low energy consumption, mechanical storage means remain the major obstacle to their use in hand-held and on-board computers, so it is crucial to achieve a relatively high compression ratio for data that can be stored in compressed form.
b) While communication means and networks can use high-speed processors at the sending and receiving sites, the time they need to occupy a channel is often unacceptable. This is especially important for communication with flying objects, including satellites, because they need to transmit enormous amounts of data over a narrow bandwidth in the limited period of time during which they pass over areas where transmissions can be received.
c) Business data warehouses have accumulated great amounts of data, and the maintenance of these archives is an expensive and slow procedure, so data compression becomes an important tool for improving performance.
In many cases, including the ones listed above, it is preferable to use extensive computing power to compress and decompress data because this can result in a considerable competitive advantage. It is thus desirable to achieve the following goals, which are addressed by the present invention:
a) Use a calculated dictionary, instead of a dictionary derived from the input data, to eliminate the necessity of the inclusion of the stored dictionary in the compressed data.
b) Fit each sample of the input binary data with some level of approximation, instead of trying to fit the random input data exactly with a dictionary derived from a segment of the input data; adapt the descriptive statistics of the calculated dictionary to the statistics of the input data, instead of adapting the dictionary to a set of exactly matched characters.
c) Apply the process of compression recursively, instead of stopping it if sub-strings in a dictionary do not match exactly the sub-string of the input.
d) Use the redundancy of the input data compared with a random calculated dictionary with the same frequency distribution, instead of relying on an unpredictable number of repetitions in the input data.
e) Use the transformation of the original data into a new sequence of samples approximated with an RNG, from which redundancy has been excluded, for data encryption, utilizing the RNG in the order driven by the input data instead of the order in which the dictionary sequence was generated.
The shortcomings of the known techniques are overcome and additional advantages are provided through the provision of a method for processing data, including cases in which the known methods are impossible to use. In particular, the present invention introduces the concept of lossless approximation of a text, an image, or executable files with an adaptive random number generator (RNG) that is used as a dictionary. This concept is presently unknown in the field of data compression and is considered impossible. A solution that is simple and fast enough for practical implementation is found in the present invention.
An input sequence of text or binary data is arranged in a sequence of groups of bits. Each group is then transformed into a sequence of samples of predetermined amplitude and length, which comprises a block of an input numerical sequence (BINS) in a memory means. A frequency distribution is then found for the amplitudes of this BINS.
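The formation of a BINS and its frequency distribution can be sketched as follows. This is a minimal illustration under stated assumptions: the names `to_samples` and `frequency_distribution`, the big-endian bit grouping, the zero padding of the last group, and the use of equal-width sub-ranges are all choices made here, not details given in the text.

```python
def to_samples(data, bits_per_sample):
    """Group the bits of the input into samples of a predetermined width."""
    bits = "".join(f"{byte:08b}" for byte in data)
    # Assumption: pad the last group with zero bits; a real codec would also
    # record the original length so that the padding can be removed.
    bits += "0" * (-len(bits) % bits_per_sample)
    return [int(bits[i:i + bits_per_sample], 2)
            for i in range(0, len(bits), bits_per_sample)]

def frequency_distribution(samples, bits_per_sample, num_subranges):
    """Count how many sample amplitudes fall into each equal-width sub-range."""
    span = 1 << bits_per_sample            # amplitudes lie in [0, span)
    counts = [0] * num_subranges
    for s in samples:
        counts[s * num_subranges // span] += 1
    return counts

block = to_samples(b"\x12\x34\x56\x78", 16)        # two 16-bit samples
histogram = frequency_distribution(block, 16, 4)   # four sub-ranges of the amplitude
```

Here the four input bytes become the two samples 0x1234 and 0x5678, which fall into the first and second of the four sub-ranges.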
The present invention uses an RNG to generate a numerical sequence that covers the whole range of changes of the BINS, and this RNG is then used as a calculated dictionary. A certain transformation is then used to make the frequency distribution of the dictionary sequence similar to the frequency distribution of the BINS. As a result, the changes of the dictionary sequence are similar to the changes of the BINS not only over the whole range up to the peak amplitude, but also within a predetermined number of sub-ranges. This operation reduces the distances between the samples of the input and the dictionary.
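One possible form of such a transformation can be sketched as follows. The text does not specify the transformation, so everything here is an assumption made for illustration: the name `adapted_dictionary`, the use of Python's `random.Random` as the generator, equal-width sub-ranges, and the rule that each dictionary sample is placed in a sub-range with probability proportional to the input frequency of that sub-range.

```python
import random

def adapted_dictionary(samples, dict_len, bits_per_sample, num_subranges, seed):
    """Generate a dictionary whose sub-range frequencies follow those of the input."""
    rng = random.Random(seed)                       # reproducible from the seed alone
    width = (1 << bits_per_sample) // num_subranges  # assumes an exact division
    counts = [0] * num_subranges
    for s in samples:
        counts[s // width] += 1
    dictionary = []
    for _ in range(dict_len):
        # Choose a sub-range in proportion to how often the input visits it,
        # then draw a value uniformly inside that sub-range.
        sub = rng.choices(range(num_subranges), weights=counts)[0]
        dictionary.append(sub * width + rng.randrange(width))
    return dictionary

low_samples = [3, 10, 20, 60, 5, 40]          # the input occupies only the lowest sub-range
low_dict = adapted_dictionary(low_samples, 16, 8, 4, seed=1)

mixed_samples = [0, 1, 2, 200]                # the input occupies sub-ranges 0 and 3 only
mixed_dict = adapted_dictionary(mixed_samples, 16, 8, 4, seed=1)
```

Sub-ranges that the input never visits receive no dictionary samples, so the generated values cluster where the input values lie and the distances to the nearest dictionary sample tend to shrink.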
Each sample of the BINS is compared to all samples of the dictionary sequence in a look-up operation, until the dictionary sample that has the least distance to the current BINS sample is found. The input sequence is then replaced with the sequence of the least distances and the sequence of indexes of the dictionary samples that produced these distances. Thus the main problem with the currently available methods, namely producing an exactly matched dictionary, is eliminated. The dictionary itself is then eliminated from the memory and does not comprise a part of the compressed data, so the rate of data compression is improved further. As a result, an unpredictable and perhaps even uniformly distributed input sequence is transformed into a new sequence in which the subsequence of distances contains many repetitions and zeros. Applying currently available statistical or dictionary means to this new sequence therefore gives better compression results than applying them to the original sequence.
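The look-up operation described above can be sketched as follows. The names `approximate` and `restore`, the use of a signed difference as the stored distance, and the small hand-written dictionary are assumptions for illustration; in the described method the dictionary would be produced by the RNG.

```python
def approximate(samples, dictionary):
    """Replace each sample by the index of its nearest dictionary sample and the residual."""
    indexes, distances = [], []
    for s in samples:
        # Exhaustive look-up: the dictionary sample with the least absolute distance.
        idx = min(range(len(dictionary)), key=lambda i: abs(s - dictionary[i]))
        indexes.append(idx)
        distances.append(s - dictionary[idx])    # signed, so the step is reversible
    return indexes, distances

def restore(indexes, distances, dictionary):
    """Exact inverse: dictionary sample plus stored distance."""
    return [dictionary[i] + d for i, d in zip(indexes, distances)]

dictionary = [10, 50, 200]
samples = [12, 49, 200, 11]
indexes, distances = approximate(samples, dictionary)
recovered = restore(indexes, distances, dictionary)
```

No sample has to match a dictionary entry exactly: the small signed distances carry the difference, so the original samples are recovered without loss.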
Because the length of the dictionary is predetermined to be smaller than the peak amplitude of the RNG, the average size of a dictionary index is smaller than the average amplitude, and the least distances are small too; the output sequence therefore requires less memory than the input sequence, thus achieving additional data compression. These properties of the output sequence improve the efficiency of data compression where the current methods fail, because the published techniques can only compress data with a large enough number of repetitions and with an exactly matched dictionary, while in reality an input subsequence may be unpredictable. Furthermore, the RNG can be exactly reproduced during decompression. Thus the dictionary in the present invention is eliminated from the output data, which is considered impossible for the previous decompression methods. In the prior art the dictionary is part of the output because it is not calculated, but continuously derived from the input data by a process of perfect matching.
Further, the compression is performed with several different RNGs to determine which one leads to the least distances. Choosing this RNG over the others further improves the compression. As a result, the process of compression is improved by an operation of successive refinement if the rate of compression cannot be improved for the current BINS with one dictionary operation.
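The selection among several generators can be sketched as follows. As a simplifying assumption, different seeds of Python's `random.Random` stand in for the "different RNGs" of the text, and the sum of absolute least distances serves as the selection criterion; the names `make_dictionary`, `total_distance`, and `best_seed` are hypothetical.

```python
import random

def make_dictionary(seed, length, max_amplitude):
    """Calculated dictionary: fully determined by the seed and the two parameters."""
    rng = random.Random(seed)
    return [rng.randrange(max_amplitude + 1) for _ in range(length)]

def total_distance(samples, dictionary):
    """Sum over all samples of the least absolute distance to the dictionary."""
    return sum(min(abs(s - d) for d in dictionary) for s in samples)

def best_seed(samples, seeds, length, max_amplitude):
    """Pick the candidate generator whose dictionary approximates the block best."""
    return min(seeds, key=lambda seed: total_distance(
        samples, make_dictionary(seed, length, max_amplitude)))

samples = [17, 200, 34, 90, 145, 8, 250, 61]
seeds = [1, 2, 3, 4, 5]
costs = {seed: total_distance(samples, make_dictionary(seed, 16, 255)) for seed in seeds}
chosen = best_seed(samples, seeds, 16, 255)
```

Only the identity of the chosen generator needs to be recorded, since the decompressor can regenerate the same dictionary from it.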
The parameters of the process of compression are accumulated in a protocol, thus allowing for a completely reversible decompression process. The protocol comprises predetermined and calculated parameters: the amplitude, the lengths of the input BINS and of the dictionaries, descriptive statistics for the input sequence, and the maximum, the minimum, and a metric for the frequency distribution of the dictionary sequence for each RNG. The present invention provides for the efficient storage and transmission of digital data.
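The role of the protocol in making the process reversible can be sketched as follows. The `Protocol` record below is a reduced, hypothetical subset of the parameters listed above (it omits the descriptive statistics and the frequency-distribution metric), and the generator, field names, and function names are assumptions for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Protocol:
    seed: int            # identifies the RNG state used for the dictionary
    dict_len: int        # length of the calculated dictionary
    max_amplitude: int   # peak amplitude of samples and dictionary values
    block_len: int       # number of samples in the BINS

def make_dictionary(p):
    """Regenerate the dictionary from the protocol alone."""
    rng = random.Random(p.seed)
    return [rng.randrange(p.max_amplitude + 1) for _ in range(p.dict_len)]

def compress(samples, seed, dict_len, max_amplitude):
    p = Protocol(seed, dict_len, max_amplitude, len(samples))
    d = make_dictionary(p)
    indexes, distances = [], []
    for s in samples:
        i = min(range(len(d)), key=lambda k: abs(s - d[k]))
        indexes.append(i)
        distances.append(s - d[i])
    # The dictionary itself is discarded; only the protocol travels with the data.
    return p, indexes, distances

def decompress(p, indexes, distances):
    d = make_dictionary(p)   # exact reproduction of the compressor's dictionary
    return [d[i] + r for i, r in zip(indexes, distances)]

samples = [17, 200, 34, 90, 145, 8, 250, 61]
protocol, indexes, distances = compress(samples, seed=7, dict_len=16, max_amplitude=255)
restored = decompress(protocol, indexes, distances)
```

The decompressor receives only the protocol, the indexes, and the distances, and rebuilds the dictionary itself, so the dictionary never appears in the output.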
The methodology described above is used with many repetitions in a loop, replacing the input with the output sequence, for compression and for encryption. Regardless of the results of the process of data compression, the same process is used for encryption, because the present invention always removes redundancy from the input data, which is the major factor exploited in unauthorized decryption. A further factor that improves the process of encryption is that the sequence of the RNG is used in the order of the best approximation of the input numerical samples, not in the order in which it is generated. This significantly improves the level of protection of the input data against an attempt to break the code, because the sequence of the RNG is permuted randomly.
As will be appreciated, the invention is capable of other and different embodiments, and its several details are capable of modifications in various respects, all without departing from the spirit of the invention. Accordingly, the drawings and description of the preferred embodiments set forth below are to be regarded as illustrative in nature and not restrictive.