1. Technical Field of the Invention
The present invention generally relates to complexity estimation. More specifically, the present invention is directed to a data transmission system that efficiently estimates the Kolmogorov Complexity of a given finite string (hereinafter, "string") in order to determine whether to allow or reject transmission of data in the data transmission system.
2. Description of the Prior Art
The Kolmogorov Complexity is a fundamental measure of information with growing applications and importance. The estimation of the Kolmogorov Complexity is key to objective information system monitoring and analysis. There are many applications of the Kolmogorov Complexity, such as information assurance and network security, where estimates of the Kolmogorov Complexity identify abnormal behavior. However, every application of the Kolmogorov Complexity is limited by its incomputable nature, and each therefore benefits from improvements or innovations in the ability to estimate the Kolmogorov Complexity efficiently.
The Kolmogorov Complexity is a measure of the descriptive complexity in an object, i.e., a given string. More specifically, it refers to the minimum length of a program such that a universal computer can generate the specific sequence of the given string. The Kolmogorov Complexity can be described with the following equation:
\[
K_{\varphi}(x) = \min_{\varphi(p) = x} l(p).
\]
In this equation, φ represents a universal computer, p represents a program, l(p) represents the length of the program p, and x represents a string. The Kolmogorov Complexity equation dictates that a random string has a rather high complexity, which is on the order of its length, as patterns in the random string cannot be discerned to reduce the size of a program that generates the random string. On the other hand, a string with a large amount of structure has a fairly low complexity. Because universal computers can be equated through programs of constant length, a mapping can be made between universal computers of different types. The Kolmogorov Complexity of a given string on two different computers differs by known or determinable constants.
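By way of illustration only, the constant-difference property can be written as follows, where ψ denotes a second universal computer and c is a constant; both symbols are introduced here for illustration and do not appear elsewhere in this description. The constant depends only on the pair of computers and not on the string x.

\[
\left| K_{\varphi}(x) - K_{\psi}(x) \right| \le c_{\varphi,\psi} \quad \text{for all strings } x.
\]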
A given string can also be supplied as input in order to reduce the complexity, i.e., the minimum length of a program necessary to produce a new string. More specifically, the Kolmogorov Complexity K(y|x) of a string y, given a string x as input, is described by the equation:
\[
K_{\varphi}(y \mid x) =
\begin{cases}
\min_{\varphi(p, x) = y} l(p), \\
\infty, & \text{if there is no } p \text{ such that } \varphi(p, x) = y.
\end{cases}
\]
In this equation, l(p) represents the length of the program p, and φ is the particular universal computer under consideration. Consequently, knowledge of or input of the string x may reduce the complexity or program size necessary to produce the new string y.
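By way of non-limiting illustration, the effect of supplying x as input can be observed with an ordinary compressor. The Python sketch below is an assumption-laden stand-in and not the Kolmogorov Complexity itself: it uses the standard zlib library (a DEFLATE compressor, not a universal computer), and the function names `compressed_bits` and `conditional_compressed_bits` are hypothetical. The string x is passed to the compressor as a preset dictionary, so a string y that resembles x needs fewer bits to describe.

```python
import random
import zlib


def compressed_bits(y: bytes) -> int:
    """Bits needed to describe y alone, using zlib as a stand-in compressor."""
    return 8 * len(zlib.compress(y, 9))


def conditional_compressed_bits(y: bytes, x: bytes) -> int:
    """Bits needed to describe y when x is given to zlib as a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=x)
    return 8 * len(compressor.compress(y) + compressor.flush())


rng = random.Random(0)  # fixed seed so that the run is repeatable
x = bytes(rng.getrandbits(8) for _ in range(500))  # pseudo-random, hard to compress
y = x + b"tail"  # y is almost entirely a copy of x

print(compressed_bits(y), conditional_compressed_bits(y, x))
```

The second figure printed is expected to be much smaller than the first, because nearly all of y can be copied from x.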
The major difficulty with the foregoing Kolmogorov Complexity equations is that they are incomputable. More specifically, the length of any program that produces the given string represents an upper bound on the Kolmogorov Complexity for that string. However, the lower bound on the Kolmogorov Complexity of the given string remains incomputable.
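By way of non-limiting illustration, the upper-bound property can be sketched in Python. The sketch assumes that a compressed representation, together with a fixed decompression routine of constant size, is one program that reproduces the string, so the compressed length bounds the complexity from above up to that constant. It uses the standard zlib library purely for convenience, and the function name `complexity_upper_bound_bits` is hypothetical.

```python
import random
import zlib


def complexity_upper_bound_bits(data: bytes) -> int:
    """Length in bits of one particular description of data (its zlib output).

    This is only an upper bound, up to the constant size of the decompressor;
    a shorter description may exist, and no algorithm can rule that out.
    """
    return 8 * len(zlib.compress(data, 9))


structured = b"ab" * 500  # highly patterned, 1000 bytes
rng = random.Random(0)  # fixed seed so that the run is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(1000))  # pseudo-random, 1000 bytes

print(complexity_upper_bound_bits(structured))
print(complexity_upper_bound_bits(noise))
```

The patterned input yields a far smaller bound than the pseudo-random input of the same length, consistent with the discussion of structured and random strings above.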
The Lempel/Ziv 78 universal compression algorithm (i.e., “LZ78 compression algorithm”) has been used for estimating the Kolmogorov Complexity in various applications, most notably information and network security. In these applications, the LZ78 compression algorithm is used to compress a network protocol or other information, and an inverse compression ratio of the compressed protocol or information to the uncompressed protocol or information is used to estimate the Kolmogorov Complexity. More specifically, because the Kolmogorov Complexity is an ultimate compression bound for a given string, a natural choice for estimating the Kolmogorov Complexity is the LZ78 compression algorithm. The LZ78 compression algorithm defines a measure of complexity for a finite string rooted in the ability to produce the string from simple copy operations. However, computing the estimate for the Kolmogorov Complexity of a string utilizing the LZ78 compression algorithm requires performing the entire compression process of the LZ78 compression algorithm and then computing an inverse compression ratio as a measure of the complexity of the string. These steps represent time and resource inefficiencies in computing the estimate for the Kolmogorov Complexity.
FIG. 1 is a representation of a prior art LZ78 compression algorithm 100. A finite string of length L is inputted into the LZ78 compression algorithm 100 via known methods and means at step 102. At step 104, using the inputted string, the LZ78 compression algorithm 100 forms an LZ78 partition, which is a central aspect of the LZ78 compression algorithm 100. More specifically, the LZ78 algorithm partitions a finite string into phrases that have not been seen before, thereby forming a codebook that enables the encoding of the finite string with small indices, provided that the finite string is long and repetitious. For example, considering an inputted finite string of “1011010010011010010011101001001100010”, the LZ78 compression algorithm 100 forms an LZ78 partition by successively identifying sub-strings that have not yet been identified, i.e., phrases, and inserting commas after the phrases. The following LZ78 partition results from the inputted finite string: “1,0,11,01,00,10,011,010,0100,111,01001,001,100,010”.
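By way of non-limiting illustration, the partitioning of step 104 can be sketched in Python as follows. The function name `lz78_partition` is hypothetical, and the handling of a trailing sub-string that repeats an earlier phrase (it is simply kept as the final phrase) is an assumption chosen to match the example partition given above.

```python
def lz78_partition(s):
    """Split s into LZ78 phrases.

    Each phrase is the shortest prefix of the remaining input that has not
    already been identified as a phrase.
    """
    seen = set()
    phrases = []
    current = ""
    for symbol in s:
        current += symbol
        if current not in seen:
            # A sub-string not seen before: record it and start a new phrase.
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # The input ended inside an already-seen phrase; keep the leftover.
        phrases.append(current)
    return phrases


example = "1011010010011010010011101001001100010"
print(",".join(lz78_partition(example)))
```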
Further with reference to FIG. 1, at step 106, the LZ78 compression algorithm 100 determines the encoding of the phrases in the partition for the inputted string. The encoding is accomplished by representing each phrase in the partition as an integer pair. The first integer of the integer pair is the index of the earlier phrase that matches all but the last bit of the current phrase (if there is no such prefix, an index of zero is used). The second integer in the integer pair corresponds to the last bit of the phrase, i.e., a one or a zero. In the foregoing example, the following set of integer pairs is produced: (0,1), (0,0), (1,1), (2,1), (2,0), (1,0), (4,1), (4,0), (8,0), (3,1), (9,1), (5,1), (6,0), and (4,0). Each integer pair (j, k) in the set of integer pairs is replaced with a single integer equal to 2j+k. The replacement maps each distinct integer pair (or phrase) to a distinct integer. The foregoing set of integer pairs maps to the following integers: 1, 0, 3, 5, 4, 2, 9, 8, 16, 7, 19, 11, 12, and 8.
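By way of non-limiting illustration, the pair encoding of step 106 and the 2j+k replacement can be sketched in Python as follows. The function names `lz78_pairs` and `pair_to_integer` are hypothetical, phrases are assumed to be numbered from one in order of appearance, and the input is assumed to be a binary string.

```python
def lz78_pairs(s):
    """Return one (prefix_index, last_bit) pair per LZ78 phrase of binary string s."""
    index_of = {"": 0}  # the empty prefix has index zero; phrases are numbered from 1
    pairs = []
    current = ""
    for symbol in s:
        current += symbol
        if current not in index_of:
            index_of[current] = len(index_of)  # next unused phrase index
            pairs.append((index_of[current[:-1]], int(symbol)))
            current = ""
    if current:
        # Leftover input repeats an earlier phrase; encode it the same way.
        pairs.append((index_of[current[:-1]], int(current[-1])))
    return pairs


def pair_to_integer(j, k):
    """Map the pair (j, k) to the single integer 2j + k."""
    return 2 * j + k


example = "1011010010011010010011101001001100010"
pairs = lz78_pairs(example)
integers = [pair_to_integer(j, k) for j, k in pairs]
print(pairs)
print(integers)
```

Running the sketch on the example string produces one pair, and therefore one integer, per phrase of the partition.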
Still further with reference to FIG. 1, at step 108, the phrases identified by the integers are encoded using binary codes padded with zeros so that each codeword has a length of ⌈log2(kj)⌉ bits for index number j and alphabet size k. The concatenation of the codes forms the compressed string at step 108. In order to estimate the Kolmogorov Complexity, at step 110, the LZ78 compression algorithm 100 computes an inverse compression ratio of the compressed string to the original inputted string. The inverse compression ratio is based on the size (in bits or bytes) of the compressed string with respect to the original uncompressed string. The complexity estimate is outputted at step 112 and may be utilized by the aforementioned applications, such as information security.
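By way of non-limiting illustration, the size computation of steps 108 and 110 can be sketched in Python as follows. The function names are hypothetical. The sketch assumes a binary alphabet (k = 2), assumes that the original string occupies one bit per symbol, and computes only the lengths of the codewords rather than the codewords themselves, since the ratio depends only on those lengths.

```python
def count_lz78_phrases(s):
    """Count the LZ78 phrases in s, including a trailing repeated phrase if any."""
    seen = set()
    current = ""
    count = 0
    for symbol in s:
        current += symbol
        if current not in seen:
            seen.add(current)
            count += 1
            current = ""
    return count + (1 if current else 0)


def lz78_compressed_bits(num_phrases, alphabet_size=2):
    """Sum of the codeword lengths ceil(log2(k * j)) for j = 1..num_phrases."""
    # For an integer n >= 1, (n - 1).bit_length() equals ceil(log2(n)),
    # which avoids floating-point rounding.
    return sum((alphabet_size * j - 1).bit_length()
               for j in range(1, num_phrases + 1))


def inverse_compression_ratio(s):
    """Compressed size divided by original size for a non-empty binary string s."""
    return lz78_compressed_bits(count_lz78_phrases(s)) / len(s)


example = "1011010010011010010011101001001100010"
print(inverse_compression_ratio(example))
```

For the 37-bit example string the sketch counts 14 phrases and 55 codeword bits, so the ratio exceeds one; under these assumptions, ratios below one arise only for longer, more repetitive inputs.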
Therefore, there is a need in the art for an efficient method and system that estimates the Kolmogorov Complexity of a finite string without performing the inefficient encoding, compression, and inverse compression ratio computation associated with the LZ78 compression algorithm's estimation of the Kolmogorov Complexity.