The concept of Kolmogorov complexity is based on the amount of information contained in a string and one's ability to replicate that information with a program or model that is shorter than the original data. For instance, a sequence “x” composed of 10,000,000 zeroes could be represented by a much shorter program that generates the string by concatenating 10,000,000 zeroes. The minimum length of such a program P(x) is denoted |P(x)| = K(x).
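As a sketch of this idea, the generating program for the all-zero string can itself be written as one short line of Python; the length of that source text serves as a (very loose) upper bound on K(x):

```python
# The data: 10,000,000 zeroes -- about 10 MB of raw characters.
x = "0" * 10_000_000

# The generating "program" is just the short expression above; its
# source text is a tiny upper bound on K(x).
program = 'x = "0" * 10_000_000'
print(len(x), len(program))  # prints: 10000000 20
```

The data is roughly six orders of magnitude longer than the program that reproduces it, which is exactly the gap Kolmogorov complexity measures.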
To do the same with another string “y”, using P(x) as the starting point for encoding y, one can denote the resulting program as P(x,y) and its minimum length as |P(x,y)| = K(x,y). If “y” is similar to “x”, then P(x) is a good starting point for P(x,y), and only small changes are needed to generate “y”. Conversely, when “y” is completely unrelated to “x”, the starting point provides no advantage. This concept is captured by the Normalized Information Distance (NID), which is defined as:
NID(x, y) = (K(x, y) − min{K(x), K(y)}) / max{K(x), K(y)}
Since K(x) is not computable, one must use a surrogate. To this end, a compressor may be used, which compresses a string to make its storage or transmission more efficient. One can denote the length of the compressed string “x” as C(x). The resulting metric, analogous to the NID, is the Normalized Compression Distance (NCD), which is defined as:
NCD(x, y) = (C(x, y) − min{C(x), C(y)}) / max{C(x), C(y)}

where C(x, y) is the compressed length of the concatenation of “x” and “y”. When C(x) < C(y), the metric NCD(x,y) compares the cost of compressing string “y” using string “x” as a previously compressed database (numerator) with the cost of compressing string “y” from scratch (denominator).
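A minimal sketch of the NCD, using zlib as the compressor and approximating C(x, y) by the compressed length of the concatenation; the function names `C` and `ncd` are illustrative, not a standard API:

```python
import zlib

def C(s: bytes) -> int:
    """Compressed length of s -- the computable surrogate for K(s)."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"the quick brown fox jumps over the lazy dog " * 50
b = b"the quick brown fox jumps over the lazy dog! " * 50  # near-duplicate of a
c = bytes(range(256)) * 9                                   # unrelated to a

print(ncd(a, b))  # small: compressing b "given" a adds little
print(ncd(a, c))  # near 1: a provides no advantage for c
```

In practice the quality of the approximation depends on the compressor: a good NCD compressor should be "normal" in the sense that compressing x twice costs about the same as compressing it once.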
These concepts have previously been used to create static classifications: affinity groups in music [showing musical similarities and differences among various composers], linguistic taxonomies [showing the hierarchical grouping of many natural languages], biological taxonomies [showing the hierarchical grouping of animals based on DNA similarities], etc.