1. Field of the Invention
The present invention relates to the area of lossless data compression. More specifically, the present invention relates to a method for detecting and separating data blocks with stationary informational characteristics as a preliminary stage for lossless data compression.
2. Background Discussion
Lossless compression is a process of economically representing source information. Lossless compression methods are used in many areas, principally in data storage and data transmission, and improved coding techniques are continually sought to reduce the amount of memory required to store data and/or to increase the amount of data that can be transmitted over a communications channel in a given amount of time. As a general rule, lossless compression methods are more efficient when applied to particular types of data they are specifically designed to compress. Furthermore, many compression methods are sensitive to changes in data characteristics. Detecting and separating data blocks of a particular type with stationary informational characteristics (solid block splitting) is a very important preliminary stage in many compression technologies, especially those operating on independent blocks (such as Burrows-Wheeler compression).
There have been several efforts to implement an intelligent data block detection mechanism. However, while data type detection is more or less successful in most cases, existing solid block splitting solutions are less than satisfactory.
A widely used and straightforward approach to solid block splitting works as follows: Small portions, or blocks (b), of data within an input data set are consecutively analyzed, and decisions are made on whether they should be added to a solid block (B) currently being formed. If not, a rejected block is considered to be the beginning of the next solid block to be formed. The decision is usually based on a comparison function F(B, b) and a threshold δ: if F(B, b) ≤ δ, then block b is appended to the solid block, B = B + b.
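The splitting loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the names `split_solid_blocks` and `mean_byte_distance` are hypothetical, the fixed block size is an assumption, and the comparison function is supplied by the caller. The toy comparison shown here (difference of mean byte values) merely stands in for F(B, b).

```python
from typing import Callable, List


def split_solid_blocks(data: bytes, block_size: int,
                       compare: Callable[[bytes, bytes], float],
                       delta: float) -> List[bytes]:
    """Greedy splitting: append b to B while compare(B, b) <= delta."""
    solid_blocks: List[bytes] = []
    current = b""  # solid block B currently being formed
    for start in range(0, len(data), block_size):
        small = data[start:start + block_size]  # next small block b
        if not current or compare(current, small) <= delta:
            current += small  # B = B + b
        else:
            solid_blocks.append(current)
            current = small  # the rejected block starts the next solid block
    if current:
        solid_blocks.append(current)
    return solid_blocks


def mean_byte_distance(big: bytes, small: bytes) -> float:
    """Toy stand-in for F(B, b): absolute difference of mean byte values."""
    return abs(sum(big) / len(big) - sum(small) / len(small))
```

Any function with the same two-argument shape, including the entropy-based and size-based comparisons discussed in this section, could be passed as `compare`.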
The following terms have the definitions set out below.
$N$: alphabet size;
$X \in \{B, b\}$: block;
$X_{\Sigma}$: the number of symbols in block $X$;
$X_{sym}$: the number of times symbol $sym$ appears in block $X$;
$S$: the number of different possible statistical states (statistical states allow the use of comprehensive context-based estimations);
$X_{st}$: the number of times statistical state $st$ appears in block $X$;
$X_{sym}^{st}$: the number of times symbol $sym$ appears in statistical state $st$ within block $X$;
$p_{sym}^{st}\left(\{X_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right)$: estimated probability of symbol $sym$ appearing in state $st$ within block $X$;
$p_{st}\left(\{X_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right)$: estimated probability of state $st$ appearing within block $X$.
The following formulas describe evident relations between these quantities:
$$X_{\Sigma} = \sum_{sym=1}^{N} X_{sym} = \sum_{st=1}^{S} X_{st}, \qquad X_{st} = \sum_{sym=1}^{N} X_{sym}^{st}, \qquad X_{sym} = \sum_{st=1}^{S} X_{sym}^{st}.$$
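As an illustration of these count relations, the following sketch gathers the four kinds of counts for a block of bytes. It assumes, purely for illustration, that the statistical state of a symbol is the byte that precedes it (with state 0 for the first symbol), so that S = N = 256; the text above does not prescribe how states are defined. The name `collect_counts` is hypothetical.

```python
from collections import Counter


def collect_counts(block: bytes):
    """Return (X_total, X_sym, X_st, X_sym_st) for one block.

    Assumption: the state of a symbol is the preceding byte
    (0 for the first symbol of the block).
    """
    x_sym, x_st, x_sym_st = Counter(), Counter(), Counter()
    state = 0
    for sym in block:
        x_sym[sym] += 1                # occurrences of the symbol
        x_st[state] += 1               # occurrences of the state
        x_sym_st[(state, sym)] += 1    # symbol seen in this state
        state = sym                    # next state is the symbol just seen
    return len(block), x_sym, x_st, x_sym_st
```

Summing the joint counts over symbols recovers the state counts, and summing them over states recovers the symbol counts, as the relations above require.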
In most cases, probability estimations are calculated using the following two formulas:
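$$p_{sym}^{st}\left(\{X_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right) = \frac{\alpha_1 \cdot X_{sym}^{st} + 1}{\alpha_1 \cdot X_{st} + N}, \qquad p_{st}\left(\{X_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right) = \frac{\alpha_2 \cdot X_{st} + 1}{\alpha_2 \cdot X_{\Sigma} + S}, \qquad \alpha_1, \alpha_2 \geq 0.$$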
Collected statistics may be insufficient if blocks are small but the number of states is large. Therefore, it may be impossible to estimate probabilities reliably. High computational complexity is also a problem, because recalculating probabilities for every block may be unacceptable. With increasing block size, the complexity problem may diminish and probability estimation becomes more precise, but block splitting efficiency still cannot be guaranteed as relatively large blocks cannot precisely separate small areas of data with stationary informational characteristics.
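The two state-based estimates, and the effect of sparse statistics on them, can be sketched as below. The function names are hypothetical, and the default values α1 = α2 = 1 are assumptions chosen for illustration. The additive constant in the denominator of the state estimate is taken to be the number of states S, which makes the state probabilities sum to one.

```python
def p_sym_given_state(x_sym_st: int, x_st: int, n: int,
                      alpha1: float = 1.0) -> float:
    """Estimated probability of a symbol in a state:
    (alpha1 * X_sym^st + 1) / (alpha1 * X_st + N)."""
    return (alpha1 * x_sym_st + 1.0) / (alpha1 * x_st + n)


def p_state(x_st: int, x_total: int, s: int, alpha2: float = 1.0) -> float:
    """Estimated probability of a state:
    (alpha2 * X_st + 1) / (alpha2 * X_total + S)."""
    return (alpha2 * x_st + 1.0) / (alpha2 * x_total + s)
```

With N = 256 and α1 = 1, a state observed only four times and always followed by the same symbol yields an estimate of 5/260, about 0.019, for that symbol, which is barely above the uniform value 1/256. This is the sense in which small blocks with many states leave the estimates dominated by the additive constants rather than by the data.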
In practical data compression technologies, a state-based approach is rarely applied. In many cases it is assumed that S=1, and calculations are based on a simplified formula:
$$p_{sym}\left(\{X_j\}_{j \in \{1,\ldots,N\}}\right) = \frac{\beta \cdot X_{sym} + 1}{\beta \cdot X_{\Sigma} + N}, \qquad \beta \geq 0.$$
Such simplification leads to a performance tradeoff: it reduces computational complexity but may negatively affect the precision of block splitting.
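A minimal sketch of the simplified estimate for S = 1 follows. It assumes a byte alphabet (N = 256) and β = 1 by default; the name `symbol_probabilities` is hypothetical.

```python
from collections import Counter


def symbol_probabilities(block: bytes, n: int = 256,
                         beta: float = 1.0) -> list:
    """Return p_sym = (beta * X_sym + 1) / (beta * X_total + N) for every
    symbol of the alphabet, using a single statistical state (S = 1)."""
    counts = Counter(block)
    denom = beta * len(block) + n
    return [(beta * counts.get(sym, 0) + 1.0) / denom for sym in range(n)]
```

Only one table of N counts is needed per block, which is the source of the reduced computational cost; the price is that any dependence of a symbol on its context is ignored.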
The following two comparison functions are usually employed in practice:
$$F(B,b) = 1 - \frac{\displaystyle\sum_{st=1}^{S} p_{st}\left(\{b_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right) \sum_{sym=1}^{N} p_{sym}^{st}\left(\{b_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right) \cdot \ln\left(p_{sym}^{st}\left(\{b_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right)\right)}{\displaystyle\sum_{st=1}^{S} p_{st}\left(\{B_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right) \sum_{sym=1}^{N} p_{sym}^{st}\left(\{B_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right) \cdot \ln\left(p_{sym}^{st}\left(\{B_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right)\right)},$$
$$F(B,b) = \frac{\displaystyle\sum_{st=1}^{S} \left(B_{st} + b_{st}\right) \sum_{sym=1}^{N} \left(B_{sym}^{st} + b_{sym}^{st}\right) \ln\left(p_{sym}^{st}\left(\{B_j^i + b_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right)\right)}{\displaystyle\sum_{st=1}^{S} B_{st} \sum_{sym=1}^{N} B_{sym}^{st} \ln\left(p_{sym}^{st}\left(\{B_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right)\right) + \sum_{st=1}^{S} b_{st} \sum_{sym=1}^{N} b_{sym}^{st} \ln\left(p_{sym}^{st}\left(\{b_j^i\}_{i \in \{1,\ldots,S\},\, j \in \{1,\ldots,N\}}\right)\right)}.$$
The first function is a comparison of empirical entropies of blocks B and b. If entropies are close, blocks B and b are considered to be parts of one solid block. The second function is a comparison of two different estimations of the size of the compressed representation of blocks B and b. One estimation (in the numerator) assumes that blocks are compressed together, while the other estimation (in the denominator) assumes that blocks are compressed separately. The first function is easier to calculate, but the result of the comparison does not guarantee efficient block separation in terms of final compression efficiency. The second function, while requiring more computational resources, is more suitable for practical compression applications. Nevertheless, because of inaccurate probability estimation for small blocks b, splitting may give unpredictable results.
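Both kinds of comparison can be illustrated for the single-state case (S = 1) with the sketch below. All names are hypothetical, and a byte alphabet (N = 256) with β = 1 is assumed. `entropy_compare` follows the first function as written for S = 1. `size_ratio_compare` is a simplified variant of the second function, not the formula itself: it drops the per-state weighting factors and keeps only the ratio of the estimated code length when the blocks are modeled together to the sum of the estimated code lengths when they are modeled separately.

```python
import math
from collections import Counter

N = 256  # assumed alphabet size (bytes)


def probs(block: bytes, beta: float = 1.0) -> list:
    """Smoothed symbol probabilities for S = 1."""
    counts = Counter(block)
    denom = beta * len(block) + N
    return [(beta * counts.get(s, 0) + 1.0) / denom for s in range(N)]


def entropy(block: bytes) -> float:
    """Empirical entropy (in nats) of the smoothed estimate."""
    return -sum(p * math.log(p) for p in probs(block))


def entropy_compare(big: bytes, small: bytes) -> float:
    """First comparison for S = 1: one minus the ratio of entropies."""
    return 1.0 - entropy(small) / entropy(big)


def code_length(block: bytes) -> float:
    """Estimated compressed size (in nats) under the block's own model."""
    p = probs(block)
    return -sum(math.log(p[s]) for s in block)


def size_ratio_compare(big: bytes, small: bytes) -> float:
    """Simplified size comparison: joint cost over separate costs."""
    together = code_length(big + small)
    separately = code_length(big) + code_length(small)
    return together / separately
```

Under these assumptions, a ratio below one suggests that compressing the two blocks together is estimated to be cheaper than compressing them separately, while a ratio above one suggests a split. The entropy comparison is cheaper to evaluate but, for a small block b, is strongly influenced by the smoothing constants, which is consistent with the unpredictability noted above.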