The invention relates generally to algorithmic information theory and, more specifically, to the identification of sequences of interest in a given data series.
In various applications, such as information theory, data compression, and intrusion detection, it may be desirable to identify sequences of interest within a data series. It may be advantageous to identify such sequences of interest in order to extract meaningful information from the identified sequences or to allow easier manipulation or analysis of the data series. For example, identification of repetitive sequences in a data series may allow easier or more effective compression of the data.
Similarly, in the field of genetics, biologically interesting phrases or sequences in a genome, such as the human genome, may have higher redundancy than non-meaningful phrases, as nature tends to repeat or emphasize important sequences more frequently than unimportant sequences. However, for genomes that are known or are being sequenced, the purposes of different parts of those genomes are currently unknown. Hence, the identification of meaningful or interesting sequences within a genome may pose a challenge.
Furthermore, it is increasingly difficult to identify meaningful sequences of interest using traditional techniques. In particular, vast amounts of data, such as genome data, are difficult to analyze in a computationally efficient manner using traditional techniques. In addition, existing computational techniques for determining meaningful information may be inadequate for the identification of sequences of interest. For example, existing techniques may fail to identify DNA sequences in a genome that are known to be of interest, such as sequences experimentally demonstrated to be of interest. Hence, it may be desirable to develop techniques that efficiently and accurately recognize sequences of interest within a data series.