The present invention relates to visualization of data, and more particularly, but not exclusively, relates to the visualization of biopolymer sequences comprised of different monomer unit types.
Recent success in the whole-genome shotgun sequencing effort has resulted in new opportunities and challenges in bioinformatic research. The genome of an organism is defined by one or more polynucleotide sequences. These sequences are typically comprised of four different types of organic nucleotide bases (Adenine, Cytosine, Guanine, and Thymine), with the total number of nucleotide bases ranging from hundreds of thousands in bacteria to a few billion in human beings. Adenine, Cytosine, Guanine, and Thymine are commonly represented by the letters A, C, G, and T, respectively. Polynucleotide sequences are typically representative of ribonucleic acid (RNA) or deoxyribonucleic acid (DNA) molecules; in RNA, Uracil (U) takes the place of Thymine.
With the exception of most microbes, much of the genome of a given organism is usually nonfunctional. Nonfunctional sequence segments are often referred to as introns. The remaining functional sequence segments, generally referred to as exons, provide genes to encode various sequences of amino acids, producing corresponding proteins (polypeptides). Specifically, three consecutive nucleotide bases of a genetic sequence (a codon) encode one amino acid of a protein. Such proteins, generally in the form of enzymes, serve as the building blocks of various biologic processes.
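By way of nonlimiting illustration, the codon-to-amino-acid correspondence described above can be sketched as follows. The sketch assumes the standard genetic code, a DNA coding strand read from its first base, and single-letter amino acid codes; the names `CODON_TABLE` and `translate` are hypothetical and chosen only for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
# "*" marks a stop codon. Single-letter amino acid codes are assumed.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, one codon (three bases) at a time.

    Translation stops at the first stop codon; trailing bases that do not
    form a complete codon are ignored.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3].upper()]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


if __name__ == "__main__":
    # ATG -> M (Methionine), GCC -> A (Alanine), TGA -> stop
    print(translate("ATGGCCTGA"))
```

In this sketch, each group of three bases is looked up independently, which reflects the one-codon-to-one-amino-acid relationship noted above.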
As genome and/or protein sequence information accumulates, there is an increasing interest in different ways to analyze such information. One area of particular interest is the comparison of different genomes to identify genes that are responsible for different characteristics of the corresponding organisms. To perform such comparisons, the genomes are aligned with respect to common reference points.
While many organisms, including human beings, have genomes arranged in an open (linear) form with ends that commonly serve as reference points for such alignments, other organisms, including various bacteria, have genomes arranged in a closed loop. Because a closed loop has no natural starting point, this arrangement can make it relatively more difficult to perform alignments of bacterial genomes. In many cases, minor genomic variations between two bacterial strains may reflect significant differences in their overall characteristics. For example, even though Escherichia coli (E. coli) strain O157 shares over 90% sequence homology with E. coli strain K-12, the former can be fatal while the latter is generally harmless to humans.
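The reference-point difficulty posed by a closed loop can be illustrated with a minimal sketch. It assumes two short sequences that are exact rotations of one another, which is far simpler than the situation for real genomes that differ by mutations; the function name `rotation_offset` is hypothetical.

```python
from typing import Optional


def rotation_offset(a: str, b: str) -> Optional[int]:
    """Return the offset at which circular sequence `a` reads the same as `b`.

    A closed-loop sequence written out linearly can start at any position,
    so every rotation of `a` describes the same loop. Doubling `a` makes
    each rotation appear as a contiguous substring. Returns None when `b`
    is not a rotation of `a`.
    """
    if len(a) != len(b):
        return None
    offset = (a + a).find(b)
    return offset if offset != -1 else None


if __name__ == "__main__":
    # The same loop, written out starting from two different positions.
    print(rotation_offset("GATTACA", "TACAGAT"))
```

When the two sequences are not identical, no such exact offset exists, and a common reference point has to be established by other means.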
Various software tools based on dynamic programming and hashing, such as BLAST and FASTA, have been developed to align sequences. These tools are sometimes used to compare sequences with tens of thousands of biomonomer units, as might be found in a single protein or intron segment. However, the performance of such tools often degrades significantly when whole genomes with millions of nucleotides are involved. Furthermore, these tools generally compare only two sequences at a time.
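The hashing approach mentioned above can be sketched in simplified form as follows. This is not the implementation used by BLAST or FASTA; it is a minimal illustration, under the assumption of exact matching of fixed-length words (k-mers), of how a hash table can locate candidate regions shared by two sequences. The function name `shared_kmers` and the parameter `k` are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple


def shared_kmers(a: str, b: str, k: int) -> List[Tuple[int, int]]:
    """Find exact k-length words shared by sequences `a` and `b`.

    Every k-mer of `a` is stored in a hash table keyed by the word, and
    each k-mer of `b` is then looked up in that table. Returns a list of
    (position in a, position in b) pairs, ordered by position in `b`.
    """
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)

    matches = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            matches.append((i, j))
    return matches


if __name__ == "__main__":
    # "TACG" occurs at position 3 of the first sequence and 1 of the second.
    print(shared_kmers("ACGTACGT", "TTACGA", 4))
```

Each shared word is only a candidate starting point; a tool of this general kind would typically refine such candidates with a more expensive comparison, such as dynamic programming, and the pairwise nature of the lookup reflects the two-sequences-at-a-time limitation noted above.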
Besides biopolymer sequences, other complex polymers comprised of different monomer unit types could also benefit from different evaluation techniques. Moreover, other types of data having an “a priori” order, such as time-series data to name just one example, can benefit from techniques to process large data sequences.
Thus, there is a need for better ways to focus existing analytic tools on those parts of a genome sequence that are of interest. More generally, there is an ongoing need for better ways to evaluate complex polymer sequences comprised of different monomer unit types and/or other data having an a priori order.