The genetic material of living organisms is composed of very long polymers of chemical sub-units known as nucleotides. The inheritable genetic material of bacteria and all multicellular organisms is made up of the polynucleotide deoxyribonucleic acid, or DNA, while the polynucleotide ribonucleic acid, or RNA, serves an intermediary function in DNA activity and also serves as the inheritable genetic material of certain viruses.
DNA may be modeled as a very long chain in which each link in the chain is one of four nucleotide sub-units, adenine, thymine, cytosine, or guanine, which are conventionally represented by the letters A, T, C, and G, respectively. RNA is composed of a similar chain of four nucleotides, which are the same as those of DNA except that uracil (U) substitutes for thymine (T). DNA is natively double stranded, with each A on one strand being opposite a T on the other strand, and vice versa, and with each C on a strand being opposite a G on the other strand, and vice versa. RNA, which is usually single stranded, is typically made from DNA by a similar matching process, with U substituted for T. It is thus possible, and is conventional in the art, to represent nucleotide sequences, whether in print or in computer-generated storage or display, by a sequence of single letters, e.g. "CTTAGATGCCTAC" etc.
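The pairing rules just described can be illustrated with a short Python sketch. The function names `complement` and `transcribe` are hypothetical, chosen only for this illustration, and the sketch assumes upper-case single-letter sequences. It works position by position and deliberately ignores strand direction.

```python
# Pairing rules from the text: A opposite T, C opposite G.
DNA_PAIRS = str.maketrans("ATCG", "TAGC")


def complement(dna: str) -> str:
    """Return the letter opposite each nucleotide, in the same left-to-right order."""
    return dna.translate(DNA_PAIRS)


def transcribe(template: str) -> str:
    """Return the RNA letters matched against a DNA strand, with U in place of T."""
    return complement(template).replace("T", "U")


if __name__ == "__main__":
    strand = "CTTAGATGCCTAC"      # example sequence from the text
    print(complement(strand))     # opposite DNA strand, position by position
    print(transcribe(strand))     # RNA matched against the same strand
```

Applying `complement` twice returns the original strand, which is one quick check that the pairing table is symmetric.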
In living organisms, it is one of the main functions of DNA to provide a code for the production of proteins. Proteins are also biological chains, or polymers. In proteins, the sub-units in the chain are known as amino acids. There are twenty amino acids which are used by living organisms to make proteins. These twenty amino acids are listed in Appendix 1 hereto. Amino acids are conventionally referred to in one of two ways, a three letter code or a single letter code. Both the conventional three letter code and the conventional single letter code for each amino acid are listed in Appendix 1.
The process of using DNA to make proteins begins with making a form of RNA, referred to as mRNA (for messenger RNA), from a portion of the long DNA strand. Then the mRNA is used as a template in the cell to join or link amino acids into proteins. Each set of three nucleotides of the mRNA specifies one amino acid of the protein. The three nucleotides in the mRNA are, of course, specified exactly by the sequence of nucleotides in the DNA, and the three nucleotides in the DNA which correspond to the particular amino acid are referred to jointly as a codon. The particular amino acid specified by each possible codon is well known and available in printed tables.
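The codon-to-amino-acid lookup described above can be sketched in Python as follows. The sketch assumes that the printed tables referred to correspond to the standard genetic code, written here in the common compact layout in which codons are enumerated with the bases in the order T, C, A, G. The names `CODON_TABLE` and `translate` are hypothetical, the output uses the single letter amino acid codes, and `*` marks a stop codon.

```python
from itertools import product

# Standard genetic code, assumed here; codons are enumerated TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate successive DNA codons into single letter amino acid codes.

    Trailing nucleotides that do not fill a complete codon are ignored.
    """
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


if __name__ == "__main__":
    print(translate("CTTAGATGCCTAC"))  # codons CTT AGA TGC CTA, final C unused
```

For the example sequence used earlier, the four complete codons CTT, AGA, TGC, and CTA translate to leucine, arginine, cysteine, and leucine.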
As more and more genes and other pieces of genetic material are analyzed and sequenced, the amount of data composed of the nucleotide and protein sequences known to science has grown enormously. It has therefore become common to store nucleotide and protein sequences on computers to make use of the ability of computers to analyze, match, or perform other useful manipulations with the nucleotide or protein sequences. For such activities to be useful to society, the output of the computerized processes must result in a representation accessible to people. Typically, computers communicate their output to their users through displays, such as CRT displays, and through hard copy output, such as that produced by a printer or plotter.
One useful form of such a computer display or hard copy print-out is a listing in which nucleotide sequences, particularly DNA sequences, are matched with protein sequences. It is most common to represent DNA sequences by the single letter nucleotides and to represent amino acids by the three letter abbreviations. The three letter abbreviations are preferred for amino acids, since they are better recognized by users. Shown in FIG. 1 is a representation of such a sequence as it conventionally would appear in the prior art.
In FIG. 1, the sequences of nucleotides and amino acids are presented in a so-called monospace font. This terminology implies that each character of the font takes up just the same width on the page, or the CRT screen, as any other character. So, for example, an "I" is as wide as an "M" or a "W." Since there are three letters for each codon and three letters for each amino acid, the sequences align perfectly. Unfortunately, this makes the amino acid sequence, in particular, relatively difficult to read and analyze due to the lack of spacing between the letters.
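The alignment property described above can be illustrated with a minimal Python sketch. It assumes a layout of one row of nucleotides with the three letter abbreviations directly beneath, which is an assumption about the figure rather than a reproduction of it. The function name `unspaced_rows` is hypothetical, and the abbreviations are supplied by the caller.

```python
def unspaced_rows(dna: str, residues: list[str]) -> tuple[str, str]:
    """Return (nucleotide row, amino acid row) for an unspaced monospace listing.

    Each residue must be a three letter abbreviation so that it occupies
    exactly the same number of character cells as its codon.
    """
    if any(len(residue) != 3 for residue in residues):
        raise ValueError("each abbreviation must be exactly three letters")
    covered = 3 * len(residues)
    if covered > len(dna):
        raise ValueError("more amino acids than complete codons")
    return dna[:covered], "".join(residues)


if __name__ == "__main__":
    top, bottom = unspaced_rows("CTTAGATGCCTAC", ["Leu", "Arg", "Cys", "Leu"])
    print(top)     # codons run together
    print(bottom)  # abbreviations run together, one per codon
```

In a monospace font the two rows have equal length, so the n-th abbreviation sits under the n-th codon, but nothing separates one abbreviation from the next.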
Thus, with the advent of desktop publishing and other more sophisticated forms of data and graphic representations and features in computers, two subtle problems arise in the use and display of nucleotide and protein sequences. One problem is that many computer users prefer to create output products in one of the many available fonts which provide a pleasing type-like, as opposed to typewriter-like, appearance in the display or printed copy. This is impractical in the display of nucleotide sequences since, in most of the fonts attractive for creating a print-style appearance, the characters of the font are of a variable or proportional width. Unfortunately, the use of a proportional width font prevents the nucleotide and amino acid sequences from properly aligning on the display screen or printed page. While this difficulty can be avoided by use of a monospace font, such as in FIG. 1, in which each character is the same width, the typical monospace fonts available, such as the widely used Courier, are not considered very aesthetically appealing.
The second difficulty arises in the representation of the amino acids in the sequence of the protein. If the three letter abbreviations are used, to facilitate user recognition of the amino acids, the listing appears crowded and difficult to read, since each abbreviation for an amino acid takes up precisely the space of the three-nucleotide codon. The three letter amino acid abbreviations thus run continuously, with no breaks between the amino acids, as can be seen in FIG. 1.
One frequently used solution for this problem is to list the nucleotide sequence with spaces between the codons, and to leave corresponding spaces between the three letter amino acid abbreviations so that the codons and amino acids correspond. While this strategy makes the amino acid abbreviations more readable, it has the disadvantage of reducing by one quarter the amount of information which can be displayed in the same display space. Another drawback of this strategy arises from the fact that DNA sequences can have different "reading frames," which refer to the alternative sets of codons possible based on the same sequence, depending on where the codons are deemed to start and in which direction the coding proceeds. If the spacing strategy is used, four of the other five possible reading frames cannot be represented by amino acid sequences corresponding to the DNA sequence.
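The spacing strategy, its cost in display capacity, and the six reading frames can be sketched in Python as follows. All function names are hypothetical and chosen for illustration. The sketch assumes upper-case DNA letters, and it assumes that the three frames in the opposite direction are read from the reverse complement of the listed strand.

```python
PAIRS = str.maketrans("ATCG", "TAGC")


def codons(dna: str, offset: int = 0) -> list[str]:
    """Split a sequence into complete codons, starting at the given offset."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def spaced(units: list[str]) -> str:
    """Join codons (or three letter abbreviations) with one space between units."""
    return " ".join(units)


def nucleotides_per_line(width: int, use_spaces: bool) -> int:
    """Count the nucleotides that fit in a line of `width` monospace cells."""
    if not use_spaces:
        return width
    # Each codon needs 3 cells plus 1 separating space, except the last codon.
    return 3 * ((width + 1) // 4)


def reading_frames(dna: str) -> dict[tuple[str, int], list[str]]:
    """Return the codons of all six frames: 3 offsets on each of 2 directions."""
    reverse = dna.translate(PAIRS)[::-1]
    return {
        (direction, offset): codons(sequence, offset)
        for direction, sequence in (("+", dna), ("-", reverse))
        for offset in range(3)
    }


if __name__ == "__main__":
    sequence = "CTTAGATGCCTAC"
    print(spaced(codons(sequence)))
    print(nucleotides_per_line(80, False), nucleotides_per_line(80, True))
    for frame, units in reading_frames(sequence).items():
        print(frame, spaced(units))
```

On an 80-cell line, the unspaced listing holds 80 nucleotides while the spaced listing holds 60, which is the one-quarter reduction noted above. The spaces also fix the codon boundaries of a single frame, so codons that start at a different offset straddle the gaps.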