When studying molecular binding sites in DNA or RNA, it is conventional practice to align the sequences of several sites recognized by the same macromolecular recognizer and then to choose the most common base at each position to create a consensus sequence (see Davidson et al., 1983. Nature (London), 301, 468-470). Consensus sequences are difficult to work with and are not reliable when searching for new sites (Sadler et al., 1983b. Nucl. Acids Res. 11:2221-2231; Hawley & McClure, 1983. Nucl. Acids Res. 11:2237-2255).
This is partly because information is lost when the relative frequency of specific bases at each position is ignored. For example, the first position of Escherichia coli translational initiation codons has 94% Adenine ("A"), 5% Guanine ("G"), 1% Uracil ("U") and 0% Cytosine ("C"), a distribution that the consensus "A" does not represent precisely. To avoid this problem, four histograms can be made that record the frequencies of each base at each position of the aligned sequences. Such histograms can be compressed into a single curve by the use of a χ² function (Gold et al., 1981. Annu. Rev. Microbiol. 35, 365-403; Stormo et al., 1982. Nuc. Acids Res. 10, 2971-2996). Although these curves show where information lies in the site, they have several disadvantages: the χ² scale is not easily understood in simple terms; it is difficult to compare the overall information content of two different kinds of sites, such as ribosome binding sites and restriction enzyme sites; and χ² histograms are not directly useful in searching for new sites (Stormo et al. 1982 Nuc. Acids Res. 10, 2997-3011).
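The loss described above can be made concrete with a short calculation. The following is a minimal sketch, assuming the standard information-theoretic measure of 2 bits minus the Shannon uncertainty of the base frequencies, with the small-sample correction omitted; the function name and variable names are illustrative and do not come from the cited work.

```python
import math

def position_information(freqs):
    """Information in bits at one aligned position: log2(4) minus the
    Shannon uncertainty of the base frequencies (no sample-size correction)."""
    uncertainty = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    return math.log2(4) - uncertainty

# Frequencies quoted in the text for the first base of E. coli initiation codons.
first_base = {"A": 0.94, "G": 0.05, "U": 0.01, "C": 0.0}

info_bits = position_information(first_base)
consensus_base = max(first_base, key=first_base.get)

print(f"consensus base: {consensus_base}")
print(f"information at this position: {info_bits:.3f} bits")
```

The consensus reports only "A", which would correspond to a full 2 bits if A were invariant; the frequency data give about 1.63 bits under these assumptions, and the difference reflects the G and U start codons that the consensus discards.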
Many general methods exist for identifying sequence changes which are deleterious. However, these methods require experimentation in the laboratory. The most common method is the identification of a disease state and a corresponding genetic mutation in a particular sequence element. This method is quite labor intensive and requires that the mutation produce an identifiable phenotype. Another method uses restriction fragment length polymorphisms to identify alterations within the genome. This method is also experimental, and it can only detect alterations in the genome at restriction sites, whether or not a phenotype results.
The average information contained in a set of nucleic-acid binding sites can be calculated by using the methods of information theory, and this has been useful for understanding a number of genetic control systems (Schneider et al., 1986. J. Mol. Biol., 188, 415-431; Schneider & Stormo, 1989. Nuc. Acids. Res., 17, 659-674; Eiglmeier et al. 1989. Mol. Microb., 3, 869-878; Penotti, 1990. J. Mol. Biol., 213, 37-52; Penotti, 1991 J. Theor. Biol. 150, 385-420; Schneider & Stephens, 1990. Nuc. Acids. Res., 18, 6097-6100; Herman & Schneider, 1992. J. Bact., 174, 3558-3560; Gutell et al., 1992. Nuc. Acids. Res., 20, 5785-5795; Papp et al., 1993. J. Mol. Biol., 233, 219-230). However, no effective method yet exists for working with the information content of single sequences or for predicting the effect of changes in information content due to sequence alterations, whether through biological evolution or by genetic manipulation.
Information analysis of normal splice junctions reveals partially conserved nucleotide sequences that are not always reflected in the corresponding consensus sequence (Stephens & Schneider, 1992. J. Mol. Biol. 228:1124-1136). Information content may be represented by a sequence logo, which depicts the relative contribution of each position of the splice site and the relative frequencies of each nucleotide at every position (Schneider & Stephens, 1990. Nucl. Acids Res. 18:6097-6100). The logo illustrates the full range of normal variants in the splice junction.
The present invention is principally directed to binding sites on a sequence. In particular, the present "Walker" program enables a scientist or clinician to identify mutations within a nucleic acid binding site which are deleterious, without extensive experimentation. This method generates a model of the binding site which is called the R_i(b,l) weight matrix, which can then be used to evaluate other individual sites for their information content. The present invention allows one to analyze the effect on the binding site of changing a base at a particular position within the site.
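How a weight matrix of this kind might be built and applied can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: each weight is assumed to take the form 2 + log2 f(b,l), where f(b,l) is the frequency of base b at position l of the aligned sites; the small-sample correction is omitted; and a pseudocount (a simplification introduced here) keeps unseen bases finite. The aligned sites are hypothetical toy data.

```python
import math

BASES = "ACGT"

def build_weight_matrix(sites, pseudocount=0.5):
    """Return one {base: weight in bits} dict per aligned position.
    Assumed weight form: 2 + log2(frequency); pseudocount is illustrative."""
    matrix = []
    for l in range(len(sites[0])):
        column = [site[l] for site in sites]
        total = len(column) + pseudocount * len(BASES)
        matrix.append({
            b: 2.0 + math.log2((column.count(b) + pseudocount) / total)
            for b in BASES
        })
    return matrix

def individual_information(matrix, sequence):
    """Sum the weight of the base actually present at each position."""
    return sum(matrix[l][b] for l, b in enumerate(sequence))

# Hypothetical aligned sites, invented for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATAAT", "TATACT"]
matrix = build_weight_matrix(sites)

ri_typical = individual_information(matrix, "TATAAT")
ri_unrelated = individual_information(matrix, "GCGCGC")
print(f"TATAAT: {ri_typical:.2f} bits, GCGCGC: {ri_unrelated:.2f} bits")
```

With this toy alignment the sequence resembling the aligned sites scores well above zero and the unrelated sequence scores below zero, which is the kind of separation the matrix is meant to provide.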
The weight matrices of the present invention differ from the prior art in several respects. R_i values, which represent the sum of the weights for the base found at each position within a site, are on an absolute scale, rather than the relative scale found in the prior art. R_i = 0 is a cutoff point for functional sites within the present invention. This feature is lacking in both Staden's method (1984 Nuc. Acids Res., 12:505-519) and Berg & von Hippel's method (1987 J. Mol. Biol., 193:723-750; 1988 J. Mol. Biol., 200:709-723; 1988 Nuc. Acids Res., 16(11):5089-5105). Hence, these methods draw no distinction between polymorphisms and mutations.
Moreover, Berg & von Hippel's method relies upon the consensus sequence as the ideal, i.e., the best binding sequence. Therefore, Berg & von Hippel had no way of distinguishing a polymorphism from a deleterious mutation.
In addition, unlike the prior art (Berg & von Hippel's statistical-mechanical theory, in particular), no assumption about the relationship between energy and information is required to obtain R_i in the present invention. The statistical-mechanical approach assumes that the energy of binding, "discrimination energy", is equal to the information contained within a recognition sequence. This assumption does not allow for a situation where more than one protein could bind to a particular site and thus increase the apparent information contained within that site.
Further, the R_i method described in the present invention is much more sensitive to sequence changes than the widely used consensus sequence method. The consensus sequence destroys data by taking the most frequent base at every position as the base used in the consensus model, whereas the R_i method does not alter the frequency data and so can be used to detect subtle effects.
One object of the present invention relates to the use of individual information content of the site and its comparison with the overall distribution of individual information in a set of binding sites, to determine whether a substitution is a polymorphism or a mutation.
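One way such a comparison could be carried out is sketched below. The text specifies only that an individual value is compared with the distribution over the set of sites and that zero serves as a cutoff for functional sites; the z-score used here, the function name, and the example values are assumptions made for illustration.

```python
import statistics

def compare_to_population(site_ri_values, variant_ri):
    """Place a variant's individual information (bits) within the
    distribution of values observed for known sites."""
    mean = statistics.mean(site_ri_values)
    sd = statistics.stdev(site_ri_values)  # sample standard deviation
    return {
        "z_score": (variant_ri - mean) / sd,
        "above_zero": variant_ri > 0,  # zero is the cutoff named in the text
    }

# Hypothetical individual information values (bits) for known sites.
known_site_ri = [6.0, 8.0, 10.0, 12.0, 14.0]

mild_change = compare_to_population(known_site_ri, 9.0)
severe_change = compare_to_population(known_site_ri, -1.5)
print(mild_change)
print(severe_change)
```

In this sketch a variant that stays positive and within the spread of known sites would be a candidate polymorphism, while one that falls below zero would be a candidate mutation; any real threshold other than zero would need to come from the data.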
Another object of the present invention relates to designing binding sites to adjust the activity of the site. The present invention further relates to a computer system capable of determining the individual information content of a binding sequence and identifying new binding sequences.
Yet another object of the present invention relates to the use of individual information content to determine the effect of a particular position change in a sequence acting as a binding site.
Another object of the invention is to use the "Ri" and "Walker" computer programs to display the reaction of a binding macromolecule at every position in a sequence and to determine the change in information content when a particular position within a binding site is altered.
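The two operations named in this object, evaluating every position in a sequence and measuring the effect of a single base change, can be sketched as follows. This is not the "Ri" or "Walker" program itself; the three-position weight matrix holds hypothetical values in bits, and the function names are invented for the example.

```python
# Hypothetical weight matrix in bits, one dict per position of the site.
WEIGHTS = [
    {"A": 1.5, "C": -2.0, "G": -0.5, "T": -2.0},
    {"A": -2.0, "C": -2.0, "G": -2.0, "T": 1.7},
    {"A": -1.0, "C": -2.0, "G": 1.6, "T": -2.0},
]

def ri(window):
    """Individual information of one candidate site (sum of weights)."""
    return sum(WEIGHTS[l][b] for l, b in enumerate(window))

def scan(sequence):
    """Evaluate the matrix at every start position along the sequence."""
    width = len(WEIGHTS)
    return [(i, ri(sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

def substitution_effect(site, position, new_base):
    """Change in information (bits) when one base of the site is replaced."""
    mutated = site[:position] + new_base + site[position + 1:]
    return ri(mutated) - ri(site)

profile = scan("CCATGCC")
best_start, best_ri = max(profile, key=lambda entry: entry[1])
delta = substitution_effect("ATG", 0, "G")

for start, value in profile:
    print(f"position {start}: {value:+.1f} bits")
print(f"A->G at position 0 changes the site by {delta:+.1f} bits")
```

Because the weights are additive across positions, the effect of a single substitution is simply the difference between the new and old weights at that position, so the change can be read directly from the matrix.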
Objects and advantages of the invention are set forth herein and will also be readily appreciated herefrom, or may be learned by practice of the invention. These objects and advantages are realized and obtained by means of the instrumentalities and combinations pointed out in the specification and claims.