1. Field of the Invention
The present invention relates to a method for extracting functional information from aligned protein sequences, which can identify functional variance even in biased datasets. Major applications include, but are not limited to, the design of multivalent vaccines, targets for drug design, novel enzymes, and diagnostic kits for differentiating infectious organisms.
2. Description of the Related Art
Two kinds of information gleaned from aligned sequences of protein families are most useful. The first is the set of absolutely conserved residues, which are usually those that maintain the structure of the protein and its primary functions. The second is variance. Variance can arise at specific positions in a random fashion, or can represent a true change that may correlate with an alteration in phenotype or activity. The problem in dealing with biological datasets, such as sequences of viral or microbial genomes, is that they often have a pronounced bias due to unequal distribution. This unequal distribution can arise from non-uniform sampling. For example, there may be many closely related sequences from one epidemic, but only a few from normal infections in a year when the virus had a less lethal phenotype.
Unbiased data reduction methods are needed to make practical use of large volumes of sequence data. To design vaccines, or protein targets for drug design, it may be necessary to analyze both the conservation and the variance in very large numbers of sequences. In practice, this is often done by determining a reference consensus sequence that reflects the most commonly occurring amino acid, or type of amino acid, in a given column of a sequence alignment. Conventional methods for calculating consensus sequences cannot account for dataset bias, as they determine the amino acid that occurs most frequently, thus eliminating information on variants at a given position. Even when such averaging is done over a closely related series of sequences, numerical averaging can eliminate important information on the functional importance of substitutions that conserve the physicochemical properties at a position that may be essential for the function or fold of the protein. While some methods for calculating consensus sequences take into account amino acid groupings according to charge, size, or hydrophobicity, one-dimensional averaging methods cannot deal with highly variant positions, where the underlying conserved physicochemical properties are less obvious.
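The conventional approach described above, in which the most frequent residue is taken at each alignment column, can be illustrated with a minimal sketch. It is not the method of the present invention. The function names `majority_consensus` and `column_counts` and the toy sequences are hypothetical, chosen only to show how an over-sampled group of near-identical sequences can dominate the consensus and hide a minority variant.

```python
from collections import Counter


def majority_consensus(aligned):
    """Return the most frequent residue in each column of an alignment.

    `aligned` is a list of equal-length strings (one per sequence).
    Ties are broken by first-seen order, as Counter.most_common does.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(column)
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


def column_counts(aligned, index):
    """Count the residues observed at one alignment column."""
    return Counter(seq[index] for seq in aligned)


# Hypothetical toy data: eight identical sequences sampled from one
# epidemic and two from sporadic infections in another year.
epidemic = ["MKTLE"] * 8
sporadic = ["MKSLD", "MKSLD"]
alignment = epidemic + sporadic

consensus = majority_consensus(alignment)
variants_at_position_3 = column_counts(alignment, 2)
```

In this toy case the consensus equals the epidemic sequence, and the S and D variants carried by the sporadic sequences appear only in the per-column counts, not in the consensus itself.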
There is a recognized need in the art for improved methods to determine where sequence variance can indicate a more severe disease or significantly alter phenotype and function. Specifically, the prior art lacks unbiased data reduction and computational methods for calculating consensus sequences based on the multidimensional physicochemical properties of amino acids. Such methods are essential for designing novel proteins that can be used for multivalent subunit vaccines or as targets for drug design. The present invention fulfills this longstanding need and desire in the art.