The identification of sequence homology between an unknown biopolymer test sample and a known gene or protein often provides the first clues about the function and/or the three-dimensional structure of a protein, or the evolutionary relatedness of genes or proteins. Because of the recent explosion in the amount of DNA sequence information available in public and private databases as a result of the Human Genome Project and other large-scale DNA sequencing efforts, the ability to screen newly discovered DNA sequences against databases of known genes and proteins has become a particularly important aspect of modern biology.
Generally, the sequence comparison problem may be divided into two parts: (1) alignment of the sequences and (2) scoring of the aligned sequences. Alignment refers to the process of introducing "phase shifts" and "gaps" into one or both of the sequences being compared in order to maximize the similarity between the two sequences, and scoring refers to the process of quantitatively expressing the relatedness of the aligned sequences.
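The scoring step can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned, with "-" marking a gap; the function name `score_alignment` and the match, mismatch, and gap values are hypothetical choices made for illustration, not values taken from any particular method.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Score two pre-aligned sequences of equal length.

    '-' marks a gap. The scoring values are illustrative assumptions.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap        # a gap was introduced at this column
        elif x == y:
            score += match      # identical monomers
        else:
            score += mismatch   # different monomers
    return score


# One gap has been introduced into the second sequence so that the
# remaining six positions line up with the first sequence.
print(score_alignment("GATTACA", "GA-TACA"))  # 4  (six matches, one gap)
```

Without the gap, the two sequences would have different lengths and could not be compared column by column, which is why alignment has to precede scoring.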
Existing sequence comparison processes may be divided into two main classes: global comparison methods and local comparison methods. In global comparison methods, the two sequences are aligned and scored over their entire lengths in a single operation (Needleman and Wunsch). In local comparison methods, only highly similar segments of the two sequences are aligned and scored, and a composite score is computed by combining the individual segment scores, e.g., the FASTA method (Pearson and Lipman), the BLAST method (Altschul) and the BLAZE method (Brutlag).
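The difference between the two classes can be sketched with two small dynamic-programming routines. `global_score` follows the Needleman-Wunsch recurrence. `local_score` uses a Smith-Waterman-style recurrence as a simple stand-in for local comparison: it is not FASTA, BLAST or BLAZE, and it reports only the single best segment rather than a composite of several segment scores. The function names and the scoring values are assumptions made for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of a against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over any pair of segments of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at zero lets a segment start anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# A shared core (GATTACA) flanked by unrelated bases.
x = "AAAGATTACATTT"
y = "CCCGATTACAGGG"
print(global_score(x, y))  # 1  (the flanking mismatches are penalized)
print(local_score(x, y))   # 7  (only the shared core is scored)
```

The global score is pulled down by the unrelated flanks, while the local score reflects only the highly similar segment.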
Existing alignment-based similarity scoring methods are problematic in applications where a high degree of sensitivity is required, i.e., where very similar sequences are being compared, e.g., two 1500-base 16S rDNA sequences differing by only 1-5 bases. An alignment-based similarity score, especially one based on local alignments such as FASTA (Pearson and Lipman) or BLAST (Altschul), will tend to emphasize the similarity of sequences and overlook small differences between them. In applications where small differences are critical, e.g., distinguishing the 16S rRNA sequences of E. coli K-12 (benign) and E. coli O157:H7 (pathogenic), it is crucial to be able to detect small differences between sequences rather than similarities.
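The arithmetic behind this sensitivity problem can be sketched as follows. The sequences below are synthetic stand-ins rather than real 16S data, and the helper names `percent_identity` and `differences` are hypothetical. The sketch assumes the two sequences are already aligned without gaps.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * same / len(a)


def differences(a, b):
    """List (position, base_in_a, base_in_b) for every mismatch."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Synthetic 1500-base reference and a variant with three substitutions.
reference = "ACGT" * 375
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
variant = list(reference)
for pos in (10, 700, 1400):
    variant[pos] = swap[variant[pos]]
variant = "".join(variant)

print(round(percent_identity(reference, variant), 3))  # 99.8
print(differences(reference, variant))
# [(10, 'G', 'T'), (700, 'A', 'C'), (1400, 'A', 'C')]
```

Over the 1-5 base range mentioned above, the identity of two 1500-base sequences moves only between about 99.93% and 99.67%, so a similarity-oriented score barely changes, whereas an explicit list of differences makes each substitution visible.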
An additional shortcoming of existing similarity scoring methods is that they fail to take into account the polymorphic nature of the sequences being compared, i.e., the fact that more than one monomer unit may be present in a given sequence at a given position, and that the proportion of each monomer at that position may vary such that a minor component may go undetected. Such polymorphisms can arise when the sequencing template is a polymorphic multicopy gene which has been amplified by PCR. For example, consider a set of sequences which are polymorphic at a position m, e.g., sequences derived from a sample including 10 copies of a polymorphic gene. Furthermore, assume that the polymorphism is such that in 8 of the copies of the gene the nucleotide at position m is an A and in the remaining 2 copies the nucleotide is a C. Thus, in an ideal sequencing experiment, each of the members of the set would show a signal having an 80% A component and a 20% C component at position m. In reality, however, many automated sequencing methods cannot reliably detect the presence of a 20% minor component. In such a case, the set of known sequences (the basis set) would show only an A nucleotide at position m, while in truth 20% of the polymorphic genes have a C at that position. Using existing similarity scoring methods, position m would be deemed to be a non-match, i.e., existing methods would erroneously conclude that a test sequence that included a C at position m was not a member of the set of known sequences.
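The example above can be worked through in a short sketch. The function `called_bases` and the 25% detection threshold are hypothetical, chosen only so that a 20% minor component falls below the limit of detection; real instruments do not necessarily behave as a hard cutoff.

```python
from collections import Counter


def called_bases(copies, detection_threshold):
    """Return the bases whose fraction at one position reaches the threshold."""
    counts = Counter(copies)
    total = len(copies)
    return {base for base, n in counts.items() if n / total >= detection_threshold}


# Position m in 10 copies of the gene: 8 carry an A, 2 carry a C.
copies_at_m = ["A"] * 8 + ["C"] * 2

ideal = called_bases(copies_at_m, detection_threshold=0.0)      # ideal experiment
observed = called_bases(copies_at_m, detection_threshold=0.25)  # assumed limit

print(sorted(ideal))     # ['A', 'C']
print(sorted(observed))  # ['A']

# A test sequence carrying the minor allele at position m:
test_base = "C"
print(test_base in observed)  # False: strict matching reports a non-match
print(test_base in ideal)     # True: the base is genuinely present in the sample
```

The strict comparison against the observed call rejects a test sequence that in fact belongs to the set, which is the erroneous conclusion described above.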
Thus, what is needed is an alignment-based similarity scoring method (i) capable of quantitatively distinguishing very similar sequences and (ii) capable of taking into account the polymorphic nature of many biopolymer sequences in light of the inability of current sequencing technology to reliably detect a polymorphic nucleotide present as a minor component.