The human genome contains somewhere between 50,000 and 150,000 genes. Current efforts to identify protein drug targets focus on a small number of “drug-proven” protein families such as kinases, proteases, nuclear hormone receptors, transmembrane proteins, chemokines, and cytokines. These protein families are referred to as “drug-proven” because a number of proven drugs and validated, screened targets are based upon proteins found in these families. In order to maximize the number of new drugs brought to market (currently, only about 5% of drug development projects reach the market), many companies are directing their efforts toward identifying and characterizing novel members of drug-proven protein families through genome-wide sequence homology searching.
Sequence alignment methods are applicable to genomic and proteomic sequences and attempt to identify the function of a given sequence by detecting the similarity of a query sequence to other sequences of known structure or function. To the extent that one sequence is homologous to another, it may be expected that the two genes, gene products, or proteins share similar structural and functional characteristics. Accordingly, if the known gene expression product is a drug target, homology modeling methods may be used to identify the genes/cDNA sequences corresponding to novel potential drug targets. Further, to the extent the structure and function of the known target have been characterized by other techniques, such as determination of the target's biological function, its three-dimensional structure, or the presence/absence of active sites or protein-protein interaction sites, a similar structural and functional assignment may be made to the potential drug target.
Among their uses for target identification and characterization, sequence comparison methods may be used to determine gene function maps by comparing the sequences of complete cDNA copies or cDNA fragments of known function against genomic DNA. Gene function maps may be used for developing DNA-binding drugs which turn a gene or gene cluster on or off.
Homology modeling based upon sequence comparison methods may be used to identify and characterize the function, three-dimensional structure and active regions of potential protein drug targets by determining the sequence relatedness, or sequence homology, between complete cDNA copies or cDNA fragments of known protein drug targets and complete cDNA copies or cDNA fragments of unknown gene expression products.
Sequence comparison methods, such as the BLAST algorithms, Hidden Markov Models and the various Smith-Waterman based techniques, assign an alignment score based upon the sequence similarity of two sequences. An unnormalized, raw alignment score is calculated based upon residue substitution probabilities, residue insertion/deletion penalties and background residue probabilities. The highest scoring alignment between two sequences is referred to as the optimal alignment.
To ensure that an alignment score is statistically meaningful, i.e., to show how far a particular alignment score departs from the score expected from aligning two random sequences, it is necessary to normalize a raw alignment score. In the art, one of the most common measures of whether a raw alignment score is statistically meaningful is the p-value of the alignment score. The p-value of a raw alignment score, x, gives the probability of finding an alignment with score S of at least x for the alignment of two randomly selected sequences of the same lengths as the sequences which produced the alignment score x. It has been shown that when gaps are not allowed, and in the limit of large sequence lengths m and n, the p-value of score x may be represented as:
$$P(S \ge x) \approx 1 - \exp\left(-K m n\, e^{-\lambda x}\right) \tag{1}$$
where λ and K are scaling parameters. Karlin, S. and Altschul, S. F., Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes, Proc. Natl. Acad. Sci. USA, 87 (1990), pp. 2264-2268; Dembo, A., Karlin, S. and Zeitouni, O., Limit distribution of maximal non-aligned two-sequence segmental score, Ann. Prob., 22 (1994), pp. 2022-2039. These references and each other reference herein are hereby incorporated in their entirety as if fully set forth herein. Many computational experiments suggest that the same formula applies to the statistics of gapped sequence alignments. In this case, λ and K must be established from a large scale comparison of random sequences. P(S ≥ x) will be referred to throughout as P(x|K,λ).
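As an illustration of Equation 1, the following minimal Python sketch evaluates P(x|K,λ) for a given raw score. The function name `alignment_p_value` and the numeric values of K and λ in the usage lines are hypothetical and chosen only for demonstration; they are not parameters taken from any particular scoring scheme.

```python
import math


def alignment_p_value(x, K, lam, m, n):
    """Evaluate Equation 1: P(S >= x) ~ 1 - exp(-K*m*n*exp(-lam*x)).

    x    -- raw alignment score
    K    -- scaling parameter K
    lam  -- scaling parameter lambda
    m, n -- lengths of the two aligned sequences
    """
    # Expected number of random alignments scoring at least x.
    expected = K * m * n * math.exp(-lam * x)
    # -expm1(-E) equals 1 - exp(-E) but stays accurate when E is tiny.
    return -math.expm1(-expected)


# Hypothetical parameter values, for illustration only.
p_low = alignment_p_value(x=30, K=0.1, lam=0.3, m=300, n=300)
p_high = alignment_p_value(x=60, K=0.1, lam=0.3, m=300, n=300)
print(p_low, p_high)
```

A higher raw score yields a smaller p-value, and for very small p-values the result is close to the expected count K·m·n·e^(−λx) itself.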
A number of approximations have been employed to estimate P(S ≥ x). The latest version of PSI-BLAST, which is generally regarded as the most sensitive of the BLAST algorithms, pre-calculates the scaling parameters λ and K using Island statistics for a plurality of randomly generated sequence pairs of varying length, a plurality of substitution matrices and a plurality of gap penalties. The residue frequency in the random sequences is chosen to reflect background residue frequencies. For each query/template sequence pair, PSI-BLAST consults the look-up table and selects a particular set of pre-calculated λ and K parameters based upon the similarity in length of the query/template sequences to the randomly generated sequences and further based upon the identity of the gap scoring and substitution matrices employed for the template/query pair. While PSI-BLAST's look-up table is computationally efficient to generate, its efficiency comes at the cost of accuracy. More particularly, it assumes background residue frequencies, and the granularity in length sampling and gap penalties introduces further errors. Greater accuracy could be achieved if λ and K were determined for each query/template pair that is aligned, but only at the cost of substantially slowing a PSI-BLAST search.
Island Method
The Island method is a computationally efficient method for determining λ and K, and thereby evaluating Equation 1, from a plurality of Smith-Waterman matrices. Olsen, R., Bundschuh, R., and Hwa, T., Rapid assessment of extremal statistics for gapped local alignment, Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (AAAI Press, Menlo Park, Calif., 1999), pp. 211-222. The value of each cell in a Smith-Waterman matrix corresponds to the score of the highest scoring local alignment that ends at that particular cell. An ‘island’ consists of all those cells connected to a common anchor cell. The score assigned to an island is the maximum score of the cells that comprise the island.
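The island definition above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of Olsen et al.: it assumes a simple match/mismatch substitution score, a linear gap penalty, and an arbitrary tie-breaking order (diagonal, then up, then left). The function name `island_scores` and its default parameters are hypothetical.

```python
def island_scores(a, b, match=2, mismatch=-1, gap=2):
    """Return the score of every island in the Smith-Waterman matrix of a vs b.

    Assumptions (illustrative only): match/mismatch scoring, a linear gap
    penalty `gap` > 0, and ties broken in favor of the diagonal move.
    """
    m, n = len(a), len(b)
    # H[i][j]: score of the best local alignment ending at cell (i, j).
    H = [[0] * (n + 1) for _ in range(m + 1)]
    # anchor[i][j]: anchor cell of the island that cell (i, j) belongs to.
    anchor = [[None] * (n + 1) for _ in range(m + 1)]
    best = {}  # anchor cell -> maximum score seen on that island

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            candidates = [
                (H[i - 1][j - 1] + s, anchor[i - 1][j - 1]),  # diagonal
                (H[i - 1][j] - gap, anchor[i - 1][j]),        # up (gap)
                (H[i][j - 1] - gap, anchor[i][j - 1]),        # left (gap)
            ]
            score, source = max(candidates, key=lambda c: c[0])
            if score <= 0:
                continue  # cell stays at zero and belongs to no island
            H[i][j] = score
            # A positive cell reached from a zero cell starts a new island.
            anchor[i][j] = source if source is not None else (i, j)
            key = anchor[i][j]
            best[key] = max(best.get(key, 0), score)

    return sorted(best.values(), reverse=True)


print(island_scores("ACGT", "ACGT"))  # one island covering the diagonal
print(island_scores("AC", "CA"))      # two separate single-cell islands
```

Each positive cell inherits the anchor of the predecessor that produced its score, so all cells traceable to the same starting cell share one island, and the island score is the largest cell value among them.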
The Island method generates a large number of island scores from a plurality of Smith-Waterman matrices formed from either 1) aligning multiple randomly selected sequences; or 2) aligning multiple residue ‘shuffles’ of the same two sequences. Since Equation 1 becomes increasingly accurate for larger values of x, improved estimates of λ may be obtained by only considering those islands with a score at least c. Altschul, S. F., Bundschuh, R., Olsen, R., and Hwa, T., The estimation of statistical parameters for local alignment score distributions, Nucleic Acids Research, 29-2 (2001), pp. 351-361. Altschul et al. have shown that the maximum-likelihood estimate of λ for the case of discrete alignment scores may be expressed as:
$$\hat{\lambda} = \ln\left(1 + \frac{1}{\bar{S}_c}\right) \tag{3}$$
$$\bar{S}_c = \frac{1}{N} \sum_{i \in I_c} \left[ S(i) - c \right] \tag{4}$$
where S(i) is the score of the i'th island, $I_c = \{ i \mid S(i) \ge c \}$ and $N = |I_c|$.
For the case of continuous scores, such as found when aligning sequence profiles, $\hat{\lambda} = 1/\bar{S}_c$.
Altschul et al. have also shown that the maximum likelihood estimate for K may be expressed as:
$$\hat{K} = \frac{R_c\, e^{\hat{\lambda} c}}{A} \tag{5}$$
where $R_c$ is the number of islands with a score of at least c, $\hat{\lambda}$ is the estimate of λ obtained with the cutoff c, and A is the aggregate search area of the island search space. For example, if two sequences, of lengths m and n, were compared once, A = mn. If B such comparisons were made, A = Bmn. For the sake of simplicity, the following examples and discussion will assume continuous alignment scores and use $\hat{\lambda} = 1/\bar{S}_c$.
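The estimates of λ and K described above can be illustrated with a short Python sketch. It assumes that a list of island scores has already been collected and that the aggregate search area A is known; the function name `estimate_lambda_K`, its parameters, and the island scores in the usage lines are hypothetical and chosen only for demonstration.

```python
import math


def estimate_lambda_K(scores, c, area, discrete=True):
    """Estimate lambda and K from island scores at or above the cutoff c.

    scores   -- iterable of island scores S(i)
    c        -- score cutoff; only islands with S(i) >= c are used
    area     -- aggregate search area A (e.g. B*m*n for B comparisons)
    discrete -- True for the discrete-score estimate of lambda,
                False for the continuous-score estimate 1 / mean excess
    """
    excess = [s - c for s in scores if s >= c]
    if not excess:
        raise ValueError("no island reaches the cutoff c")
    mean_excess = sum(excess) / len(excess)  # mean of S(i) - c over islands >= c
    if mean_excess <= 0:
        raise ValueError("mean excess must be positive; lower the cutoff c")
    if discrete:
        lam = math.log1p(1.0 / mean_excess)  # ln(1 + 1 / mean excess)
    else:
        lam = 1.0 / mean_excess
    # K estimate: (number of islands >= c) * exp(lam * c) / A
    K = len(excess) * math.exp(lam * c) / area
    return lam, K


# Made-up island scores, for illustration only.
lam, K = estimate_lambda_K([3, 10, 12, 14, 16], c=10, area=1000)
print(lam, K)
```

Raising the cutoff c discards low-scoring islands, for which Equation 1 is less accurate, at the price of basing the estimate on fewer islands.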
The present invention relates to an improved method of performing Island statistic based normalization of alignment scoring. More particularly, the present invention relates to a heuristic that efficiently uses Island statistics to quickly determine the statistical significance or insignificance of a raw alignment score produced by aligning a first sequence, usually the query sequence, to a second sequence, the template sequence. Because the methods of the present invention do not require ‘look-up’ tables, a significant improvement in alignment sensitivity may be gained when a large database of template sequences is screened. Since the methods of the present invention are independent of the alignment scoring scheme, the sequence lengths and the sequence composition, the methods are generally applicable to any dynamic programming based alignment method that considers local alignments. Accordingly, the claimed methods may be used equally by BLAST, PSI-BLAST, FASTA, HMMER or Eidogen's STRUCTFAST method disclosed in U.S. patent application Ser. No. 09/905,176.