The present invention relates generally to database searches and, more particularly, to methods and apparatus for detecting sequence homology between a query sequence and sequences in a database in association with a given application, e.g., genetic research.
In the area of genetic research, the first step following the sequencing of a new gene is an effort to identify that gene's function. The most popular and straightforward methods to achieve that goal exploit the following biological fact: if two peptide stretches exhibit sufficient similarity at the sequence level (i.e., one can be obtained from the other by a small number of insertions, deletions and/or amino acid mutations), then they probably are biologically related. Examples of such an approach are described in A. M. Lesk, "Computational Molecular Biology," Encyclopedia of Computer Science and Technology; A. Kent and J. G. Williams editors, 31:101-165, Marcel Dekker, New York, 1994; R. F. Doolittle, "What we have learned and will learn from sequence databases," Computers and DNA, G. Bell and T. Marr editors, 21-31, Addison-Wesley, 1990; C. Caskey, R. Eisenberg, E. Lander, and J. Straus, "Hugo statement on patenting of DNA," Genome Digest, 2:6-9, 1995; and W. R. Pearson, "Protein sequence comparison and protein evolution," Tutorial of Intelligent Systems in Molecular Biology, Cambridge, England, 1995.
Within this framework, the question of getting clues about the function of a new gene becomes one of identifying homologies in strings of amino acids. Generally, a homology refers to a similarity, likeness, or relation between two or more sequences or strings. Thus, one is given a query sequence Q (e.g., the new gene) and a set D of well-characterized proteins and is looking for all regions of Q which are similar to regions of sequences in D.
The first approaches used for realizing this task were based on a technique known as dynamic programming. This approach is described in S. B. Needleman and C. D. Wunsch, "A General Method Applicable To The Search For Similarities In The Amino Acid Sequence Of Two Proteins," Journal Of Molecular Biology, 48:443-453, 1970; and T. F. Smith and M. S. Waterman, "Identification Of Common Molecular Subsequences," Journal Of Molecular Biology, 147:195-197, 1981. Unfortunately, the computational requirements of this method quickly render it impractical, especially when searching large databases, as is the norm today. Generally, the problem is that dynamic programming variants spend a good part of their time computing homologies which eventually turn out to be unimportant.
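By way of illustration only, the dynamic programming idea can be sketched in Python as a local alignment score in the style of Smith and Waterman. The function name and the match, mismatch and gap values below are assumptions chosen for the example, not values taken from the cited references. The nested loops visit every pair of positions, which is why the cost grows with the product of the two sequence lengths and becomes burdensome over a large database.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions (linear gap cost).
    """
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # current row of the DP table
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                     # start a new local alignment
                prev[j - 1] + sub,     # align a[i-1] with b[j-1]
                prev[j] + gap,         # gap in b
                curr[j - 1] + gap,     # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

For example, calling local_alignment_score("XXACGTXX", "YYACGTYY") scores only the shared region "ACGT", since the flanking characters do not match.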
In an effort to work around this issue, a number of algorithms have been proposed which focus on discovering only extensive local similarities. The best known among these algorithms are referred to as FASTA and BLAST. The FASTA algorithm is described in W. R. Pearson and D. J. Lipman, "Improved tools for biological sequence comparison," Proc. Natl. Acad. Sci., 85:2444-2448, 1988; and D. J. Lipman and W. R. Pearson, "Rapid and sensitive protein similarity searches," Science, 227:1435-1441, 1985. The BLAST algorithm is described in S. Altschul, W. Gish, W. Miller, E. W. Myers, and D. Lipman, "A basic local alignment search tool," J. Mol. Biology, 215:403-410, 1990. In the majority of cases, increased performance is achieved by first looking for ungapped homologies, i.e., similarities due exclusively to mutations and not insertions or deletions. The rationale behind this approach is that in any substantial gapped homology between two peptide strings, chances are that there exists at least a pair of substrings whose match contains no gaps. The locating of these substrings (the ungapped homology) can then be used as the first step towards obtaining the entire (gapped) homology.
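The ungapped-first strategy described above can be sketched as follows. This is a simplified Python illustration and not the FASTA or BLAST algorithm itself: the function names, the word length k, the match and mismatch values, and the drop-off threshold x_drop are all assumptions made for the example. Exact shared words are located first, and each is then extended along its diagonal, tolerating mutations but never inserting gaps.

```python
def find_seeds(query, target, k=3):
    """Return (i, j) pairs where query[i:i+k] equals target[j:j+k]."""
    index = {}
    for j in range(len(target) - k + 1):
        index.setdefault(target[j:j + k], []).append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def extend_right(query, target, i, j, match=1, mismatch=-1, x_drop=2):
    """Extend without gaps from (i, j); return (best_score, best_length).

    Extension stops once the running score falls x_drop below the best
    score seen so far (an assumed, simplified stopping rule).
    """
    score = best = 0
    best_len = 0
    n = 0
    while i + n < len(query) and j + n < len(target):
        score += match if query[i + n] == target[j + n] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score >= x_drop:
            break
    return best, best_len
```

A gapped alignment method could then be applied only around the diagonals where such extensions score well, rather than over every pair of positions.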
Identifying the similar regions between the query and the database sequences is, however, only the first part (the computationally most demanding) of the process. The second part (the one that is of interest to biologists) is evaluating these similarities, i.e., deciding if they are substantial enough to sustain the inferred relation (functional, structural or otherwise) between the query and the corresponding database sequence(s). Such evaluations are usually performed by combining biological information and statistical reasoning. Typically, similarity is quantified as a score computed for every pair of related regions. Computation of this score involves the use of gap costs (for gapped alignments) and of appropriate mutation matrices giving the evolutionary probability of any given amino acid changing into another. Examples of these matrices are the PAM matrix (see M. O. Dayhoff, R. M. Schwartz and B. C. Orcutt, "A model of evolutionary change in proteins," Atlas of Protein Sequence and Structure, 5:345-352, 1978) and the BLOSUM matrix (see S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proc. Natl. Acad. Sci., 89:10915-10919, 1992). Then, the statistical importance of this score is evaluated by computing the probability (under some statistical model) that such a score could arise purely by chance, e.g., see S. Karlin, A. Dembo and T. Kawabata, "Statistical composition of high-scoring segments from molecular sequences," The Annals of Statistics, 2:571-581, 1990; and S. Karlin and S. Altschul, "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes," Proc. Natl. Acad. Sci., 87:2264-2268, 1990. Depending on the statistical model used, this probability can depend on a number of factors such as the length of the query sequence, the size of the underlying database, etc.
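As a hedged illustration of the scoring and significance steps just described, the following Python sketch sums substitution scores over an ungapped pair of regions and converts the score into an expected number of chance hits using the Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m is the query length and n the database size. The matrix entries are made-up placeholders and not real PAM or BLOSUM values, the parameters lam and k would have to be estimated for a real scoring scheme, and the function names are hypothetical.

```python
import math

# Toy substitution scores (illustrative placeholders, NOT real PAM/BLOSUM values).
TOY_MATRIX = {("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4, ("L", "I"): 2}


def ungapped_score(region_a, region_b, matrix, default=-1):
    """Sum substitution scores over two equal-length, ungapped regions."""
    if len(region_a) != len(region_b):
        raise ValueError("ungapped regions must have equal length")
    total = 0
    for x, y in zip(region_a, region_b):
        # Look the pair up in either order; unknown pairs get the default.
        total += matrix.get((x, y), matrix.get((y, x), default))
    return total


def expected_chance_hits(score, query_len, db_len, lam, k):
    """Karlin-Altschul style estimate E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)
```

Under this form, the same score becomes less significant as the query or the database grows, which matches the dependence on query length and database size noted above.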
Whatever conventional statistical model one uses, however, there are always so-called "gray areas," i.e., situations where a statistically unimportant score actually indicates a biologically important similarity. Unfortunate as this might be, it is also inescapable; there is after all a limit to how well a statistical model can approximate the biological reality.
An alternative to the inherent difficulty of attaching statistical importance to weak similarities is the use of biological knowledge in deducing sequence descriptors that model evolutionarily distant homologies. BLOCKS (see S. Henikoff and J. Henikoff, "Automatic Assembly of Protein Blocks for Database Searching," Nucleic Acids Research, 19:6565-6572, 1991) is a system that employs pattern-induced profiles obtained over the protein classification defined in the PROSITE (see S. Henikoff and J. Henikoff, "Protein Family Classification Based on Searching a Database of Blocks," Genomics, Vol. 19, pp. 97-107, 1994) database in order to functionally annotate new genes. The advantage here is that this classification is compiled by experts working with families of proteins known to be related. As a result, even weak similarities can be recognized and used in the annotation process. On the other hand, there is only so much knowledge about which proteins are indeed related and can consequently be represented by a pattern. Furthermore, there is always the danger that a family of proteins actually contains more members than is currently thought. By excluding these other members from consideration, it is possible to get patterns that "overfit" the family, i.e., they are too strict to extrapolate to the unidentified family members.
Therefore, it is evident that there exists a need for methods and apparatus for creating improved pattern dictionaries through unique dictionary formation techniques that permit improved sequence homology detection, as well as a need for methods and apparatus for sequence homology detection, itself, which are not limited to searching only annotated sequences.
The present invention provides solutions to the above and other needs by providing improved pattern dictionary formation techniques and improved sequence homology detection techniques, as will be described in greater detail below.
In a sequence homology detection aspect of the invention, a computer-based method of detecting homologies between a plurality of sequences in a database and a query sequence comprises the following steps. First, the method includes accessing patterns associated with the database, each pattern representing at least a portion of one or more sequences in the database. Next, the query sequence is compared to the patterns to detect whether one or more portions of the query sequence are homologous to portions of the sequences of the database represented by the patterns. Then, a score is generated for each sequence detected to be homologous to the query sequence, wherein the sequence score is based on individual scores generated in accordance with each homologous portion of the sequence detected, and the sequence score represents a degree of homology between the query sequence and the detected sequence.
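The three steps of this aspect (accessing the patterns, comparing the query against them, and scoring each detected sequence) can be sketched in Python as below. This is a minimal illustration under stated assumptions and not the scoring method of the invention: patterns are assumed to be fixed-length strings in which "." matches any one character, the pattern dictionary is assumed to map each pattern to a list of (sequence identifier, offset) occurrences, the individual score of a matched region is assumed to be the number of identical characters, and the sequence score is assumed to be the sum of those individual scores. All names are hypothetical.

```python
import re
from collections import defaultdict


def detect_homologies(query, sequences, pattern_index):
    """Score each database sequence that shares a pattern with the query.

    sequences:     dict mapping sequence id -> sequence string
    pattern_index: dict mapping pattern -> list of (sequence id, offset)
    """
    scores = defaultdict(int)
    for pattern, occurrences in pattern_index.items():
        # A lookahead group finds overlapping matches of the pattern.
        regex = re.compile("(?=(" + pattern + "))")
        for m in regex.finditer(query):
            q_region = m.group(1)
            for seq_id, offset in occurrences:
                d_region = sequences[seq_id][offset:offset + len(q_region)]
                # Individual score: count of identical positions (assumed).
                scores[seq_id] += sum(x == y for x, y in zip(q_region, d_region))
    return dict(scores)
```

Sequences that share no pattern with the query never appear in the result, so the comparison work is driven by the patterns rather than by every database sequence.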
In a dictionary formation aspect of the invention, a computer-based method of processing a plurality of sequences in a database comprises the following steps. First, the method includes evaluating each of the plurality of sequences including characters which form each sequence. Then, at least one pattern of characters is generated representing at least a subset of the sequences in the database. The pattern has a statistical significance associated therewith, the statistical significance of the pattern being determined by a value representing a minimum number of sequences that the pattern supports in the database.
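The role of the minimum support value can be illustrated with a deliberately simplified Python sketch. It is an assumption of this example that patterns are plain substrings of fixed length k; the pattern discovery of the invention is not reproduced here, and the function and parameter names are hypothetical. A candidate is kept only if it occurs in at least min_support distinct sequences of the database.

```python
from collections import defaultdict


def build_pattern_index(sequences, k=4, min_support=2):
    """Map each length-k substring to its occurrences, keeping only
    substrings found in at least min_support distinct sequences.

    sequences: dict mapping sequence id -> sequence string
    """
    occurrences = defaultdict(list)
    for seq_id, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            occurrences[seq[offset:offset + k]].append((seq_id, offset))
    return {
        pattern: occ
        for pattern, occ in occurrences.items()
        # Support counts distinct sequences, not total occurrences.
        if len({seq_id for seq_id, _ in occ}) >= min_support
    }
```

Note that a substring repeated many times within a single sequence does not pass the filter, because support is counted over sequences rather than occurrences.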
Accordingly, in a significant departure from prior art approaches, the methodologies of the invention are based on unsupervised pattern discovery performed on arbitrary databases without requiring any prior partition of the database. The BLOCKS approach assumes that the database has been partitioned (by outside experts) into subsets of biologically related sequences. Profiles are then obtained by individually processing each subset. As a result of this approach, BLOCKS cannot handle arbitrary databases, since not all such databases are partitioned into related subsets. In fact, BLOCKS works only with the SwissProt database, referenced herein, using the protein groups described in the PROSITE database, also referenced herein. The present invention, on the other hand, preferably uses the entire database as its input and provides for an automated methodology to decide which patterns are important and which are not.
Further, the present invention provides a new statistical framework for evaluating the statistical importance of the discovered patterns. Unlike existing frameworks, the approach of the invention introduces the concept of memory in its computations. That is, for example, when a region A on the query sequence is compared with a region B on some database sequence, the resulting similarity score is evaluated by taking into account the similarity of A to all the other sequences in the database.
The use of the enhanced statistical model described herein allows the detection of important local similarities which, using existing approaches, would go undetected. This allows the system of the invention to perform similarity searches at a higher level of sensitivity than is possible using prior art systems.
Still further, the present invention provides an automated method to utilize partial annotation information available in the underlying database D. This methodology allows the user to exploit in greater detail similarities that seem unimportant. For example, when a pattern matches the query sequence region A, all the database regions also matching that pattern can be inspected. If all (or most) of these database regions are annotated in the same way, then this annotation can be transferred to the query region A. Partially annotating the query sequence in the above manner can prove useful towards the overall sequence annotation.
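The annotation transfer rule described above might be sketched in Python as follows. The function name, the use of None for an unannotated region, and the min_fraction threshold are assumptions introduced for this example; the text itself only requires that all, or nearly all, of the matching database regions carry the same annotation.

```python
from collections import Counter


def transfer_annotation(region_annotations, min_fraction=1.0):
    """Return the shared annotation of the database regions matching a
    pattern, or None if no annotation is sufficiently dominant.

    region_annotations: one entry per matching database region;
                        None marks a region with no annotation.
    """
    if not region_annotations:
        return None
    counts = Counter(a for a in region_annotations if a is not None)
    if not counts:
        return None
    label, count = counts.most_common(1)[0]
    # Transfer only if enough of ALL matching regions agree.
    if count / len(region_annotations) >= min_fraction:
        return label
    return None
```

With the default threshold, a single dissenting or unannotated region blocks the transfer; lowering min_fraction relaxes the rule toward a majority agreement.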
The present invention also provides for a detailed methodology to cluster the database into groups of highly homologous sequences. In a genetic data processing application, this methodology allows for the correct treatment of multi-domain proteins.
It is also to be appreciated that the inventive concepts described herein may be implemented on a network such as, for example, the Internet, in a client-server relationship. This allows a user to enter a query sequence at a client device at a remote location that is transmitted to a server over the network and processed at the server. The server then returns the results of the homology search to the client device of the user via the network.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.