The present invention relates generally to the analysis of sequence data. More particularly, the present invention pertains to defining physical-chemical property (PCP) based sequence motifs common to a family of related proteins and/or the use thereof in analysis of sequence data (e.g., DNA, RNA, amino acids), such as, for example, searching genomic sequence databases to identify homologues of the family of proteins.
As sequencing technology improves, the amount of sequence data available for analysis is accumulating rapidly. The ability to analyze such sequence data depends significantly on the development of advanced computational tools for rapid and accurate annotation of genomic sequences as to the probable structure and function of the proteins they encode.
As such, one of the most challenging goals of genome sequencing projects is to functionally annotate novel gene products (Kelley, et al., 2000 and Rison, et al., 2000). A sequence can be recognized as a homologue of a known protein if the pair-wise sequence identity/similarity exceeds a statistically derived threshold (e.g., more than 30% sequence identity or an E-value less than 0.001) (Chothia, et al., 1986). These global criteria identify only a small fraction of proteins known to be functionally related, as amino acid patterns are not uniformly conserved.
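By way of illustration only, the pair-wise identity criterion mentioned above can be sketched as follows. The function names `percent_identity` and `is_candidate_homologue` are hypothetical, the two sequences are assumed to be already aligned (gaps written as `-`), and identity is assumed to be counted over ungapped aligned positions; other conventions for the denominator exist and give different percentages.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the ungapped columns of two pre-aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are excluded from the count
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def is_candidate_homologue(aligned_a: str, aligned_b: str,
                           threshold: float = 30.0) -> bool:
    """Apply the example criterion: more than `threshold` percent identity."""
    return percent_identity(aligned_a, aligned_b) > threshold
```

Such a global test considers only exact residue matches, so a pair of sequences that conserves a property (e.g., hydrophobicity) at a position while varying the residue itself receives no credit at that position.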
Determining the similarity of sequences in databases to that of proteins of known function has been one of the most direct computational ways of deciphering codes that connect molecular sequences to protein structure and function. There are various algorithms and software available for sequence database searching and sequence analysis which, for example, may provide for comparisons between query sequences and sequence data (e.g., sequence data in a molecular database). Sequence profile searches (Bowie, et al., 1991; Gribskov, et al., 1996; Mehta, et al., 1999; Rychlewski, et al., 2000; and Schaffer, et al., 1999) and Hidden Markov Models (Eddy, S. R., 1998) generate position specific fingerprints of the amino acid sequences in protein families and can identify distantly related proteins. However, the optimal choice of parameters for high sensitivity/specificity depends on the expert user. A further complication is that enzymes often combine functional elements to create a specific catalytic center. These elements, due to crossover events, may not occur in the same linear order in the sequences of related proteins and are therefore not found with global profiles.
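For illustration, a position specific fingerprint of the kind referred to above can be sketched as a simple log-odds profile built from an ungapped alignment block. This is a minimal sketch and not the method of any of the cited references: the names `build_profile`, `score_window`, and `best_match` are hypothetical, a uniform background frequency of 1/20 is assumed, and a fixed pseudocount is assumed in place of the more refined weighting schemes used by published tools.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict per column mapping residue -> log2 odds score."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_window(profile, window):
    """Sum the column scores for a window; unknown residues score 0."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, window))


def best_match(profile, sequence):
    """Slide the profile along the sequence; return (start index, score)."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    starts = range(len(sequence) - width + 1)
    best = max(starts, key=lambda i: score_window(profile, sequence[i:i + width]))
    return best, score_window(profile, sequence[best:best + width])
```

Because each block is scored independently of its position in the query, a short local profile of this kind can be located wherever it occurs, whereas a single global profile presumes that the conserved elements appear in the same linear order.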
Analytical tools that use statistically derived matrices based on allowed substitution of amino acids are not designed to detect conservation of physical-chemical properties. For example, such tools include those available under the trademark or the trade designation of FASTA, PSI-BLAST, or BLOCKS, such as described in Pearson W., Rapid and sensitive sequence comparison with FASTP and FASTA, Methods in Enzymology, 1990, 183:63-98; Schaffer et al., Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements, Nucleic Acids Res, 2001, 29(14): 2994-3005; Schaffer et al., IMPALA: matching a protein sequence against a collection of PSI-BLAST constructed position-specific score matrices, Bioinformatics, 1999, 15: 1000-1011; Altschul et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res, 1997, 25(17): 3389-3402; and Henikoff et al., Increased coverage of protein families with the blocks database servers, Nucleic Acids Res, 2000, 28: 228-230.
Other database searching tools that are based on information derived from a family of related proteins include, for example, screening for motif patterns such as described in U.S. Pat. No. 5,845,049, to Wu, issued Dec. 1, 1998, and entitled “Neural network system with n-gram term weighting method for molecular sequence classification and motif identification.” Wu describes a method using a neural network that is trained for extraction of sequence motifs. Further, for example, other analysis processes may use other fold recognition tools (e.g., U.S. Pat. No. 6,512,981 B1, to Eisenberg, et al., issued Jan. 28, 2003, entitled “Protein fold recognition using sequence-derived predictions”). Eisenberg et al. describe a method that relies, to a large extent, on the knowledge of 3D protein structures for fold assignment.
Most genome sequencing projects represent their results in databases including large collections of sequences. As such, a critical step in the selection of potential drug targets among novel gene products is functional annotation. Global sequence similarity criteria can only identify a small fraction of proteins known to be functionally related, as amino acid patterns are not uniformly conserved and it is not known what physical-chemical properties are conserved. The large number of potential physical-chemical properties makes it difficult to know a priori which of these properties are conserved and at what positions in the protein sequence. A process for deriving five descriptors for amino acids using 237 physical-chemical properties is described in the article by Venkatarajan, M. S. and Braun, W. (2001), entitled “New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties,” J. Mol. Model., 7, pp. 445-453.
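To illustrate the distinction between residue conservation and property conservation drawn above, the following sketch summarizes each column of an ungapped alignment in two ways: the fraction of sequences sharing the most common residue, and the spread of a single physical-chemical property across the column. It is a sketch under stated assumptions and not the descriptor process of Venkatarajan and Braun: the Kyte-Doolittle hydropathy scale is used here as a stand-in for one property, and the name `column_summary` is hypothetical.

```python
from collections import Counter
from statistics import pstdev

# Kyte-Doolittle hydropathy index, used here as a single stand-in property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def column_summary(alignment):
    """For each column return (top-residue fraction, hydropathy std. dev.)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    rows = []
    for col in range(length):
        residues = [seq[col] for seq in alignment]
        # Fraction of sequences carrying the most common residue.
        top_fraction = Counter(residues).most_common(1)[0][1] / len(residues)
        # Population standard deviation of the property in this column.
        spread = pstdev([KYTE_DOOLITTLE[r] for r in residues])
        rows.append((top_fraction, spread))
    return rows
```

For the toy alignment `["HII", "HLD", "HVG"]`, the second column (I, L, V) has low residue identity but a small hydropathy spread, while the third column (I, D, G) has equally low identity and a large spread. An identity-based criterion treats these two columns alike, although only the former conserves the property.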
The available or described sequence data analysis methods range from very sensitive but computationally intensive algorithms to relatively rapid but less sensitive analysis methods. As such, although various analysis tools are available, there is still a need for the development of database search methods that are relatively rapid and also relatively more sensitive. Further, there is always a need for processes that use functional information (e.g., physical-chemical property information) to effectively extract useful information from sequence data being searched.