Within the past few years, the amount of biological information available in databases and accessible via the World Wide Web has been increasing at a fast pace. The largest part of this information consists of DNA sequences produced by increasingly efficient DNA sequencing methods. However, DNA sequencing methods provide only raw data, in which the scientist then has to find what is important. The important parts may be coding sequences, splice sites, regulatory sequences like promoters and terminators, polyadenylation sites, etc. Selecting the sequence of interest from the wealth of sequence data is essential, since the “real” experiments performed at the laboratory bench to analyze the molecules containing the sequence and/or their products require a large investment of time and resources. Experiments based on molecules taken from the database aim at elucidating the structure and function of these biomolecules. These experiments may then lead to the discovery of new drugs or drug targets, for example.
Therefore, the sequence data present in a database has to be carefully analyzed and evaluated, in order to sort out the sequences of interest to the particular research project.
A researcher interested in a certain protein or protein family (i.e. related proteins sharing common motifs, which may be domains or certain amino acid residues or patterns of residues) is often faced with the problem that only one member, in one specific type of organism, has been characterized. It is known that the sequences of homologous proteins can diverge greatly in different organisms, even though their structure or function changes little. Thus, much can be inferred about an uncharacterized protein when significant sequence similarity to a well-studied protein is detected. Therefore, a database search, i.e. a sequence comparison or alignment, is performed in order to find other family members and/or related molecules in other types of organisms. Homologous family members in different organisms are called orthologs.
Databases like Swissprot, GenBank or the EMBL (European Molecular Biology Laboratory) Data Library are large archives of sequence data. They contain sets of sequences stemming from different organisms. In these databases, searches for orthologs can be performed starting from a query sequence, which is aligned with the sequences in the database, the target sequences. A score defining the similarity is computed for each alignment, and the query-target pairs are reported to the user. A threshold or “cut-off value” can be set for the score or similarity value, so that only those pairs whose similarity exceeds the threshold are reported to the user.
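The cut-off step described above can be sketched in a few lines of Python. This is only an illustration: the `Hit` record, the `report_hits` function and the example scores are hypothetical and do not correspond to the output format of any particular database or search program.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One query-target pair together with its alignment score (hypothetical record)."""
    query: str
    target: str
    score: float


def report_hits(hits: Iterable[Hit], cutoff: float) -> List[Hit]:
    """Keep only pairs whose score exceeds the cut-off value, best score first."""
    kept = [h for h in hits if h.score > cutoff]
    return sorted(kept, key=lambda h: h.score, reverse=True)


# Made-up scores, for illustration only.
example_hits = [
    Hit("query1", "targetA", 35.0),
    Hit("query1", "targetB", 212.5),
    Hit("query1", "targetC", 80.0),
]
reported = report_hits(example_hits, cutoff=50.0)
```

With a cut-off of 50, only the pairs with `targetB` and `targetC` would be reported, in that order.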
Different programs or algorithms have been developed to perform database searches. The Smith-Waterman algorithm (1) rigorously compares the query sequence with every target sequence in the database. This algorithm requires time proportional to the product of the lengths of the sequences compared. Without special-purpose hardware or massively parallel machines, the Smith-Waterman algorithm is usually too slow for most users. Much quicker programs for database searches use heuristics to speed up the alignment procedure. The most commonly used programs of this kind are BLAST and FASTA, both of which concentrate the alignment on the sequence regions most likely to be related. Rapid exact-match procedures first identify promising regions, and only then is the Smith-Waterman method applied.
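The following Python sketch illustrates the two ideas in this paragraph: a rigorous local alignment score whose nested loops make the cost proportional to the product of the sequence lengths, and an exact-match prefilter that restricts the rigorous step to promising targets. It is a simplified illustration, not the implementation used by BLAST or FASTA. The scoring values (match 2, mismatch -1, linear gap penalty -2), the word length `k = 3` and the function names are assumptions chosen for the example.

```python
from typing import Dict


def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score with a linear gap penalty (illustrative scoring).

    The two nested loops visit every (query, target) position pair once,
    which is why the running time grows with len(query) * len(target).
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def shares_word(query: str, target: str, k: int = 3) -> bool:
    """Cheap exact-match test: do the sequences share any word of length k?"""
    words = {query[i:i + k] for i in range(len(query) - k + 1)}
    return any(target[j:j + k] in words for j in range(len(target) - k + 1))


def heuristic_search(query: str, database: Dict[str, str], k: int = 3) -> Dict[str, int]:
    """Run the rigorous scoring only on targets that pass the exact-match filter."""
    return {name: smith_waterman_score(query, seq)
            for name, seq in database.items()
            if shares_word(query, seq, k)}


toy_database = {"a": "TTACGTTT", "b": "GGGGGG"}  # made-up sequences
toy_result = heuristic_search("ACGT", toy_database)
```

In this toy run, target `b` shares no three-letter word with the query and is skipped without any dynamic programming. The trade-off is that a prefilter of this kind can miss related sequences that have no exact word in common with the query.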
Newly identified DNA sequences can be classified using known nucleic acid or amino acid sequence motifs that indicate particular structural or functional elements. The motifs can then be used for predicting the function of a newly identified sequence.
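Motif-based classification can be illustrated with a short Python sketch that scans a protein sequence for a set of named patterns written as regular expressions. The function name, the motif dictionary and the test sequence are hypothetical. The single pattern shown, N-{P}-[ST]-{P}, is given only as a familiar example of the pattern style used for N-glycosylation sites; a real analysis would draw on a curated motif collection.

```python
import re
from typing import Dict, List, Tuple


def find_motifs(sequence: str, motifs: Dict[str, str]) -> List[Tuple[str, int, str]]:
    """Return (motif name, 0-based start, matched text) for each non-overlapping match."""
    found = []
    for name, pattern in motifs.items():
        for m in re.finditer(pattern, sequence):
            found.append((name, m.start(), m.group()))
    return found


# Illustrative motif set: Asn, any residue except Pro, Ser or Thr, any residue except Pro.
example_motifs = {"N-glycosylation-like": r"N[^P][ST][^P]"}
motif_hits = find_motifs("MKNGSAL", example_motifs)  # made-up sequence
```

A sequence that yields matches for motifs associated with a known function becomes a candidate for that function, which then has to be confirmed by other evidence.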
More sensitive sequence comparisons can be carried out using sequence families, preferably conserving certain critical residues and motifs. All the members of the family or putative family members are used for the search. Using multiple sequence comparisons, gene functions may be revealed that are not clear from simple sequence homologies.
In order to find orthologous proteins, Chervitz et al. (2) performed an exhaustive comparison of the complete protein sets of the nematode worm Caenorhabditis elegans and the budding yeast Saccharomyces cerevisiae. Both the yeast genome and the genome of the nematode C. elegans had previously been sequenced in their entirety (3, 4).
In order to find orthologous relationships, Chervitz et al. performed a reciprocal Washington University (WU)-BLAST analysis (described in 5, 6 and 7). They compared the predicted yeast proteins (6217 ORFs) against all the predicted proteins of the worm (19 099 ORFs) and vice versa, i.e. they performed a reciprocal sequence comparison. Good alignments were detected and grouped together. The groups were then ordered according to their similarity and displayed as multiple sequence alignments, rooted cluster dendrograms and unrooted trees.
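The principle of a reciprocal comparison can be sketched in Python as follows. The sketch uses the common "reciprocal best hit" simplification, in which two proteins are paired only if each is the other's best-scoring match; this is not necessarily the exact procedure of Chervitz et al. The shared-word count stands in for a real alignment score such as one produced by WU-BLAST, and all function names and sequences are made up for the example.

```python
from typing import Callable, Dict, List, Tuple


def word_similarity(a: str, b: str, k: int = 3) -> int:
    """Stand-in similarity score: number of distinct length-k words shared by a and b."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(words_a & words_b)


def best_hits(queries: Dict[str, str], targets: Dict[str, str],
              score: Callable[[str, str], int], cutoff: int) -> Dict[str, str]:
    """Map each query name to its best-scoring target, if the score exceeds the cut-off."""
    best = {}
    for q_name, q_seq in queries.items():
        scored = [(score(q_seq, t_seq), t_name) for t_name, t_seq in targets.items()]
        if scored:
            top_score, top_name = max(scored)  # ties are broken by target name
            if top_score > cutoff:
                best[q_name] = top_name
    return best


def reciprocal_best_hits(set_a: Dict[str, str], set_b: Dict[str, str],
                         score: Callable[[str, str], int] = word_similarity,
                         cutoff: int = 0) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    a_to_b = best_hits(set_a, set_b, score, cutoff)
    b_to_a = best_hits(set_b, set_a, score, cutoff)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Made-up toy protein sets, for illustration only.
set_one = {"y1": "MKTAYIAKQR", "y2": "GGSLLPWWDE"}
set_two = {"w1": "MKTAYIAKQK", "w2": "GGSLLPWYDE", "w3": "AAAAAAAAAA"}
pairs = reciprocal_best_hits(set_one, set_two)
```

In this toy example, `w3` has no match above the cut-off in either direction and is therefore left unpaired, while `y1`/`w1` and `y2`/`w2` are reported as candidate pairs.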
This analysis showed that orthologous relationships were identifiable for a substantial fraction of the yeast and worm genes. This approach of identifying orthologous relationships between different species serves to find protein functions and activities in newly sequenced genomes.
Reciprocal sequence comparisons are therefore a powerful tool for helping researchers identify their potential target in the database and then design experiments for the specific molecule identified.
One of the difficulties in analyzing the results of database searches as outlined above is the amount of data output obtained by the search. The output has to be carefully evaluated in order to select the significant data from the “background”.
Another difficulty is the ambiguity of results presented as dendrograms or trees: pairs of orthologs are not readily evident in such displays, if they are detectable at all.
A further critical item is the reliability of the analysis. Researchers have to be sure that the sequences they have found are unequivocally and truly orthologous pairs, i.e. that they have actually, or at least very likely, found sequences coding for proteins or domains having a certain activity. The closer the evolutionary link between the organisms compared, the more likely these kinds of database searches are to succeed in finding orthologs.
However, most sequence information available today is derived either from mammalian species or from very simple life forms. This situation will be even more lopsided when the full human genomic sequence is known.
The explanation for this situation is that simple organisms have relatively small genomes which are accessible to manipulation, whereas mammalian (human) genetic data are essential as the immediate starting point for the development of pharmaceutical derivatives. But in order to infer the function of a mammalian gene from the analysis of a related gene (an ortholog) of a worm or a fly, for instance by deleting the orthologous gene, one has to be reasonably certain about the evolutionary relationship between those two genes.
The avalanche of sequencing data has increased the number of mammalian genes whose function can potentially be studied in lower organisms, but owing to the lack of sequences from evolutionarily “intermediate” species it is usually impossible to trace genes all the way through evolutionary trees. This problem is especially prominent for gene families with numerous members, such as kinases, phosphatases and receptors.
As mentioned above, among the multicellular organisms, the genome of the nematode worm Caenorhabditis elegans (C. elegans) has been sequenced in totality (4). Although medical and pharmacological interests tend to focus on mammalian genes, only simple life forms like the nematode allow rapid genetic manipulation and functional analysis. A prerequisite for the meaningful extrapolation of gene functional studies from invertebrates to man is that the pairs of related genes, the orthologs, under study are really related, i.e. unambiguously linked.