The present specification contains a 3 page computer program listing which appears as a microfiche (3 frames) and 131 frames.
The present invention relates to methods for identifying novel genes comprising: (i) generating one or more specialized databases containing information on gene/protein structure, function and/or regulatory interactions; and (ii) searching the specialized databases for homology or for a particular motif and thereby identifying a putative novel gene of interest. The invention may further comprise performing simulation and hypothesis testing to identify or confirm that the putative gene is a novel gene of interest.
Specifically, the present invention provides for the generation of specialized databases containing information on gene/protein structure, function and regulatory interactions based on the retrieval of such information from research articles and databases, and computer representation of such information in a manner that allows efficient access to the extracted information. The invention further provides for the use of the specialized databases for identifying novel genes based on detection of sequence similarities and domain/motif matches between genes/proteins, computation and interpretation of phylogenetic trees for multigene families, and analysis of homologous regulatory networks. The methods of the invention are based on the observation that functionally similar regulatory systems are generated during evolution by genetic duplication of ancestral genes. Thus, a comparison of homologous/similar networks within the same organism and between different species will allow the identification of genes absent in one of the systems under comparison. In this way genes that contribute to the phenotype of a specific disease associated with a particular biological system under analysis may be identified.
A variety of different methods are currently utilized for the identification and characterization of novel genes. Perhaps the most widely used method for generating large quantities of sequence information is via high throughput nucleotide sequencing of random DNA fragments. A disadvantage associated with this gene discovery technique is that in most instances when genes are identified their function is unknown.
For identification of specific disease genes, positional cloning is currently the most efficient method in use. The positional cloning approach combines methods of formal genetics, physical mapping and mutation analysis and usually starts with a precise description of the disease phenotype and a tracing of the disease through families of affected individuals. Genetic linkage data obtained from the analysis of affected families frequently allows the determination of an approximate genomic localization of the candidate disease gene with a precision of several million nucleotides. Once localized, the genetically defined chromosomal region is then recovered from genomic libraries as a contiguous set of genomic fragments. Genes residing in the disease-related region are determined by analysis of transcripts that are transcribed from the genomic fragments. From this analysis an initial set of candidate genes for a particular disease is identified based on the presence of the gene product in the biological system affected by the disease and a correlation between its expression pattern and the pattern of disease progression.
Important information for selection of candidate genes also comes from analysis of their homology with genes known to be part of the same or related biological system. Finally, the ultimate proof of association between a gene and a genetic disorder comes from mutational analysis of a gene in patients affected by the disorder and from demonstration of a statistical correlation between occurrence of mutation and the disease phenotype.
Although positional cloning is a powerful method for gene discovery, the experimental method is extremely tedious and expensive. Moreover, disease genes implicated in genetically complex disorders, i.e., those controlled by multiple loci, can hardly be found using this strategy because of the complications associated with multiple loci linkage analysis.
Specialized databases for homology searches have also been utilized in disease gene discovery projects. In recent years a number of efficient sequence comparison tools have been developed such as the BLAST (Basic Local Alignment Search Tool) family of programs designed for comparison of a single “search sequence” with a database (see Altschul et al., 1990, J. Mol. Biol. 215:403-410; Altschul et al., 1997, Nucleic Acids Res. 25:3389-3402), the family of Hidden Markov Model methods for comparison of a set of aligned sequences that usually represent a protein motif or domain with a database (e.g., Krogh et al., 1994, J. Mol. Biol. 235:1501-1531; Grundy et al., 1997, Biochem. Biophys. Res. Commun. 231:760-6) and various other comparison tools (Wu et al., 1996, Comput. Appl. Biosci. 12:109-118; Neuwald et al., 1995, Protein Sci. 4:1618-1632; Neuwald, 1997, Nucleic Acids Res. 25:1665-1677).
When used in disease gene discovery projects, homology searches can be enhanced by creating specialized databases that utilize statistical analysis for evaluating the significance of sequence similarities when new sequences are compared with a database of known sequences. Such statistical evaluations depend on the size of the database used (Altschul et al., 1990, J. Mol. Biol. 215:403-410; Altschul et al., 1997, Nucleic Acids Res. 25:3389-3402), so that the same level of homology between a search sequence and a database sequence can be determined to be highly significant if the search sequence is compared with a smaller database, or insignificant and thus undetectable if the search sequence is compared with a larger database.
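The effect of database size on significance can be illustrated with the Karlin-Altschul expectation value used by the BLAST programs, E = K * m * n * exp(-lambda * S), where m is the query length, n is the total length of the database and S is the raw alignment score. The following is a minimal sketch only; the function name, the database sizes, the score and the values of K and lambda are illustrative assumptions and are not taken from the present specification.

```python
import math

def expected_hits(raw_score, query_len, db_len, lam=0.267, k=0.041):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S).

    lam and k are illustrative values assumed for this sketch; in practice
    they depend on the scoring matrix and gap penalties in use.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)

# Hypothetical inputs: one alignment score evaluated against two databases.
score = 80
query_len = 300
small_db = 1_000_000        # assumed specialized database, total residues
large_db = 1_000_000_000    # assumed general-purpose database, total residues

e_small = expected_hits(score, query_len, small_db)
e_large = expected_hits(score, query_len, large_db)

print(f"E-value in small database: {e_small:.3g}")
print(f"E-value in large database: {e_large:.3g}")
```

Under these assumed numbers the same alignment score yields an expectation value below 0.01 in the smaller database and above 1 in the larger one, since E scales linearly with the database length n.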
In projects oriented towards gene discovery, researchers usually have some a priori knowledge about the set of genes/proteins that might display important similarity to the unknown new gene. Therefore, as an alternative to standard homology searches, selecting an a priori defined set of genes/proteins for comparison with new experimental sequences is a feasible and useful strategy. This strategy was successfully applied to search for homologs of disease genes in yeast and nematode genomes by Mushegian et al. (1997, Proc. Natl. Acad. Sci. USA 94:5831-5836).
Two homologous genes taken from different species that originate from the nearest common ancestor by speciation are referred to as orthologs, while any two genes that originate from a common ancestor via a series of events involving intragenomic duplications are called paralogs. Tatusov et al. (1994, Proc. Natl. Acad. Sci. USA 91:12091-12095) describe comparisons of proteins encoded by the genomes of different phylogenetic lineages and elucidation of consistent patterns of sequence similarities permitting the delineation of clusters of orthologous groups (COGs). Each COG consists of individual orthologous genes or orthologous groups of paralogs from different phylogenetic lineages. Since orthologs typically have the same function, the classification of known genes and proteins into clusters of orthologous groups permits the assignment of a function to a newly discovered gene or protein by merely classifying it into a COG. Although Tatusov describes a method for assigning a function to a newly discovered gene, he does not describe a method for predicting the existence of undiscovered genes. In addition, Yuan et al. attempted simultaneous reconstruction of a species tree and identification of paralogous groups of sequences and detection of orthologs in sequence databases (Yuan et al., 1998, Bioinformatics 143:285-289).
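The distinction between orthologs and paralogs defined above can be sketched on a small gene tree. The example below uses the species-overlap heuristic, in which an internal node is treated as a duplication when the species sets of its two subtrees overlap and as a speciation otherwise. This heuristic is substituted here for illustration and is not the method of Tatusov et al. or of Yuan et al.; the tree encoding, the function names and the example tree are assumptions made for this sketch.

```python
from itertools import product

def leaves(node):
    """Return the (species, gene) leaves found under a nested-tuple node."""
    if isinstance(node[0], str):          # a leaf is ("species", "gene")
        return [node]
    return [leaf for child in node for leaf in leaves(child)]

def classify_pairs(node, out=None):
    """Label every gene pair by the event at its last common ancestor.

    Species-overlap heuristic (an assumption of this sketch): if the two
    subtrees of a node share a species, the node is a duplication and the
    pairs it separates are paralogs; otherwise they are orthologs.
    """
    out = {} if out is None else out
    if isinstance(node[0], str):
        return out
    left, right = node
    l_leaves, r_leaves = leaves(left), leaves(right)
    shared = {sp for sp, _ in l_leaves} & {sp for sp, _ in r_leaves}
    relation = "paralog" if shared else "ortholog"
    for (_, g1), (_, g2) in product(l_leaves, r_leaves):
        out[frozenset((g1, g2))] = relation
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out

# Illustrative tree: an ancestral duplication followed by a speciation
# in each of the two resulting lineages.
tree = (
    (("human", "HBA"), ("mouse", "Hba")),
    (("human", "HBB"), ("mouse", "Hbb")),
)
relations = classify_pairs(tree)
for pair, relation in sorted((sorted(p), r) for p, r in relations.items()):
    print(pair, relation)
```

In this example the two genes within each subtree are labeled orthologs, while every pair spanning the root is labeled a paralog because both subtrees contain human and mouse genes.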
Other groups have aimed at capturing interactions among molecules through the use of programs designed to compare structures and functions of proteins (Kazic, 1994, In: Molecular Modeling: From Virtual Tools to Real Problems, Kumosinski, T. and Liebman, M. N. (Eds.), American Chemical Society, Washington, D.C. pp. 486-494; Kazic, 1994, In: New Data Challenges in Our Information Age, Glaesar, P. S. and Millward, M. T. L. (Eds.), Proceedings of the Thirteenth International CODATA Secretariat, Paris pp. C133-C140; Goto et al., 1997, Pac. Symp. Biocomput. p. 175-186; Bono et al., 1998, Genome Res. 8:203-210; Selkov et al., 1996, Nucleic Acids Res. 24:26-28). These projects are significantly different from the inventive methods described herein because they do not describe methods for deducing the existence of as yet unknown genes based on comparisons of regulatory pathways and gene structure between one or more species. The present invention provides a method for increasing the sensitivity of analysis methods through the generation of specialized databases.
In accordance with the present invention there are provided methods for identification of novel genes comprising (i) generating one or more specialized databases containing information on gene/protein structure, function and/or regulatory interactions; and (ii) searching the specialized databases for homology or for a particular motif and thereby identifying a putative novel gene of interest. The invention may further comprise performing simulation and hypothesis testing to identify or confirm that the putative gene is a novel gene of interest.
The invention is based, in part, on the observation that functionally similar regulatory systems are generated during evolution by genetic duplication of ancestral genes. Thus, by comparing phylogenetic trees or regulatory networks and identifying genes and/or proteins absent in one system under comparison, the existence of as yet unidentified genes and/or proteins can be predicted. To make meaningful comparisons of phylogenetic trees it is necessary to distinguish between orthologs and paralogs. The present invention provides a method useful for discriminating between orthologs and paralogs and inferring the existence of as yet unidentified genes and/or proteins.
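The idea of predicting an unidentified gene from its absence in one system under comparison can be sketched as a presence/absence check across gene families. The following is a simplified illustration rather than the program of the invention; the function name, the `min_support` threshold, and the family and gene identifiers are hypothetical.

```python
def predict_missing(families, target, min_support=2):
    """List families with no known member in `target` that are represented
    in at least `min_support` other species (threshold assumed for this sketch).
    """
    candidates = []
    for name, members in families.items():
        if members.get(target):
            continue                      # target already has a known member
        support = sorted(
            sp for sp, genes in members.items() if genes and sp != target
        )
        if len(support) >= min_support:
            candidates.append((name, support))
    return candidates

# Hypothetical families: species -> list of known member genes.
families = {
    "famA": {"human": ["A1"], "mouse": ["a1"], "fly": ["a"]},
    "famB": {"human": [], "mouse": ["b1"], "fly": ["b"]},
    "famC": {"mouse": ["c1"]},
}

predicted = predict_missing(families, target="human")
print(predicted)
```

Here only the hypothetical family "famB" is reported for human, because it has members in two other species while "famC" is supported by a single species. A fuller treatment would first separate orthologs from paralogs within each family, as the text notes.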
In accordance with the present invention, specialized databases are developed that contain information on gene/protein structure and interactions based on information derived from preexisting databases and/or research articles including information on interactions among genes and proteins, their domain/motif structure and their subcellular and tissue expression/distribution patterns.
The invention relates to a sequence analysis program which utilizes the specialized database for comparison of a single sequence, processing the output into a sequence alignment, computing phylogenetic trees, and analyzing these trees to predict undiscovered genes. This program also includes a set of tools for generating motif/domain models from multiple sequence alignments of known genes and for using these models for extraction of structurally and/or functionally homologous sequences from databases which contain raw sequence data.
The invention further provides for a simulation and hypothesis testing program which relies on the specialized databases of gene/protein interactions for identifying potentially undiscovered members of multigene families through comparisons of regulatory networks for different species and testing hypotheses with regard to regulatory cascades. A comparison of homologous regulatory networks within the same organism and between different species of organisms will allow the identification of genes absent in one of the systems under comparison, thus providing a set of candidate genes. In this way, genes that contribute to the phenotype of a specific disease associated with a particular biological system under analysis may be identified, mapped and subjected to mutational analysis and functional studies.
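The comparison of homologous regulatory networks described above can be sketched as follows, representing a reference network as a set of directed (regulator, target) edges together with a table of known orthologs in the species under study. This is a minimal sketch under stated assumptions; the data structures, the function name and the gene identifiers are hypothetical and do not reproduce the simulation and hypothesis testing program of the invention.

```python
def missing_network_members(net_ref, ortholog_of):
    """Find reference-network genes with no known ortholog in the query
    species whose network neighbours do have known orthologs.

    net_ref:     set of (regulator, target) edges in the reference species
    ortholog_of: dict mapping a reference gene to its known query ortholog
    Returns a dict: reference gene -> sorted query orthologs of its neighbours.
    """
    genes = {g for edge in net_ref for g in edge}
    candidates = {}
    for gene in sorted(genes - set(ortholog_of)):
        neighbours = {b if a == gene else a for a, b in net_ref if gene in (a, b)}
        mapped = sorted(ortholog_of[n] for n in neighbours if n in ortholog_of)
        if mapped:
            candidates[gene] = mapped
    return candidates

# Hypothetical reference cascade R1 -> K1 -> T1 -> E1; the query species
# has known orthologs for every member except K1.
net_ref = {("R1", "K1"), ("K1", "T1"), ("T1", "E1")}
ortholog_of = {"R1": "r1", "T1": "t1", "E1": "e1"}

candidates = missing_network_members(net_ref, ortholog_of)
print(candidates)
```

In this example the gene "K1" is reported together with the query-species genes "r1" and "t1" that flank the gap, which suggests where in the cascade a candidate gene might be sought.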