Gene sequence and DNA sequence data bases are known. For example, see "GenBank as a research tool--a computer based database for use in DNA sequencing (conference abstract)", Burks et al., J. Cell. Biochem. (Suppl. 9B, 153) 1985. This publication discloses that GenBank, the national nucleotide sequence database, is a computer-based data bank of all published DNA and RNA sequences. The database is available on-line, on tape, and in hardcopy book form. As of September, 1984, the database contained close to 3.5 million nucleotides in over 4,000 entries. In addition to being a convenient reference for researchers interested in individual sequences, the database has been and is being designed to anticipate strategies that scan over many entries in search of similar features. Algorithms and software to accomplish these searches have been developed. The publication discusses sequence comparison algorithms (used for identifying local homology, consensus sequences and hairpin structures), prediction of protein coding regions, and correlation of primary sequence data with both secondary and tertiary structure and functional roles in the cell.
U.S. Pat. No. 4,675,283 to Roninson discloses a method for detecting and isolating DNA sequences commonly held by different DNA preparations or repeated or amplified within a complex genome. When two different DNA preparations are hybridized to each other using the methodology of Roninson, DNA fragments of identical electrophoretic mobility and having homologous sequences can be detected.
U.S. Pat. No. 4,820,630 to Taub discloses a method of prenatal diagnosis of sickle cell anemia using a polymorphic HpaI site not located in the beta-globin gene itself, but rather in an adjacent sequence. This method of analysis is indirect and suitable only in those cases where the parents at risk can be shown to have the appropriate linked polymorphism prior to amniocentesis. Column 4, lines 1-6 indicate that the invention of Taub is applicable to genetic disorders for which the locus of the lesion, and either the normal or mutated sequence about the lesion, are known or isolatable. In one embodiment of the method of Taub, a first labeled probe, complementary to a 5' region of a site of interest, and a second labeled probe, complementary to a 3' region of the site of interest, are utilized. If the site of interest constitutes a recognition/restriction site for an enzyme, digestion of the sampled DNA with that enzyme and hybridization of the restriction fragments to the labeled probes will separate the two labels and thus hinder their interaction to produce a signal.
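The dependence of the two-probe signal on the presence of a restriction site can be illustrated with a minimal Python sketch. It is not taken from the Taub patent. It assumes the HpaI recognition sequence GTTAAC with a blunt cut after the third base, and the function names, the `cut_offset` parameter, and the toy sequences are hypothetical.

```python
# Illustrative sketch only; names and toy sequences are assumptions.
HPAI_SITE = "GTTAAC"  # HpaI recognition sequence, cut assumed as GTT^AAC


def find_sites(dna, site=HPAI_SITE):
    """Return the start index of every (possibly overlapping) site occurrence."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def fragment_lengths(dna, site=HPAI_SITE, cut_offset=3):
    """Lengths of the fragments left after cutting at every site occurrence."""
    cuts = [p + cut_offset for p in find_sites(dna, site)]
    bounds = [0] + cuts + [len(dna)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


def probes_stay_linked(dna, site=HPAI_SITE):
    """True when no cut separates the 5' flank from the 3' flank."""
    return len(fragment_lengths(dna, site)) == 1


with_site = "AAAA" + "GTTAAC" + "CCCC"      # site intact: enzyme cuts
without_site = "AAAA" + "GTTGAC" + "CCCC"   # one base changed: site lost

print(fragment_lengths(with_site), probes_stay_linked(with_site))
print(fragment_lengths(without_site), probes_stay_linked(without_site))
```

When the site is intact the sequence is cut into two fragments, so probes bound on either side end up on different fragments; when a single base change abolishes the site, both probes remain on one fragment.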
U.S. Pat. No. 5,242,794 to Whiteley et al. discloses a method for diagnosis of genetic abnormalities or other genetic conditions which can be readily automated. The method is used to determine the presence or absence of a target sequence in a sample of denatured nucleic acid and entails hybridizing the sample with a probe complementary to a diagnostic portion of the target sequence, and with a probe complementary to a nucleotide sequence contiguous with the diagnostic portion under conditions wherein the diagnostic probe remains bound substantially only to the sample nucleic acid containing the target sequence. This patent relates to the detection of specific sequences of nucleotides in a variety of nucleic acid samples, and more particularly to those which contain a sequence characterized by a difference in a single base pair from a standard sequence.
U.S. Pat. No. 4,963,477 to Tchen issued Oct. 16, 1990. This patent relates to a kit for detecting the presence of a nucleic acid sequence, such as a gene or a gene fragment, in a composition or a specimen suspected of containing that gene. The kit comprises a probe containing a nucleic acid complementary to the nucleic acid sequence or gene which is sought to be detected.
U.S. Pat. No. 4,925,785 to Wang et al. is directed to a method for carrying out a nucleic acid hybridization test to detect a target nucleic acid sequence. The method uses a first single-stranded nucleic acid sequence which is bonded to a polymer and is capable of hybridizing to the target nucleic acid sequence. A denatured target nucleic acid sequence is contacted with the single-stranded nucleic acid sequence, and complexes between the target nucleic acid sequence and the single-stranded nucleic acid sequence are detected.
U.S. Pat. No. 5,208,144 to Smith et al. is directed to a method for the detection of human DNA which contains the gene encoding a low density lipoprotein receptor. The method involves three steps: isolating DNA that contains the gene encoding the low density lipoprotein receptor, contacting the DNA with DNA encoding gp330 for a time and under conditions sufficient for hybridization to occur, and detecting the hybridization of the DNA with the DNA encoding gp330, wherein the presence of hybridization indicates the presence of DNA which contains the gene encoding the low density lipoprotein receptor.
U.S. Pat. No. 4,863,857 to Blalock et al. discloses polypeptides complementary to peptides or proteins having an amino acid sequence or nucleotide coding sequence at least partially known. This patent allows for a determination of the structure of polypeptides having particular structural and biological activities and affinities. The patent does not disclose a step of data base scanning to identify genes of unknown function.
U.S. Pat. No. 5,077,195 to Blalock et al. relates to polypeptides complementary to peptides or proteins having an amino acid sequence or nucleotide coding sequence at least partially known. This patent allows for the production of polypeptides which are complementary to known proteinaceous hormones. The polypeptides are capable of binding to the hormones and can be utilized to render the complementary hormone inactive. The patent does not disclose a step of data base scanning to identify genes of unknown function.
The Biotechnology News, dated Apr. 8, 1994, discloses that hereditary, non-polyposis colorectal cancer genes can be detected by scanning the Human Genome Sciences (HGS) data base for human genes similar to a known bacterial gene containing a similar sequence to the non-polyposis colorectal cancer gene.
Science, Volume 263, Mar. 18, 1994, pages 1559-1560, discloses the use of MutL bacterial probes containing the non-polyposis colon cancer gene of mice to isolate the gene in humans. The human DNA data base of Human Genome Sciences was scanned and the human gene associated with the colon cancer was located.
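The kind of data base scan described in the two preceding reports, in which a known gene is used to find similar entries, can be sketched in Python as follows. This is a toy illustration under stated assumptions and not the search actually performed on the Human Genome Sciences data base: similarity is approximated by Jaccard overlap of k-mer sets, and the function names, the `k` and `threshold` parameters, and the example entries are all hypothetical.

```python
# Illustrative sketch only; scoring scheme and names are assumptions.
def kmers(seq, k=4):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def scan_database(query, database, k=4, threshold=0.3):
    """Return (entry name, score) pairs at or above threshold, best first."""
    query_kmers = kmers(query.upper(), k)
    hits = []
    for name, seq in database.items():
        score = jaccard(query_kmers, kmers(seq.upper(), k))
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical entries standing in for data base records.
database = {
    "entry_similar": "ACGTACGTACGGA",
    "entry_unrelated": "TTTTTTTTTTTTT",
}
print(scan_database("ACGTACGTAC", database))
```

A production search would use an alignment-based tool rather than set overlap; the sketch only shows the shape of the scan: score every entry against the query, keep those above a cutoff, and rank them.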
"Identifying potential tRNA genes in genomic DNA sequences", Fichant et al., (J. Mol. Biol. (ENGLAND) Aug. 5, 1991, 220 (3) pp. 659-71) discloses an algorithm that automatically and reproducibly identifies potential tRNA genes in genomic DNA sequences, and presents a general strategy for testing the sensitivity of such algorithms. This algorithm is useful for the flagging and characterization of long genomic sequences that have not been experimentally analyzed for identification of functional regions, and for the scanning of nucleotide sequence databases for errors in the sequences and the functional assignments associated with them.
"Genetic and molecular analyses of the C-terminal region of the recE gene from the rac prophage of Escherichia coli K-12 reveal the recT gene", Clark et al., J. Bacteriol. (UNITED STATES) December 1993, 175 (23) p. 7673-82A, discloses that a computer-performed scan of the bacteriophage nucleotide sequence data base of GenBank revealed substantial similarity between most of recE and a 2.5-kb portion of the b2 region of lambda. This suggests an evolutionary relationship between lambda and Rac prophages.
"Identification of an active gene by using large-scale cDNA sequencing", Itoh et al., Gene (NETHERLANDS) Mar. 25, 1994, 140 (2) p. 295-6, discloses that a 3'-directed partial cDNA clone that exactly matches a genomic sequence in GenBank was isolated while collecting transcribed sequences from adult lung by a random approach. This is a report of active gene identification on a genomic sequence without the aid of Northern hybridization.
"Characterization of the DNF15S2 locus on human chromosome 3: identification of a gene coding for four kringle domains with homology to hepatocyte growth factor". Han et al., Biochemistry (UNITED STATES) Oct. 8, 1991, 30 (40) p. 9768-80, discloses that the DNA sequence of the gene and cDNA and its translated amino acid sequence were compared against GenBank and NBRF databases. Sequences homologous to DNF15S1 and DNF15S2, human DNF15S2 lung mRNA, and rat acyl-peptide hydrolase were identified in exon 17 to the 3' end of the characterized sequence for the gene. From the results, it was apparent that the gene coding for human HGF-like protein is located at the DNF15S2 locus on human chromosome 3 (3p21). The gene for acyl-peptide hydrolase is 444 bp downstream of the gene coding for HGF-like protein, but on the complementary strand. The DNF15S2 locus has been proposed to code for one or more tumor suppressor genes since this locus is deleted in DNA from small cell lung carcinoma, other lung cancers, renal cell carcinoma, and von Hippel-Lindau syndrome.
"Statistical analysis of nucleotide sequences--DNA sequence, RNA sequence database scanning", Stueckle et al., Nucleic Acids Res. (18, 22, 6641-47) 1990. In order to scan nucleic acid databases for potentially relevant but as yet unknown signals, an improved statistical model for pattern analysis of nucleic acid sequences was developed by modifying previous methods based on Markov chains. The importance of selecting the appropriate parameters in order for the method to function is demonstrated. The method allows the simultaneous analysis of several short sequences with unequal base frequencies and Markov order k not equal to 0, as is usually the case in databases. As a test of these modifications, it was demonstrated that in 797 Escherichia coli sequences (total length 1.2 million bases) stored in the GenBank database, there was a bias against palindromic hexamers which correspond to known restriction enzyme recognition sites. Correct choice of Markov order k and threshold value alpha was essential for obtaining correct results. For oligonucleotides of length (L) greater than 4, the value of k should be 2. For values of L less than or equal to 4, k should be L-2.
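The rule for choosing the Markov order and the notion of a palindromic hexamer can be made concrete with a short Python sketch. The order rule follows the text above. The expected-count formula is a commonly used Markov estimator, supplied here as an assumption and not as the modified model of Stueckle et al. The function names and the toy sequence are hypothetical.

```python
# Illustrative sketch only; the estimator and names are assumptions.
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(oligo):
    """True when the oligo equals its own reverse complement (e.g. GAATTC)."""
    return oligo == reverse_complement(oligo)


def markov_order(length):
    """Order k per the rule above: k = 2 for L > 4, otherwise k = L - 2."""
    return 2 if length > 4 else length - 2


def count_words(sequences, size):
    """Count every overlapping word of the given size across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - size + 1):
            counts[seq[i:i + size]] += 1
    return counts


def expected_count(oligo, sequences, k):
    """Expected occurrences of oligo under an order-k Markov model.

    Uses the product of counts of its (k+1)-mers divided by the product of
    counts of its interior k-mers. Only defined here for 1 <= k < len(oligo).
    """
    if k < 1 or k >= len(oligo):
        raise ValueError("k must satisfy 1 <= k < len(oligo)")
    longer = count_words(sequences, k + 1)
    shorter = count_words(sequences, k)
    expected = 1.0
    for i in range(len(oligo) - k):
        expected *= longer[oligo[i:i + k + 1]]
    for i in range(1, len(oligo) - k):
        denominator = shorter[oligo[i:i + k]]
        if denominator == 0:
            return 0.0  # an absent k-mer means the oligo cannot occur
        expected /= denominator
    return expected


toy = ["ACGTACGT"]  # hypothetical stand-in for data base sequences
print(markov_order(6), is_palindromic("GAATTC"))
print(expected_count("ACG", toy, markov_order(3)))
```

Comparing such an expected count with the observed count for each palindromic hexamer is one way a bias against those hexamers could be detected; the threshold value alpha mentioned above would govern when a deviation is treated as significant.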
The prior work in this field has failed to provide a simple, accurate method for determining the function of unknown genes.
The present method provides a means of scanning data banks consisting of cloned genetic material, including but not limited to DNA, RNA, mRNA, tRNA and nucleotide fragments, to identify the function of genetic material of unknown function. The method is simple, accurate and rapid.