Plant hormones, produced in response to genetic, environmental or chemical stimuli (Goldberg, Science 240: 1460-1467 (1988); Letham, In: Phytohormones and Related Compounds—A Comprehensive Treatise, eds. Letham et al., Amsterdam, Elsevier North Holland. 1: 205-263 (1978); von Sachs, Arb. Bot. Inst. Wurzburg 2:452-488 (1880), all of which are herein incorporated by reference in their entirety), play a role in controlling the growth, development and environmental responses of plants.
Cytokinins are a class of plant hormones with a structure resembling adenine. Cytokinins, in combination with auxin, promote cell division. Cytokinins are associated with many aspects of plant growth and development (Horgan, Advanced Plant Physiology, ed. Wilkins, Pitman, London: 90-116 (1984); Skoog et al., Biochemical Actions of Hormones, ed. Litwack, Academic Press, London, vol. VI: 335-413 (1979), all of which are herein incorporated by reference in their entirety). Cytokinins have been found in almost all higher plants as well as mosses, fungi, and bacteria. In addition to occurring in higher plants as free compounds, cytokinins may also occur as component nucleosides in tRNA of plants, animals, and microorganisms.
Kinetin, the first cytokinin to be discovered, was so named because of its ability to promote cytokinesis (cell division). Although kinetin is a natural compound, it is not made in plants, and is therefore usually considered a “synthetic” cytokinin. Two common forms of cytokinin in plants are zeatin and zeatin riboside (maize)(Letham, Life Sci. 2: 569-573 (1963), the entirety of which is herein incorporated by reference). More than 200 known natural and synthetic cytokinins have been reported.
Several cytokinin-related mutations have also been reported. For example, the ckr1 mutant of Arabidopsis is resistant to the cytokinin benzyladenine (Su and Howell, Plant Physiol. 99:1569-1574 (1992), the entirety of which is herein incorporated by reference). The Arabidopsis mutant amp1 has been reported to be a negative regulator of cytokinin biosynthesis (Chaudhury et al., Plant J. 4:907-916 (1993), the entirety of which is herein incorporated by reference).
Cytokinin concentrations are highest in meristematic regions and areas of continuous growth potential such as roots, young leaves, developing fruits, and seeds (Arteca, Plant Growth Substances: Principles and Applications, eds. Chapman & Hall, New York (1996); Mauseth, Botany: An Introduction to Plant Biology, ed. Saunders, Philadelphia: 348-415 (1991); Raven et al., Biology of Plants, ed. Worth, N.Y.: 545-572 (1992); Salisbury and Ross, Plant Physiology, ed. Wadsworth, Belmont, Calif.: 357-407, 531-548 (1992), all of which are herein incorporated by reference in their entirety).
It has been reported that the induced cytokinin response varies depending on the type of cytokinin and plant species (Davies, Plant Hormones: Physiology, Biochemistry and Molecular Biology, Kluwer, Dordrecht (1995); Mauseth, Botany: An Introduction to Plant Biology, Saunders, Philadelphia: 348-415 (1991); Raven et al., Biology of Plants, ed. Worth, N.Y.: 545-572 (1992); Salisbury and Ross, Plant Physiology, ed. Wadsworth, Belmont, Calif.: 357-407, 531-548 (1992), all of which are herein incorporated by reference in their entirety). Elevated cytokinin levels are associated with the development of seeds in higher plants, and have been demonstrated to coincide with maximal mitotic activity in the endosperm of developing maize kernels, cereal grains, and fruits. Exogenous cytokinin application (via stem injection) has been shown to directly correlate with increased kernel yield in maize. In addition, plant cells transformed with the ipt gene from Agrobacterium tumefaciens showed increased growth corresponding to an increase in endogenous cytokinin levels upon induction of the enzyme. Cytokinins have been reported to confer thermotolerance in certain physiological processes such as plastid biogenesis and endosperm cell division (Cheikh and Jones, Plant Physiol. 106: 45-51 (1994); Parthier, Biochem. Physiol Pflanz 174:173-214 (1979); Jones et al., Crop Science 25: 830-834 (1985), all of which are herein incorporated by reference in their entirety).
Reviews of cytokinin metabolism, compartmentalization, conjugation and cytokinin metabolic enzymes have been presented by Jameson, Cytokinins, eds. Mok and Mok, Boca Raton, Fla., 113-128 (1994); Letham and Palni, Ann. Rev. Plant Physiol. 34: 163-197 (1983); McGaw et al. In: Biosynthesis and metabolism of plant hormones, Soc. Exp. Biol. Seminar Series, eds. Crozier and Hillman, Cambridge University Press, Cambridge, Vol. 23, chapter 5 (1984); McGaw and Horgan, Biol. Plant 27: 180 (1985); McGaw et al., In: Plant Hormones: Physiology, Biochemistry and Molecular Biology, ed. Davies, Kluwer, Dordrecht, 98-117 (1995); Mok and Martin, Cytokinins, eds. Mok and Mok, Boca Raton, Fla., 129-137 (1994); Salisbury and Ross, Plant Physiology, Belmont, Calif.: ed. Wadsworth, 357-407, 531-548 (1992), all of which are hereby incorporated by reference in their entirety.
I. Biosynthesis of Cytokinins
Cytokinins are generally found in higher concentrations in meristematic regions and growing tissues. It has been reported that cytokinins are synthesized in the roots and translocated via the xylem to the meristematic regions and growing shoots of the plant. Although cytokinin biosynthesis in developed plants takes place mainly in roots (Engelbrecht, Biochem. Physiol. Pflanzen 163: 335-343 (1972); Henson et al., J. Exp. Bot 27: 1268-1278 (1976); Sossountzov et al., Planta 175: 291-304 (1988); Van Staden et al., Ann. Bot. 42: 751-753 (1978), all of which are herein incorporated by reference in their entirety), smaller amounts can be synthesized by the shoot apex and some other plant tissues.
The level of active cytokinin at a particular site of action has been reported to be influenced by a large number of factors: de novo synthesis; oxidative degradation; reduction; formation and hydrolysis of inactive conjugates; transport into and out of particular cells; subcellular compartmentalization to or away from sites of action. It has also been reported that physiological responses may be modulated by variations in the ability of cells to respond to a particular concentration of free cytokinin.
Cytokinin biosynthesis happens through the biochemical modification of adenine (McGaw et al., In: Plant Hormones: Physiology, Biochemistry and Molecular Biology, ed. Davies, Kluwer, Dordrecht: 98-117 (1995), the entirety of which is herein incorporated by reference; Salisbury and Ross, Plant Physiology, Belmont, Calif.: ed. Wadsworth, 357-407, 531-548 (1992), the entirety of which is herein incorporated by reference). Plants appear to synthesize cytokinins either directly by addition of isopentenylpyrophosphate to AMP by an adenylate:isopentenyltransferase (cytokinin synthase) producing isopentenyladenosine 5′ phosphate (“[9R-5′P]iP”), which in turn serves as an intermediate for further modifications, or indirectly via isopentenylation of adenosine residues of tRNA by tRNA:isopentenyltransferase (McGaw et al., In: Plant Hormones: Physiology, Biochemistry and Molecular Biology, ed. Davies, Kluwer, Dordrecht: 98-117 (1995)). [9R-5′P]iP may be modified by dephosphorylation, deribosylation, hydroxylation and reduction to produce a variety of derivatives with potential activity (Binns, Annu. Rev. Plant Physiol. Plant Mol. Biol. 45: 173-196 (1994), the entirety of which is herein incorporated by reference). Further, conjugation may modulate levels of active cytokinins (Letham and Palni, Ann. Rev. Plant Physiol. 34: 163-197 (1983), the entirety of which is herein incorporated by reference).
In the biosynthesis of tRNA cytokinins, mevalonic acid pyrophosphate undergoes decarboxylation, dehydration and isomerization to yield Δ2-isopentenyl pyrophosphate (“iPP”). iPP then condenses with the relevant adenosine residue in the tRNA to give the N6(Δ2-isopentenyl)adenosine (“[9R]iP”) moiety. With the exception of [9R]iP and to a lesser extent cis- and trans-[9R]Z, the free and tRNA cytokinins are structurally distinct (e.g., free Zeatin (“Z”) is mainly the trans isomer (trans-Zeatin), while Z present in tRNA is mainly the cis isomer) (McGaw et al., In: Plant Hormones: Physiology, Biochemistry and Molecular Biology, ed. Davies, Kluwer, Dordrecht, 98-117 (1995)).
The de novo biosynthesis pathway of cytokinins in plants includes the following enzymes: isopentenyltransferase, 5′-nucleotidase, adenine nucleotidase, adenine phosphorylase, adenine kinase, adenine phosphoribosyl transferase, microsomal mixed function oxidases, Zeatin reductase, O-glucosyltransferase, O-xylosyltransferase, β-(9-cytokinin-alanino)synthase, cytokinin oxidase, β-glucosidase, and Zeatin cis-trans isomerase.
Isopentenyltransferase catalyzes the first reaction of the pathway in which N6(Δ2-isopentenyl) adenosine-5′-monophosphate (“[9R-5′P]iP”) is generated from iPP and AMP.
5′-nucleotidase catalyzes the conversion of [9R-5′P]iP to [9R]iP. The reaction catalyzed by the enzyme 5′-nucleotidase has been found in wheat germ extract (Chen et al., Plant Physiol. 67:494-498 (1981); Chen et al., Plant Physiol. 68:1020-1023 (1981), both of which are herein incorporated by reference in their entirety) and in tomato leaf and root extracts (Burch and Stuchbury, Phytochemistry 25:2445-2449 (1986); Burch and Stuchbury, J. Plant Physiol. 125:267-273 (1986), both of which are herein incorporated by reference in their entirety). Adenine kinase catalyzes the reversion of [9R]iP to [9R-5′P]iP. Alternatively, [9R-5′P]iP can be converted to t-Zeatin riboside-5′-monophosphate (“[9R-5′P]Z”) by a microsomal mixed function oxidase.
Adenosine nucleotidase catalyzes the conversion of [9R]iP to iP. This reaction can be reversed by the enzyme adenine phosphorylase. Alternatively, [9R]iP can be converted to t-Zeatin riboside (“[9R]Z”) by a microsomal mixed function oxidase. Under another reaction mechanism, adenosine can be cleaved from [9R]iP by cytokinin oxidase. The enzyme adenine phosphoribosyl transferase can catalyze the conversion of iP to [9R-5′P]iP. Adenine phosphoribosyl transferase, which is one of the salvage routes in plants for converting adenosine to AMP, has also been shown to catalyze the phosphoribosylation of cytokinin bases from a number of plant sources, including wheat germ (Chen et al., Arch. Biochem. Biophys. 214:634-641 (1982), the entirety of which is herein incorporated by reference), tomato (Burch et al., Physiol. Plant 69:283-288 (1987), the entirety of which is herein incorporated by reference), A. thaliana (Moffatt et al., Plant Physiol 95:900-908 (1991), the entirety of which is herein incorporated by reference) and Acer pseudoplatanus (Doree and Guern, Biochim. Biophys. Acta 304:611-622 (1973); Sadorge et al., Physiol. Veg. 8:499-514 (1970), both of which are herein incorporated by reference in their entirety).
The cytokinins N6(Δ2-isopentenyl) adenosine-7-glucoside (“[7G]iP”) and N6(Δ2-isopentenyl) adenosine-9-glucoside (“[9G]iP”) are generated from iP by the enzymes Zeatin reductase and O-glucosyltransferase (such as cytokinin-9-glucosyl transferase), respectively. Under another reaction mechanism, adenine can be cleaved from iP by cytokinin oxidase.
In addition to converting [9R-5′P]iP to [9R]iP, 5′-nucleotidase can also catalyze the conversion of [9R-5′P]Z to [9R]Z. Adenine kinase can catalyze the conversion of [9R]Z to [9R-5′P]Z.
O-glucosyltransferase catalyzes the conversion of [9R]Z to t-Zeatin riboside-O-glucoside (“(OG)[9R]Z”). O-glucosyltransferase can also remove the glucoside group from (OG)[9R]Z to regenerate [9R]Z. Adenosine can be cleaved from [9R]Z by cytokinin oxidase. Alternatively, adenine nucleotidase can convert [9R]Z to Z. Adenine phosphorylase can catalyze the conversion of Z back into [9R]Z.
The cytokinins dihydroZeatin (“(diH)Z”), Zeatin-7-glucoside (“[7G]Z”), Zeatin-9-glucoside (“[9G]Z”), and lupinic acid (“[9Ala]Z”) are generated from Z by the enzymes Zeatin reductase, O-glucosyltransferase, Zeatin reductase and β-(9-cytokinin alanino) synthase, respectively. Zeatin cis-trans isomerase catalyzes the isomerization of Zeatin between its cis and trans isomers. O-glucosyltransferase catalyzes the addition of a glucoside residue to Z to form t-Zeatin-O-glucoside (“(OG)Z”) or removal of a glucoside residue from (OG)Z to form Z.
The cytokinins dihydroZeatin-9-glucoside (“(diH)[9G]Z”), dihydroZeatin-7-glucoside (“(diH)[7G]Z”), and dihydrolupinic acid (“(diH)[9Ala]Z”) are generated from (diH)Z by the enzymes β-(9-cytokinin alanino)synthase, Zeatin reductase, and O-glucosyltransferase, respectively. O-glucosyltransferase catalyzes the addition of a glucoside residue to (diH)Z to form dihydroZeatin-O-glucoside (“(diHOG)Z”) or removal of a glucoside residue from (diHOG)Z to form (diH)Z. Alternatively, (diH)Z can be converted into dihydroZeatin riboside (“(diH)[9R]Z”) by adenine phosphorylase. The enzyme adenine nucleotidase can catalyze the conversion of (diH)[9R]Z to (diH)Z.
O-glucosyltransferase catalyzes the addition of a glucoside residue to (diH)[9R]Z to form t-dihydroZeatin riboside-O-glucoside (“(diHOG)[9R]Z”) or the removal of a glucoside residue from (diHOG)[9R]Z to form (diH)[9R]Z. The cytokinin dihydroZeatin riboside-5′-monophosphate (“(diH)[9R-5′P]Z”) is generated from (diH)[9R]Z by the enzyme adenine kinase. This reaction can be reversed by the enzyme 5′-nucleotidase.
It is understood that the above description of the de novo biosynthesis of cytokinins only describes the core of the biosynthesis pathway. Other enzymes have been reported to be involved in this pathway.
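The interconversions described above can be read as a directed graph in which compounds are nodes and enzymes label the edges. The following is a minimal sketch, assuming only a subset of the reactions named in the text; the list `REACTIONS` and the function `find_route` are hypothetical names introduced for illustration, and the enzyme names are copied as they appear in this description rather than checked against an enzyme nomenclature database.

```python
from collections import deque

# (substrate, product, enzyme) triples transcribed from the description above.
# This is an illustrative subset, not a complete or authoritative pathway map.
REACTIONS = [
    ("[9R-5′P]iP", "[9R]iP", "5′-nucleotidase"),
    ("[9R]iP", "[9R-5′P]iP", "adenine kinase"),
    ("[9R-5′P]iP", "[9R-5′P]Z", "microsomal mixed function oxidase"),
    ("[9R]iP", "iP", "adenosine nucleotidase"),
    ("iP", "[9R]iP", "adenine phosphorylase"),
    ("iP", "[9R-5′P]iP", "adenine phosphoribosyl transferase"),
    ("[9R]iP", "[9R]Z", "microsomal mixed function oxidase"),
    ("[9R-5′P]Z", "[9R]Z", "5′-nucleotidase"),
    ("[9R]Z", "[9R-5′P]Z", "adenine kinase"),
    ("[9R]Z", "Z", "adenine nucleotidase"),
    ("Z", "[9R]Z", "adenine phosphorylase"),
    ("Z", "(diH)Z", "Zeatin reductase"),
    ("(diH)Z", "(diH)[9R]Z", "adenine phosphorylase"),
    ("(diH)[9R]Z", "(diH)Z", "adenine nucleotidase"),
]


def find_route(start, goal):
    """Breadth-first search for a shortest enzyme route from start to goal.

    Returns a list of (enzyme, product) steps, or None if no route exists
    within the reactions listed in REACTIONS.
    """
    graph = {}
    for substrate, product, enzyme in REACTIONS:
        graph.setdefault(substrate, []).append((product, enzyme))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        compound, route = queue.popleft()
        if compound == goal:
            return route
        for product, enzyme in graph.get(compound, []):
            if product not in seen:
                seen.add(product)
                queue.append((product, route + [(enzyme, product)]))
    return None


if __name__ == "__main__":
    for enzyme, product in find_route("[9R-5′P]iP", "Z"):
        print(f"{enzyme} -> {product}")
```

Because several routes of equal length may exist (for example through [9R]iP or through [9R-5′P]Z), the route returned depends on the order of the entries in `REACTIONS`.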
Active cytokinins can be inactivated by degradation or conjugation to different low-molecular-weight metabolites, such as sugars and amino acids. The enzyme cytokinin oxidase plays a role in the degradation of cytokinins. This enzyme removes the side chain and releases adenine, the backbone of all cytokinins. Cytokinin oxidases are reported to remove cytokinins from plant cells after cell division. Cytokinin derivatives are also made.
β-glucosidase (EC 3.2.1.21) has been reported to cleave the biologically inactive hormone conjugates of cytokinin-O-glucoside to release the active cytokinin (Brzobohaty et al., Science 262:1051-1054 (1993); Campos et al., Plant J. 2:675-684 (1992), both of which are herein incorporated by reference in their entirety). β-glucosidase catalyzes the hydrolysis of aryl and alkyl β-D-glucosides and/or cellobiose with the release of β-D-glucose (Reese, Recent Adv. Phytochem. 11:311 (1977), the entirety of which is herein incorporated by reference). The enzyme has been purified from maize and has a molecular weight of 60 kD (Esen, Plant Physiol. 98:174-182 (1992); Esen et al., Biochem. Genet. 28:319-336 (1990), both of which are herein incorporated by reference). Esen et al. have identified the rolC gene of Agrobacterium rhizogenes, which encodes a cytokinin β-glucosidase and which affects the growth and development of transgenic plants (Esen et al., EMBO J. 10:2889-2895 (1991), the entirety of which is herein incorporated by reference).
Conjugation is often reported as a way of removing free and active hormones from a tissue. The conjugation process is often reversible, and, as conjugates can frequently accumulate in excess of free forms of phytohormone, the conjugate pools are also considered as sources of free hormone and may represent storage or inactive transportable forms of the hormone.
II. Expressed Sequence Tag Nucleic Acid Molecules
Expressed sequence tags, or ESTs, are randomly sequenced members of a cDNA (complementary DNA) library (McCombie et al., Nature Genetics 1:124-130 (1992); Kurata et al., Nature Genetics 8:365-372 (1994); Okubo et al., Nature Genetics 2:173-179 (1992), all of which references are incorporated herein in their entirety). The randomly selected clones comprise inserts that can represent a copy of up to the full length of a mRNA transcript.
Using conventional methodologies, cDNA libraries can be constructed from the mRNA (messenger RNA) of a given tissue or organism using poly dT primers and reverse transcriptase (Efstratiadis et al., Cell 7:279-288 (1976), the entirety of which is herein incorporated by reference; Higuchi et al., Proc. Natl. Acad. Sci. (U.S.A.) 73:3146-3150 (1976), the entirety of which is herein incorporated by reference; Maniatis et al., Cell 8:163-182 (1976) the entirety of which is herein incorporated by reference; Land et al., Nucleic Acids Res. 9:2251-2266 (1981), the entirety of which is herein incorporated by reference; Okayama et al., Mol. Cell. Biol. 2:161-170 (1982), the entirety of which is herein incorporated by reference; Gubler et al., Gene 25:263-269 (1983), the entirety of which is herein incorporated by reference).
Several methods may be employed to obtain full-length cDNA constructs. For example, terminal transferase can be used to add homopolymeric tails of dC residues to the free 3′ hydroxyl groups (Land et al., Nucleic Acids Res. 9:2251-2266 (1981), the entirety of which is herein incorporated by reference). This tail can then be hybridized by a poly dG oligo which can act as a primer for the synthesis of full length second strand cDNA. Okayama and Berg, Mol. Cell. Biol. 2:161-170 (1982), the entirety of which is herein incorporated by reference, report a method for obtaining full length cDNA constructs. This method has been simplified by using synthetic primer-adapters that have both homopolymeric tails for priming the synthesis of the first and second strands and restriction sites for cloning into plasmids (Coleclough et al., Gene 34:305-314 (1985), the entirety of which is herein incorporated by reference) and bacteriophage vectors (Krawinkel et al., Nucleic Acids Res. 14:1913 (1986), the entirety of which is herein incorporated by reference; Han et al., Nucleic Acids Res. 15:6304 (1987), the entirety of which is herein incorporated by reference).
These strategies have been coupled with additional strategies for isolating rare mRNA populations. For example, a typical mammalian cell contains between 10,000 and 30,000 different mRNA sequences (Davidson, Gene Activity in Early Development, 2nd ed., Academic Press, New York (1976), the entirety of which is herein incorporated by reference). The number of clones required to achieve a given probability that a low-abundance mRNA will be present in a cDNA library is N=(ln(1−P))/(ln(1−1/n)) where N is the number of clones required, P is the probability desired and 1/n is the fractional proportion of the total mRNA that is represented by a single rare mRNA (Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd ed., Cold Spring Harbor Laboratory Press (1989), the entirety of which is herein incorporated by reference).
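The clone-number relationship above can be evaluated directly. The following is a minimal sketch; the function name `clones_required` and the example values (a 99% probability and a rare mRNA making up 1 part in 37,000 of the total) are assumptions chosen for illustration and are not taken from this description.

```python
import math


def clones_required(p, n):
    """Return N = ln(1 - P) / ln(1 - 1/n), rounded up to a whole clone.

    p: desired probability (0 < p < 1) that the rare mRNA is represented.
    n: reciprocal of the fractional abundance, i.e. the rare mRNA makes up
       1/n of the total mRNA (n > 1).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if n <= 1:
        raise ValueError("n must be greater than 1")
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - 1.0 / n))


if __name__ == "__main__":
    # Illustrative values only: P = 0.99, rare mRNA at 1/37,000 of total mRNA.
    print(clones_required(0.99, 37000))
```

For small fractional abundances ln(1 − 1/n) is close to −1/n, so the required library size grows roughly in proportion to n for a fixed probability P.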
A method to enrich preparations of mRNA for sequences of interest is to fractionate by size. One such method is to fractionate by electrophoresis through an agarose gel (Pennica et al., Nature 301:214-221 (1983), the entirety of which is herein incorporated by reference). Another such method employs sucrose gradient centrifugation in the presence of an agent, such as methylmercuric hydroxide, that denatures secondary structure in RNA (Schweinfest et al., Proc. Natl. Acad. Sci. (U.S.A.) 79:4997-5000 (1982), the entirety of which is herein incorporated by reference).
A frequently adopted method is to construct equalized or normalized cDNA libraries (Ko, Nucleic Acids Res. 18:5705-5711 (1990), the entirety of which is herein incorporated by reference; Patanjali et al., Proc. Natl. Acad. Sci. (U.S.A.) 88:1943-1947 (1991), the entirety of which is herein incorporated by reference). Typically, the cDNA population is normalized by subtractive hybridization (Schmid et al., J. Neurochem. 48:307-312 (1987), the entirety of which is herein incorporated by reference; Fargnoli et al., Anal. Biochem. 187:364-373 (1990), the entirety of which is herein incorporated by reference; Travis et al., Proc. Natl. Acad. Sci. (U.S.A.) 85:1696-1700 (1988), the entirety of which is herein incorporated by reference; Kato, Eur. J. Neurosci. 2:704-711 (1990); and Schweinfest et al., Genet. Anal. Tech. Appl. 7:64-70 (1990), the entirety of which is herein incorporated by reference). Subtraction represents another method for reducing the population of certain sequences in the cDNA library (Swaroop et al., Nucleic Acids Res. 19:1954 (1991), the entirety of which is herein incorporated by reference).
ESTs can be sequenced by a number of methods. Two basic methods may be used for DNA sequencing, the chain termination method of Sanger et al., Proc. Natl. Acad. Sci. (U.S.A.) 74:5463-5467 (1977), the entirety of which is herein incorporated by reference and the chemical degradation method of Maxam and Gilbert, Proc. Nat. Acad. Sci. (U.S.A.) 74:560-564 (1977), the entirety of which is herein incorporated by reference. Automation and advances in technology such as the replacement of radioisotopes with fluorescence-based sequencing have reduced the effort required to sequence DNA (Craxton, Methods 2:20-26 (1991), the entirety of which is herein incorporated by reference; Ju et al., Proc. Natl. Acad. Sci. (U.S.A.) 92:4347-4351 (1995), the entirety of which is herein incorporated by reference; Tabor and Richardson, Proc. Natl. Acad. Sci. (U.S.A.) 92:6339-6343 (1995), the entirety of which is herein incorporated by reference). Automated sequencers are available from, for example, Pharmacia Biotech, Inc., Piscataway, N.J. (Pharmacia ALF), LI-COR, Inc., Lincoln, Nebr. (LI-COR 4,000) and Millipore, Bedford, Mass. (Millipore BaseStation).
In addition, advances in capillary gel electrophoresis have also reduced the effort required to sequence DNA and such advances provide a rapid high resolution approach for sequencing DNA samples (Swerdlow and Gesteland, Nucleic Acids Res. 18:1415-1419 (1990); Smith, Nature 349:812-813 (1991); Luckey et al., Methods Enzymol. 218:154-172 (1993); Lu et al., J. Chromatog. A. 680:497-501 (1994); Carson et al., Anal. Chem. 65:3219-3226 (1993); Huang et al., Anal. Chem. 64:2149-2154 (1992); Kheterpal et al., Electrophoresis 17:1852-1859 (1996); Quesada and Zhang, Electrophoresis 17:1841-1851 (1996); Baba, Yakugaku Zasshi 117:265-281 (1997), all of which are herein incorporated by reference in their entirety).
ESTs longer than 150 nucleotides have been found to be useful for similarity searches and mapping (Adams et al., Science 252:1651-1656 (1991), herein incorporated by reference). ESTs, which can represent copies of up to the full length transcript, may be partially or completely sequenced. Between 150 and 450 nucleotides of sequence information are usually generated, as this is the length of sequence information that is routinely and reliably produced using single run sequence data. Typically, only single run sequence data is obtained from the cDNA library (Adams et al., Science 252:1651-1656 (1991)). Automated single run sequencing typically results in an approximately 2-3% error or base ambiguity rate (Boguski et al., Nature Genetics 4:332-333 (1993), the entirety of which is herein incorporated by reference).
EST databases have been constructed or partially constructed from, for example, C. elegans (McCombie et al., Nature Genetics 1:124-131 (1992)), human liver cell line HepG2 (Okubo et al., Nature Genetics 2:173-179 (1992)), human brain RNA (Adams et al., Science 252:1651-1656 (1991); Adams et al., Nature 355:632-635 (1992)), Arabidopsis (Newman et al., Plant Physiol. 106:1241-1255 (1994)), and rice (Kurata et al., Nature Genetics 8:365-372 (1994)).
III. Sequence Comparisons
A characteristic feature of a DNA sequence is that it can be compared with other DNA sequences. Sequence comparisons can be undertaken by determining the similarity of the test or query sequence with sequences in publicly available or proprietary databases (“similarity analysis”) or by searching for certain motifs (“intrinsic sequence analysis”) (e.g. cis elements) (Coulson, Trends in Biotechnology 12:76-80 (1994), the entirety of which is herein incorporated by reference; Birren et al., Genome Analysis 1: Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 543-559 (1997), the entirety of which is herein incorporated by reference).
Similarity analysis includes database search and alignment. Examples of public databases include the DNA Database of Japan (DDBJ) (www-ddbj.nig.ac.jp/); GenBank (www-ncbi.nlm.nih.gov/Web/Search/Index.html); and the European Molecular Biology Laboratory Nucleic Acid Sequence Database (EMBL) (www-ebi.ac.uk/ebi_docs/embl_db/embl_db.html). Other appropriate databases include dbEST (www-ncbi.nlm.nih.gov/dbEST/index.html), SwissProt (www-ebi.ac.uk/ebi_docs/swisprot13db/swisshome.html), PIR (www-nbrt.georgetown.edu/pir/) and The Institute for Genomic Research (www-tigr.org/tdb/tdb.html).
A number of different search algorithms have been developed, one example of which is the suite of programs referred to as BLAST programs. There are five implementations of BLAST, three designed for nucleotide sequence queries (BLASTN, BLASTX and TBLASTX) and two designed for protein sequence queries (BLASTP and TBLASTN) (Coulson, Trends in Biotechnology 12:76-80 (1994); Birren et al., Genome Analysis 1, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y. 543-559 (1997)).
BLASTN takes a nucleotide sequence (the query sequence) and its reverse complement and searches them against a nucleotide sequence database. BLASTN was designed for speed, not maximum sensitivity and may not find distantly related coding sequences. BLASTX takes a nucleotide sequence, translates it in three forward reading frames and three reverse complement reading frames and then compares the six translations against a protein sequence database. BLASTX is useful for sensitive analysis of preliminary (single-pass) sequence data and is tolerant of sequencing errors (Gish and States, Nature Genetics 3:266-272 (1993), the entirety of which is herein incorporated by reference). BLASTN and BLASTX may be used in concert for analyzing EST data (Coulson, Trends in Biotechnology 12:76-80 (1994); Birren et al., Genome Analysis 1:543-559 (1997)).
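The six-frame conceptual translation described for BLASTX can be sketched as follows. This is an illustration of the translation step only, assuming the standard genetic code and an unambiguous A/C/G/T query; it is not BLAST code, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return the three forward and three reverse-complement translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six translations would then be compared against the protein database; a single-base sequencing error that shifts the reading frame leaves the other frames available for matching, which is one way to understand the tolerance to errors noted above.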
Given a coding nucleotide sequence and the protein it encodes, it is often preferable to use the protein as the query sequence to search a database because of the greatly increased sensitivity to detect more subtle relationships. This is due to the larger alphabet of proteins (20 amino acids) compared with the alphabet of nucleic acid sequences (4 bases), where it is far easier to obtain a match by chance. In addition, with nucleotide alignments, only a match (positive score) or a mismatch (negative score) is obtained, but with proteins, the presence of conservative amino acid substitutions can be taken into account. Here, a mismatch may yield a positive score if the non-identical residue has physical/chemical properties similar to the one it replaced. Various scoring matrices are used to supply the substitution scores of all possible amino acid pairs. A general purpose scoring system is the BLOSUM62 matrix (Henikoff and Henikoff, Proteins 17:49-61 (1993), the entirety of which is herein incorporated by reference), which is currently the default choice for BLAST programs. BLOSUM62 is tailored for alignments of moderately diverged sequences and thus may not yield the best results under all conditions. Altschul, J. Mol. Evol. 36:290-300 (1993), the entirety of which is herein incorporated by reference, describes a combination of three matrices to cover all contingencies. This may improve sensitivity, but at the expense of slower searches. In practice, a single BLOSUM62 matrix is often used but others (PAM40 and PAM250) may be attempted when additional analysis is necessary. Low PAM matrices are directed at detecting very strong but localized sequence similarities, whereas high PAM matrices are directed at detecting long but weak alignments between very distantly related sequences.
Homologues in other organisms are available that can be used for comparative sequence analysis. Multiple alignments are performed to study similarities and differences in a group of related sequences. CLUSTAL W is a multiple sequence alignment package that performs progressive multiple sequence alignments based on the method of Feng and Doolittle, J. Mol. Evol. 25: 351-360 (1987), the entirety of which is herein incorporated by reference. Each pair of sequences is aligned and the distance between each pair is calculated; from this distance matrix, a guide tree is calculated, and all of the sequences are progressively aligned based on this tree. A feature of the program is its sensitivity to the effect of gaps on the alignment; gap penalties are varied to encourage the insertion of gaps in probable loop regions instead of in the middle of structured regions. Users can specify gap penalties, choose between a number of scoring matrices, or supply their own scoring matrix for both the pairwise alignments and the multiple alignments. CLUSTAL W for UNIX and VMS systems is available by ftp at: ebi.ac.uk. Another program is MACAW (Schuler et al., Proteins, Struct. Funct. Genet. 9:180-190 (1991), the entirety of which is herein incorporated by reference), for which both Macintosh and Microsoft Windows versions are available. MACAW uses a graphical interface, provides a choice of several alignment algorithms, and is available by anonymous ftp at: ncbi.nlm.nih.gov (directory /pub/macaw).
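The difference between identity-only scoring and substitution-matrix scoring can be illustrated with a short sketch. The +1/−1 identity scheme is an assumption chosen for illustration, and the dictionary below holds only a handful of entries excerpted from BLOSUM62 (identities for L, I, K, R, D, E and the conservative pairs L/I, K/R, D/E); the values should be checked against the published matrix before any real use.

```python
# Small excerpt of BLOSUM62 scores; illustrative only, not the full matrix.
BLOSUM62_EXCERPT = {
    ("L", "L"): 4, ("I", "I"): 4, ("K", "K"): 5,
    ("R", "R"): 5, ("D", "D"): 6, ("E", "E"): 5,
    ("L", "I"): 2, ("K", "R"): 2, ("D", "E"): 2,
}


def pair_score(a, b):
    """Look up a symmetric substitution score for residues a and b."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    if (b, a) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(b, a)]
    raise KeyError(f"pair {a}/{b} is not in the excerpt")


def identity_score(seq1, seq2, match=1, mismatch=-1):
    """Score an ungapped alignment counting only matches and mismatches."""
    return sum(match if a == b else mismatch for a, b in zip(seq1, seq2))


def substitution_score(seq1, seq2):
    """Score an ungapped alignment using the substitution-matrix excerpt."""
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


if __name__ == "__main__":
    # No identical positions, but every substitution is conservative.
    print(identity_score("LKD", "IRE"))       # match/mismatch view
    print(substitution_score("LKD", "IRE"))   # substitution-matrix view
    # Chance of an identical residue at one position under a uniform alphabet.
    print(1 / 4, 1 / 20)
```

Under identity-only scoring the two peptides look unrelated, whereas the substitution scores are all positive, which is the effect the paragraph above attributes to conservative replacements.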
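The first two stages of a progressive alignment (pairwise distances, then a guide tree) can be sketched as below. This is a simplified illustration under stated assumptions: the input sequences are taken to be of equal length so that the distance is just the fraction of differing positions, and the tree is built by simple average-linkage clustering, which is a stand-in and not necessarily the tree-building method used by CLUSTAL W. The function names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Build a nested-tuple guide tree by average-linkage clustering.

    seqs: dict mapping sequence name -> sequence string.
    """
    # Each cluster is keyed by its (sub)tree and stores its member names.
    clusters = {name: [name] for name in seqs}
    dist = {
        frozenset((a, b)): p_distance(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }

    def cluster_distance(members1, members2):
        total = sum(dist[frozenset((a, b))] for a in members1 for b in members2)
        return total / (len(members1) * len(members2))

    while len(clusters) > 1:
        # Join the two closest clusters; ties go to the first pair encountered.
        t1, t2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        members = clusters.pop(t1) + clusters.pop(t2)
        clusters[(t1, t2)] = members
    return next(iter(clusters))


if __name__ == "__main__":
    example = {
        "A": "ACGTACGT",
        "B": "ACGTACGA",
        "C": "TTGTACCA",
        "D": "TTGTTCCA",
    }
    print(guide_tree(example))
```

In a full progressive alignment the sequences would then be aligned in the order given by the tree, closest pairs first, with gap penalties applied as described above.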
Sequence motifs are derived from multiple alignments and can be used to examine individual sequences or an entire database for subtle patterns. With motifs, it is sometimes possible to detect distant relationships that may not be demonstrable based on comparisons of primary sequences alone. Currently, the largest collection of sequence motifs in the world is PROSITE (Bairoch and Bucher, Nucleic Acid Research 22:3583-3589 (1994), the entirety of which is herein incorporated by reference). PROSITE may be accessed via either the ExPASy server on the World Wide Web or anonymous ftp site. Many commercial sequence analysis packages also provide search programs that use PROSITE data.
A resource for searching protein motifs is the BLOCKS E-mail server developed by Henikoff, Trends Biochem Sci. 18:267-268 (1993), the entirety of which is herein incorporated by reference; Henikoff and Henikoff, Nucleic Acid Research 19:6565-6572 (1991), the entirety of which is herein incorporated by reference; Henikoff and Henikoff, Proteins 17:49-61 (1993). BLOCKS searches a protein or nucleotide sequence against a database of protein motifs or “blocks.” Blocks are defined as short, ungapped multiple alignments that represent highly conserved protein patterns. The blocks themselves are derived from entries in PROSITE as well as other sources. Either a protein query or a nucleotide query can be submitted to the BLOCKS server; if a nucleotide sequence is submitted, the sequence is translated in all six reading frames and motifs are sought for these conceptual translations. Once the search is completed, the server will return a ranked list of significant matches, along with an alignment of the query sequence to the matched BLOCKS entries.
Conserved protein domains can be represented by two-dimensional matrices, which measure either the frequency or probability of the occurrences of each amino acid residue and deletions or insertions in each position of the domain. This type of model, when used to search against protein databases, is sensitive and usually yields more accurate results than simple motif searches. Two popular implementations of this approach are profile searches such as GCG program ProfileSearch and Hidden Markov Models (HMMs) (Krogh et al., J. Mol. Biol. 235:1501-1531, (1994); Eddy, Current Opinion in Structural Biology 6:361-365, (1996), both of which are herein incorporated by reference in their entirety). In both cases, a large number of common protein domains have been converted into profiles, as present in the PROSITE library, or HMM models, as in the Pfam protein domain library (Sonnhammer et al., Proteins 28:405-420 (1997), the entirety of which is herein incorporated by reference). Pfam contains more than 500 HMM models for enzymes, transcription factors, signal transduction molecules and structural proteins. Protein databases can be queried with these profiles or HMM models, which will identify proteins containing the domain of interest. For example, HMMSW or HMMFS, two programs in a public domain package called HMMER (Sonnhammer et al., Proteins 28:405-420 (1997)) can be used.
PROSITE and BLOCKS represent collected families of protein motifs. Thus, searching these databases entails submitting a single sequence to determine whether or not that sequence is similar to the members of an established family. Programs working in the opposite direction compare a collection of sequences with individual entries in the protein databases. An example of such a program is the Motif Search Tool, or MoST (Tatusov et al., Proc. Natl. Acad. Sci. (U.S.A.) 91:12091-12095 (1994), the entirety of which is herein incorporated by reference). On the basis of an aligned set of input sequences, a weight matrix is calculated by using one of four methods (selected by the user). A weight matrix is simply a representation, position by position of how likely a particular amino acid will appear. The calculated weight matrix is then used to search the databases. To increase sensitivity, newly found sequences are added to the original data set, the weight matrix is recalculated and the search is performed again. This procedure continues until no new sequences are found.
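The iterative weight-matrix search described above can be sketched as follows. This is not the MoST program or any of its four weighting methods; it is a simplified illustration that assumes ungapped motif instances of equal width, a uniform background over the 20 amino acids, a fixed pseudocount, and a user-chosen score threshold. All function names and parameter values are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def weight_matrix(aligned, pseudocount=1.0):
    """Position-by-position log-odds scores from ungapped aligned segments."""
    width = len(aligned[0])
    background = 1.0 / len(ALPHABET)
    matrix = []
    for pos in range(width):
        column = [segment[pos] for segment in aligned]
        total = len(column) + pseudocount * len(ALPHABET)
        matrix.append({
            aa: math.log(((column.count(aa) + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return matrix


def score(matrix, segment):
    """Sum the per-position scores of a segment as wide as the matrix."""
    return sum(column[aa] for column, aa in zip(matrix, segment))


def iterative_search(seed, database, threshold, max_rounds=10):
    """Search, add new hits to the data set, rebuild the matrix, and repeat."""
    found = list(seed)
    for _ in range(max_rounds):
        matrix = weight_matrix(found)
        width = len(matrix)
        new_hits = []
        for sequence in database:
            for start in range(len(sequence) - width + 1):
                segment = sequence[start:start + width]
                if segment in found or segment in new_hits:
                    continue
                if score(matrix, segment) >= threshold:
                    new_hits.append(segment)
        if not new_hits:
            break  # no new sequences found, so the procedure stops
        found.extend(new_hits)
    return found


if __name__ == "__main__":
    seed = ["GKSG", "GKTG"]
    database = ["MMGKSAMM", "GRSA"]
    print(iterative_search(seed, database, threshold=2.0))
```

In this toy data set one segment scores below the threshold against the initial matrix but above it once the first hit has been added and the matrix recalculated, which illustrates how the recalculation step can increase sensitivity.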