The present invention relates generally to the field of plant gums and other hydroxyproline-rich glycoproteins, and in particular, to the expression of synthetic genes designed from repetitive peptide sequences.
Gummosis is a common wound response that results in the exudation of a gum sealant at the site of cracks in bark. A. M. Stephen et al., “Exudate Gums”, Methods Plant Biochem. (1990). Generally the exudate is a composite of polysaccharides and glycoproteins structurally related to cell wall components such as galactans [G. O. Aspinall, “Plant Gums”, The Carbohydrates 2B:522536 (1970)] and hydroxyproline-rich glycoproteins [Anderson and McDougall, “The chemical characterization of the gum exudates from eight Australian Acacia species of the series Phyllodineae.” Food Hydrocolloids, 2: 329 (1988)].
Gum arabic is probably the best characterized of these exudates (although it has been largely refractory to chemical analysis). It is a natural plant exudate secreted by various species of Acacia trees. Acacia senegal accounts for approximately 80% of the production of gum arabic, with Acacia seyal, Acacia laeta, Acacia campylacantha, and Acacia drepanolobium supplying the remaining 20%. The gum is gathered by hand in Africa. It is a tedious process involving piercing and stripping the bark of the trees, then returning later to gather the dried, teardrop-shaped balls that form in response to mechanical wounding.
The exact chemical nature of gum arabic has not been elucidated. It is believed to consist of two major components, a microheterogeneous glucurono-arabinorhamnogalactan polysaccharide and a higher molecular weight hydroxyproline-rich glycoprotein. Osman et al., “Characterization of Gum Arabic Fractions Obtained By Anion-Exchange Chromatography” Phytochemistry 38:409 (1984) and Qi et al., “Gum Arabic Glycoprotein Is A Twisted Hairy Rope” Plant Physiol. 96:848 (1991). While the amino acid composition of the protein portion has been examined, little is known with regard to the precise amino acid sequence.
While the precise chemical nature of gum arabic is elusive, the gum is nonetheless particularly useful due to its high solubility and low viscosity compared to other gums. The FDA declared the gum to be a GRAS food additive. Consequently, it is widely used in the food industry as a thickener, emulsifier, stabilizer, surfactant, protective colloid, and flavor fixative or preservative. J. Dziezak, “A Focus on Gums” Food Technology (March 1991). It is also used extensively in the cosmetics industry.
Normally, the world production of gum arabic is over 100,000 tons per year. However, this production depends on the environmental and political stability of the region producing the gum. In the early 1970s, for example, a severe drought reduced gum production to 30,000 tons. Again in 1985, drought brought about shortages of the gum, resulting in a 600% price increase.
Three approaches have been used to deal with the somewhat precarious supply problem of gum arabic. First, other gums have been sought out in other regions of the world. Second, additives have been investigated to supplement inferior gum arabic. Third, production has been investigated in cultured cells.
The effort to find other gums in other regions of the world has met with some limited success. However, the solubility of gum arabic from Acacia is superior to other gums because it dissolves well in either hot or cold water. Moreover, while other exudates are limited to a 5% solution because of their excessive viscosity, gum arabic can be dissolved readily to make 55% solutions.
Some additives have been identified to supplement gum arabic. For example, whey proteins can be used to increase the functionality of gum arabic. A. Prakash et al., “The effects of added proteins on the functionality of gum arabic in soft drink emulsion systems,” Food Hydrocolloids 4:177 (1990). However, this approach has limitations. Only low concentrations of such additives can be used without producing off-flavors in the final food product.
Attempts to produce gum arabic in cultured Acacia senegal cells have been explored. Unfortunately, conditions have not been found which lead to the expression of gum arabic in culture. A. Mollard and J-P. Joseleau, “Acacia senegal cells cultured in suspension secrete a hydroxyproline-deficient arabinogalactan-protein” Plant Physiol. Biochem. 32:703 (1994).
Clearly, new approaches to improve gum arabic production are needed. Such approaches should not be dependent on environmental or political factors. Ideally, such approaches should simplify production and be relatively inexpensive.
The present invention involves a new approach in the field of plant gums and presents a new solution to the production of hydroxyproline(Hyp)-rich glycoproteins (HRGPs), repetitive proline-rich proteins (RPRPs) and arabinogalactan-proteins (AGPs). The present invention contemplates the expression of synthetic genes designed from repetitive peptide sequences of such glycoproteins, including the peptide sequences of gum arabic glycoprotein (GAGP).
With respect to GAGP, the present invention contemplates a substantially purified polypeptide comprising at least a portion of the amino acid sequence Ser-Hyp-Hyp-Hyp-[Hyp/Thr]-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:1 and SEQ ID NO:2) or variants thereof. By “variants” it is meant that the sequence need not comprise the exact sequence; up to five (5) amino acid substitutions are contemplated. For example, a Leu or Hyp may be substituted for the Gly; Leu may also be substituted for Ser and one or more Hyp. By “variants” it is also meant that the sequence need not be the entire nineteen (19) amino acids. Illustrative variants are shown in Table 3. In one preferred embodiment, variants contain one or more of the following three motifs: Ser-Hyp4, Ser-Hyp3-Thr, and Xaa-Hyp-Xaa-Hyp, where Xaa is any amino acid other than hydroxyproline.
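By way of illustration only, the three motifs named above can be located in a peptide written in three-letter codes. The following is a minimal sketch; the function name `find_motifs` and the hyphen-separated string representation are assumptions made for the example and are not part of the disclosure.

```python
# Illustrative sketch: locate the three preferred motifs in a peptide.
# "Xaa" in a pattern means any residue other than Hyp, as defined in the text.
MOTIFS = {
    "Ser-Hyp4": ["Ser", "Hyp", "Hyp", "Hyp", "Hyp"],
    "Ser-Hyp3-Thr": ["Ser", "Hyp", "Hyp", "Hyp", "Thr"],
    "Xaa-Hyp-Xaa-Hyp": ["Xaa", "Hyp", "Xaa", "Hyp"],
}


def _fits(residue, slot):
    """True if a residue satisfies one slot of a motif pattern."""
    return residue != "Hyp" if slot == "Xaa" else residue == slot


def find_motifs(peptide):
    """Return {motif name: [0-based start indices]} for a hyphen-separated peptide."""
    residues = peptide.split("-")
    hits = {}
    for name, pattern in MOTIFS.items():
        width = len(pattern)
        hits[name] = [
            i
            for i in range(len(residues) - width + 1)
            if all(_fits(r, s) for r, s in zip(residues[i:i + width], pattern))
        ]
    return hits


# The two forms of the 19-residue sequence (position 5 is Hyp or Thr).
HYP_FORM = "Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His"
THR_FORM = "Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His"

if __name__ == "__main__":
    print(find_motifs(HYP_FORM))
    print(find_motifs(THR_FORM))
```

In this sketch the Hyp form carries Ser-Hyp4 at its first residue and the Thr form carries Ser-Hyp3-Thr there, while both forms carry overlapping Xaa-Hyp-Xaa-Hyp occurrences in the central Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp stretch.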
Indeed, it is not intended that the present invention be limited by the precise length of the purified polypeptide. In one embodiment, the peptide comprises more than twelve (12) amino acids from the nineteen (19) amino acids of the sequence. In another embodiment, a portion of the nineteen (19) amino acids (see SEQ ID NO:1 and SEQ ID NO:2) is utilized as a repetitive sequence. In yet another embodiment, all nineteen (19) amino acids (see SEQ ID NO:1 and SEQ ID NO:2), with or without amino acid substitutions, are utilized as a repetitive sequence.
It is not intended that the present invention be limited by the precise number of repeats. The sequence (i.e. SEQ ID NO:1 and SEQ ID NO:2) or variants thereof may be used as a repeating sequence between one (1) and up to fifty (50) times, more preferably between ten (10) and up to thirty (30) times, and most preferably approximately twenty (20) times. The sequence (i.e. SEQ ID NO:1 and SEQ ID NO:2) or variants thereof may be used as contiguous repeats or may be used as non-contiguous repeats (with other amino acids, or amino acid analogues, placed between the repeating sequences).
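As a hedged illustration of contiguous versus non-contiguous repeats, the sketch below assembles a repeating polypeptide from a unit sequence. The helper `build_repeats` and the spacer "Gly-Ala" are hypothetical, chosen only for the example; the 1 to 50 bound mirrors the range stated above.

```python
# Illustrative sketch: assemble a repeat polypeptide from a unit sequence.
# Peptides are hyphen-separated three-letter codes (an assumed representation).
GAGP_UNIT = "Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His"


def build_repeats(unit, copies, spacer=None):
    """Join `copies` of `unit`; an optional `spacer` makes the repeats non-contiguous."""
    if not 1 <= copies <= 50:
        raise ValueError("copies must be between 1 and 50")
    unit_residues = unit.split("-")
    spacer_residues = spacer.split("-") if spacer else []
    residues = []
    for k in range(copies):
        if k > 0:
            residues.extend(spacer_residues)  # only between repeats
        residues.extend(unit_residues)
    return "-".join(residues)


if __name__ == "__main__":
    contiguous = build_repeats(GAGP_UNIT, 20)
    spaced = build_repeats(GAGP_UNIT, 3, spacer="Gly-Ala")  # hypothetical spacer
    print(len(contiguous.split("-")), len(spaced.split("-")))
```

Twenty contiguous copies of the 19-residue unit give 380 residues; three copies separated by a two-residue spacer give 61.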
The present invention specifically contemplates fusion proteins comprising a non-gum arabic protein or glycoprotein sequence and a portion of the gum arabic glycoprotein sequence (SEQ ID NO:1 and SEQ ID NO:2). It is not intended that the present invention be limited by the nature of the non-gum arabic glycoprotein sequence. In one embodiment, the non-gum arabic glycoprotein sequence is a green fluorescent protein.
As noted above, the present invention contemplates synthetic genes encoding such peptides. By “synthetic genes” it is meant that the nucleic acid sequence is derived using the peptide sequence of interest (in contrast to using the nucleic acid sequence from cDNA). In one embodiment, the present invention contemplates an isolated polynucleotide sequence encoding a polypeptide comprising at least a portion of the polypeptide of SEQ ID NO:1 and SEQ ID NO:2 or variants thereof. The present invention specifically contemplates a polynucleotide sequence comprising a nucleotide sequence encoding a polypeptide comprising one or more repeats of SEQ ID NO:1 and SEQ ID NO:2 or variants thereof. Importantly, it is not intended that the present invention be limited to the precise nucleic acid sequence encoding the polypeptide of interest.
The present invention contemplates synthetic genes encoding portions of HRGPs, wherein the encoded peptides contain one or more of the highly conserved Ser-Hyp4 (SEQ ID NO:3) motif(s). The present invention also contemplates synthetic genes encoding portions of RPRPs, wherein the encoded peptides contain one or more of the pentapeptide motif: Pro-Hyp-Val-Tyr-Lys (SEQ ID NO:4) and variants of this sequence such as X-Hyp-Val-Tyr-Lys (SEQ ID NO:5) and Pro-Hyp-Val-X-Lys (SEQ ID NO:6) and Pro-Pro-X-Tyr-Lys and Pro-Pro-X-Tyr-X (SEQ ID NO:8), where “X” can be Thr, Glu, Hyp, Pro, His and Ile. The present invention also contemplates synthetic genes encoding portions of AGPs, wherein the encoded peptides contain one or more Xaa-Hyp-Xaa-Hyp (SEQ ID NO:9) repeats. Such peptides can be expressed in a variety of forms, including but not limited to fusion proteins.
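Deriving a nucleic acid sequence from a peptide sequence can be sketched as a simple back-translation. The example below is illustrative only: the codon chosen for each residue is an arbitrary assumption (any synonymous codon would serve), and Hyp is written with a proline codon because hydroxyproline arises by post-translational hydroxylation of proline and has no codon of its own.

```python
# Illustrative sketch: back-translate a peptide into one possible coding sequence.
# Codon choices are assumptions for the example, not sequences from the disclosure.
CODON_FOR = {
    "Ser": "TCA",
    "Pro": "CCA",
    "Hyp": "CCA",  # encoded as Pro; hydroxylation occurs after translation
    "Thr": "ACT",
    "Leu": "CTT",
    "Gly": "GGA",
    "His": "CAC",
}

GAGP_UNIT = "Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His"


def back_translate(peptide):
    """Return a DNA string with one codon per residue of a hyphen-separated peptide."""
    return "".join(CODON_FOR[residue] for residue in peptide.split("-"))


if __name__ == "__main__":
    coding = back_translate(GAGP_UNIT)
    print(coding)
    print(len(coding))  # 57 nucleotides for the 19-residue unit
```

Because many codon assignments encode the same peptide, a sequence produced this way is only one of a large family of equivalent synthetic genes.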
With regard to motifs for HRGPs, the present invention contemplates a polynucleotide sequence comprising the sequence: 5′-CCA CCA CCT TCA CCT CCA CCC CCA TCT CCA-3′ (SEQ ID NO:10). With regard to motifs for AGPs, the present invention contemplates a polynucleotide sequence comprising the sequence: 5′-TCA CCA TCA CCA TCT CCT TCG CCA TCA CCC-3′ (SEQ ID NO:11). Of course, it is not intended that the present invention be limited by the particular sequence. Indeed, the present invention specifically contemplates sequences that are not identical but are nonetheless homologous to the sequences of SEQ ID NOS: 10 and 11. The present invention also contemplates sequences that are complementary (including sequences that are only partially complementary) to the sequences of SEQ ID NOS: 10 and 11. Such complementary sequences include sequences that will hybridize to the sequences of SEQ ID NOS: 10 and 11 under low stringency conditions as well as high stringency conditions (see Definitions below).
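For illustration, the two polynucleotide sequences above can be translated with the standard genetic code and their fully complementary strands computed. The sketch below is not part of the disclosure; the function names are assumptions, and the output uses one-letter codes in which P denotes proline (the residue that may later be hydroxylated to Hyp) and S denotes serine.

```python
# Illustrative sketch: translate SEQ ID NO:10 and NO:11 and compute complements.
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

SEQ_ID_NO_10 = "CCA CCA CCT TCA CCT CCA CCC CCA TCT CCA"
SEQ_ID_NO_11 = "TCA CCA TCA CCA TCT CCT TCG CCA TCA CCC"


def translate(dna):
    """Translate a 5'-to-3' coding strand; spaces are ignored, partial codons dropped."""
    dna = dna.replace(" ", "").upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def reverse_complement(dna):
    """Return the fully complementary strand, written 5' to 3'."""
    pairs = str.maketrans("ACGT", "TGCA")
    return dna.replace(" ", "").upper().translate(pairs)[::-1]


if __name__ == "__main__":
    print(translate(SEQ_ID_NO_10))  # PPPSPPPPSP
    print(translate(SEQ_ID_NO_11))  # SPSPSPSPSP
    print(reverse_complement(SEQ_ID_NO_10))
```

The first translation contains a serine followed by four prolines, consistent with the Ser-Hyp4 motif after hydroxylation, and the second is five Ser-Pro dipeptides, consistent with Xaa-Hyp-Xaa-Hyp repeats.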
The present invention also contemplates the mixing of motifs (i.e. modules) which are not found in wild-type sequences. For example, one might add GAGP modules to extensin and RPRP crosslinking modules to AGP-like molecules.
The present invention contemplates using the polynucleotides of the present invention for expression of the polypeptides in vitro and in vivo. Therefore, the present invention contemplates polynucleotide sequences encoding two or more repeats of the sequence of SEQ ID NO:1 and SEQ ID NO:2 or variants thereof, wherein said polynucleotide sequence is contained on a recombinant expression vector. It is also contemplated that such vectors will be introduced into a variety of host cells, both eukaryotic and prokaryotic (e.g. bacteria such as E. coli).
In one embodiment, the vector further comprises a promoter. It is not intended that the present invention be limited to a particular promoter. Any promoter sequence which is capable of directing expression of an operably linked nucleic acid sequence encoding a portion of a plant gum polypeptide (or other hydroxyproline-rich polypeptide of interest as described above) is contemplated to be within the scope of the invention. Promoters include, but are not limited to, promoter sequences of bacterial, viral and plant origins. Promoters of bacterial origin include, but are not limited to, the octopine synthase promoter, the nopaline synthase promoter and other promoters derived from native Ti plasmids. Viral promoters include, but are not limited to, the 35S and 19S RNA promoters of cauliflower mosaic virus (CaMV), and T-DNA promoters from Agrobacterium. Plant promoters include, but are not limited to, the ribulose-1,5-bisphosphate carboxylase small subunit promoter, maize ubiquitin promoters, the phaseolin promoter, the E8 promoter, and the Tob7 promoter.
The invention is not limited to the number of promoters used to control expression of a nucleic acid sequence of interest. Any number of promoters may be used so long as expression of the nucleic acid sequence of interest is controlled in a desired manner. Furthermore, the selection of a promoter may be governed by the desirability that expression be over the whole plant, or localized to selected tissues of the plant, e.g., root, leaves, fruit, etc. For example, promoters active in flowers are known (Benfy et al. (1990) Plant Cell 2:849-856).
The promoter activity of any nucleic acid sequence in host cells may be determined (i.e., measured or assessed) using methods well known in the art and exemplified herein. For example, a candidate promoter sequence may be tested by ligating it in-frame to a reporter gene sequence to generate a reporter construct, introducing the reporter construct into host cells (e.g. tomato or potato cells) using methods described herein, and detecting the expression of the reporter gene (e.g., detecting the presence of encoded mRNA or encoded protein, or the activity of a protein encoded by the reporter gene). The reporter gene may confer antibiotic or herbicide resistance. Examples of reporter genes include, but are not limited to, dhfr, which confers resistance to methotrexate [Wigler M et al., (1980) Proc Natl Acad Sci 77:3567-70]; npt, which confers resistance to the aminoglycosides neomycin and G-418 [Colbere-Garapin F et al., (1981) J. Mol. Biol. 150:1-14]; and als or pat, which confer resistance to chlorsulfuron and phosphinothricin, respectively (pat encodes phosphinothricin acetyl transferase). Recently, the use of a reporter gene system which expresses visible markers has gained popularity, with such markers as β-glucuronidase and its substrate (X-Gluc), luciferase and its substrate (luciferin), and β-galactosidase and its substrate (X-Gal) being widely used not only to identify transformants, but also to quantify the amount of transient or stable protein expression attributable to a specific vector system [Rhodes C A et al. (1995) Methods Mol Biol 55:121-131].
In addition to a promoter sequence, the expression construct preferably contains a transcription termination sequence downstream of the nucleic acid sequence of interest to provide for efficient termination. In one embodiment, the termination sequence is the nopaline synthase (NOS) sequence. In another embodiment, the termination region comprises different fragments of the sugarcane ribulose-1,5-bisphosphate carboxylase/oxygenase (rubisco) small subunit (scrbcs) gene. The termination sequences of the expression constructs are not critical to the invention. The termination sequence may be obtained from the same gene as the promoter sequence or may be obtained from different genes.
If the mRNA encoded by the nucleic acid sequence of interest is to be efficiently translated, polyadenylation sequences are also commonly added to the expression construct. Examples of the polyadenylation sequences include, but are not limited to, the Agrobacterium octopine synthase signal, or the nopaline synthase signal.
The invention is not limited to constructs which express a single nucleic acid sequence of interest. Constructs which contain a plurality of (i.e., two or more) nucleic acid sequences under the transcriptional control of the same promoter sequence are expressly contemplated to be within the scope of the invention. Also included within the scope of this invention are constructs which contain the same or different nucleic acid sequences under the transcriptional control of different promoters. Such constructs may be desirable to, for example, target expression of the same or different nucleic acid sequences of interest to selected plant tissues.
As noted above, the present invention contemplates using the polynucleotides of the present invention for expression of a portion of plant gum polypeptides in vitro and in vivo. Where expression takes place in vivo, the present invention contemplates transgenic plants. The transgenic plants of the invention are not limited to plants in which each and every cell expresses the nucleic acid sequence of interest. Included within the scope of this invention is any plant (e.g. tobacco, tomato, maize, algae, etc.) which contains at least one cell which expresses the nucleic acid sequence of interest. It is preferred, though not necessary, that the transgenic plant express the nucleic acid sequence of interest in more than one cell, and more preferably in one or more tissues. It is particularly preferred that expression be followed by proper glycosylation of the plant gum polypeptide fragment or variant thereof, such that the host cell produces functional (e.g. in terms of use in the food or cosmetic industry) plant gum polypeptide.
The fact that transformation of plant cells has taken place with the nucleic acid sequence of interest may be determined using any number of methods known in the art. Such methods include, but are not limited to, restriction mapping of genomic DNA, PCR analysis, DNA-DNA hybridization, DNA-RNA hybridization, and DNA sequence analysis.
Expressed polypeptides (or fragments thereof) can be immobilized (covalently or non-covalently) on solid supports or resins for use in isolating HRGP-binding molecules from a variety of sources (e.g. algae, plants, animals, microorganisms). Such polypeptides can also be used to make antibodies.
The invention further provides a substantially purified polypeptide comprising at least a portion of the gum arabic consensus sequence. In particular, the invention provides a substantially purified polypeptide comprising at least a portion of amino acid sequence A-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:136), wherein A is selected from Ser, Thr, and Ala; B is selected from Hyp, Pro, Leu, and Ile; C is selected from Pro and Hyp; D is selected from Hyp, Pro, Ser, Thr, and Ala; E is selected from Leu and Ile; F is selected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr, Ala, and Ile; H is selected from Hyp, Pro, Leu, and Ile; I is selected from Thr, Ala, and Ser; J is selected from Thr, Ser, and Ala; K is selected from Thr, Leu, Hyp, Ser, Ala, and Ile; L is selected from Gly, Leu, Ala, and Ile; and M is selected from His and Pro; and wherein the portion is greater than twelve contiguous amino acids of the amino acid sequence. In a preferred embodiment, the portion occurs in the polypeptide as a repeating sequence. In a more preferred embodiment, the repeating sequence repeats from 1 to 64 times. In an alternative preferred embodiment, A is Ser; B is selected from Hyp and Leu; D is selected from Hyp, Ser, and Thr; E is Leu; F is Ser; G is selected from Ser, Leu, and Hyp; H is selected from Hyp, Pro, and Leu; I is selected from Thr and Ala; J is Thr; K is selected from Thr, Leu, and Hyp; L is selected from Gly and Leu; and M is selected from His and Pro.
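The consensus of SEQ ID NO:136 can be read as a table of permitted residues at each of its nineteen positions. The sketch below, offered only as an illustration, encodes that table and tests whether a peptide is a contiguous portion of more than twelve residues that fits the consensus at some offset. The function name `fits_consensus_portion` and the string representation are assumptions for the example.

```python
# Illustrative sketch: position-by-position check against the SEQ ID NO:136 consensus.
# Each entry is the set of residues allowed at that position (broadest definition).
CONSENSUS = [
    {"Ser", "Thr", "Ala"},                       # A
    {"Hyp"},
    {"Hyp", "Pro", "Leu", "Ile"},                # B
    {"Pro", "Hyp"},                              # C
    {"Hyp", "Pro", "Ser", "Thr", "Ala"},         # D
    {"Leu", "Ile"},                              # E
    {"Ser", "Thr", "Ala"},                       # F
    {"Hyp"},
    {"Ser", "Leu", "Hyp", "Thr", "Ala", "Ile"},  # G
    {"Hyp", "Pro", "Leu", "Ile"},                # H
    {"Thr", "Ala", "Ser"},                       # I
    {"Hyp"},
    {"Thr", "Ser", "Ala"},                       # J
    {"Hyp"},
    {"Hyp"},
    {"Thr", "Leu", "Hyp", "Ser", "Ala", "Ile"},  # K
    {"Gly", "Leu", "Ala", "Ile"},                # L
    {"Pro"},
    {"His", "Pro"},                              # M
]


def fits_consensus_portion(peptide, min_len=13):
    """True if the peptide (hyphen-separated) is a contiguous consensus portion
    of at least `min_len` residues, i.e. greater than twelve by default."""
    residues = peptide.split("-")
    n = len(residues)
    if n < min_len or n > len(CONSENSUS):
        return False
    for offset in range(len(CONSENSUS) - n + 1):
        if all(res in CONSENSUS[offset + i] for i, res in enumerate(residues)):
            return True
    return False


if __name__ == "__main__":
    full = "Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His"
    print(fits_consensus_portion(full))
```

A nineteen-residue peptide is compared at a single offset, while a shorter portion is slid along the consensus until a fitting position is found or the offsets are exhausted.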
In another alternative embodiment, the amino acid sequence is selected from Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:143), Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:144), Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Thr-Gly-Pro-His (SEQ ID NO:145), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-Hyp (SEQ ID NO:146), Ser-Hyp-Leu-Pro-Thr-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:147), Ser-Hyp-Leu-Pro-Thr-Leu-Ser-Hyp-Leu-Pro-Ala-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:148), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-Hyp (SEQ ID NO:149), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:150), Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:151), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Ala-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His (SEQ ID NO:152), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:153), Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:154), Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Hyp-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Leu-Pro-His (SEQ ID NO:155), Hyp-Hyp-Thr-Leu-Ser-Hyp-Hyp-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Leu-Pro (SEQ ID NO:156), Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu (SEQ ID NO:157), Hyp-Hyp-Leu-Ser-Hyp-Leu-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:158), Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp (SEQ ID NO:159), Leu-Ser-Hyp-Ser-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-Hyp (SEQ ID NO:160), Hyp-Thr-Leu-Ser-Hyp-Leu-Pro-Ala-Hyp-Thr-Hyp-Hyp-Hyp-Gly (SEQ ID NO:161), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp (SEQ ID NO:162), Ser-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Thr (SEQ ID NO:163), Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp (SEQ ID NO:164), Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:165), 
Hyp-Hyp-Thr-Leu-Ser-Hyp-Hyp-Leu-Thr-Hyp (SEQ ID NO:166), Ser-Hyp-Hyp-Hyp-Ser-Leu-Ser-Hyp-Leu-Pro (SEQ ID NO:167), Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:168), Hyp-Leu-Ser-Hyp-Ser-Hyp-Ala-Hyp (SEQ ID NO:169), Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser (SEQ ID NO:170), Thr-Hyp-Hyp-Hyp-Gly-Pro (SEQ ID NO:171), Hyp-Hyp-Leu-Ser-Hyp-Ser (SEQ ID NO:172), Ser-Hyp-Leu-Pro-Ala-Hyp (SEQ ID NO:173), Leu-Pro-Thr-Leu-Ser-Hyp (SEQ ID NO:174), Ser-Hyp-Ser-Hyp (SEQ ID NO:175), Ser-Hyp-Thr-Hyp (SEQ ID NO:176), Thr-Hyp-Thr-Hyp (SEQ ID NO:177), Thr-Hyp-Hyp-Hyp (SEQ ID NO:178), Ser-Hyp-Pro-Pro-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:217), Ser-Hyp-Hyp-Pro-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:218), Ser-Hyp-Pro-Hyp-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:219), Ser-Hyp-Pro-Pro-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:220), Ser-Hyp-Hyp-Hyp-Pro-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:221), Ser-Hyp-Hyp-Pro-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:222), Ser-Hyp-Pro-Hyp-Hyp-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:223), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Pro-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:224), Ser-Hyp-Hyp-Hyp-Hyp-Leu-Ser-Hyp-Ser-Leu-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:225), Ser-Hyp-Hyp-Hyp-Thr-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Hyp-Gly-Pro-His-Ser-Hyp-Hyp-Hyp-(Hyp) (SEQ ID NO:18), Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-Leu-Gly-Pro-His (SEQ ID NO:23), Ser-Hyp-Hyp-Hyp-A-Leu-Ser-Hyp-Ser-Hyp-Thr-Hyp-Thr-Hyp-Hyp-B-Gly-Pro-His (SEQ ID NO:179), where A is selected from Hyp, Thr, and Ser, and B is selected from Hyp and Lys, SEQ ID NO:131, and SEQ ID NO:133.
In yet another alternative embodiment, the portion comprises a motif selected from (Xaa-Hyp)x (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein Xaa is any amino acid other than hydroxyproline, and wherein x is from 2 to 1000. In a preferred embodiment, the portion comprises the sequence Xaa-Hyp-Xaa-Hyp (SEQ ID NO:9), wherein Xaa is selected from Ser, Thr, and Ala. In a further alternative embodiment, the portion comprises a motif selected from Xaa-Hyp-Hypn (SEQ ID NO:209) and Xaa-Pro-Hypn (SEQ ID NO:210), wherein n is from 1 to 100, and wherein Xaa is any amino acid other than hydroxyproline. In a preferred embodiment, the portion comprises a peptide sequence selected from Ser-Hyp2 (SEQ ID NO:211), Ser-Hyp3 (SEQ ID NO:212), Ser-Hyp4 (SEQ ID NO:3), Thr-Hyp2 (SEQ ID NO:213), and Thr-Hyp3 (SEQ ID NO:214). In an additional alternative embodiment, the portion comprises a peptide sequence selected from Ser-Hyp2-Pro (SEQ ID NO:215) and Ser-Hyp2-Pro-Hyp (SEQ ID NO:216).
The invention further provides a substantially purified polypeptide comprising a non-contiguous hydroxyproline motif. In particular, the invention provides a substantially purified polypeptide comprising a first motif selected from (Xaa-Hyp)x (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein Xaa is any amino acid other than hydroxyproline, and wherein x is from 2 to 1000. In one embodiment, the sequence is Xaa-Hyp-Xaa-Hyp (SEQ ID NO:9), wherein Xaa is selected from Ser, Thr, and Ala. In an alternative embodiment, the polypeptide further comprises a contiguous hydroxyproline motif (i.e., a second motif) selected from Xaa-Hyp-Hypn (SEQ ID NO:209) and Xaa-Pro-Hypn (SEQ ID NO:210), wherein n is from 1 to 100, and wherein Xaa is any amino acid other than hydroxyproline. In a preferred embodiment, the first and second motifs alternate in the polypeptide. In a more preferred embodiment, the alternating first and second motifs repeat from 1 to 500 times.
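The non-contiguous and contiguous hydroxyproline motifs, and their alternation, can be sketched as simple sequence builders. The example is illustrative only and rests on one stated assumption: Xaa-Hyp-Hypn is read as Xaa followed by one Hyp and then n further Hyp residues, so that n = 1 gives Ser-Hyp2 and n = 3 gives Ser-Hyp4. All function names are hypothetical.

```python
# Illustrative sketch: build non-contiguous and contiguous Hyp motifs and alternate them.
# Peptides are hyphen-separated three-letter codes (an assumed representation).


def noncontiguous_motif(xaa, x):
    """(Xaa-Hyp)x with x from 2 to 1000; Xaa may not be Hyp."""
    if xaa == "Hyp":
        raise ValueError("Xaa must be a residue other than Hyp")
    if not 2 <= x <= 1000:
        raise ValueError("x must be between 2 and 1000")
    return "-".join([xaa, "Hyp"] * x)


def contiguous_motif(xaa, n):
    """Xaa-Hyp-Hypn, read here as Xaa, one Hyp, then n more Hyp (n from 1 to 100)."""
    if xaa == "Hyp":
        raise ValueError("Xaa must be a residue other than Hyp")
    if not 1 <= n <= 100:
        raise ValueError("n must be between 1 and 100")
    return "-".join([xaa] + ["Hyp"] * (1 + n))


def alternate(first, second, repeats):
    """Alternate the first and second motifs, repeating the pair 1 to 500 times."""
    if not 1 <= repeats <= 500:
        raise ValueError("repeats must be between 1 and 500")
    return "-".join([first, second] * repeats)


if __name__ == "__main__":
    first = noncontiguous_motif("Ser", 2)   # Ser-Hyp-Ser-Hyp
    second = contiguous_motif("Ser", 3)     # Ser-Hyp-Hyp-Hyp-Hyp, i.e. Ser-Hyp4
    print(alternate(first, second, 2))
```

With these inputs each alternating pair contributes nine residues, so two repeats yield an eighteen-residue polypeptide.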
Also provided herein is a substantially purified polypeptide comprising a motif selected from Xaa-Hyp-Hypn (SEQ ID NO:209) and Xaa-Pro-Hypn (SEQ ID NO:210), wherein n is from 1 to 100, and wherein Xaa is any amino acid other than hydroxyproline. In one embodiment, the portion comprises a peptide sequence selected from Ser-Hyp2 (SEQ ID NO:211), Ser-Hyp3 (SEQ ID NO:212), Ser-Hyp4 (SEQ ID NO:3), Thr-Hyp2 (SEQ ID NO:213), and Thr-Hyp3 (SEQ ID NO:214).
The invention also provides a fusion protein comprising a first sequence selected from a non-gum arabic protein sequence and a non-gum arabic glycoprotein sequence operably linked to at least a portion of an amino acid sequence selected from (a) A-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:136), wherein A is selected from Ser, Thr, and Ala; B is selected from Hyp, Pro, Leu, and Ile; C is selected from Pro and Hyp; D is selected from Hyp, Pro, Ser, Thr, and Ala; E is selected from Leu and Ile; F is selected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr, Ala, and Ile; H is selected from Hyp, Pro, Leu, and Ile; I is selected from Thr, Ala, and Ser; J is selected from Thr, Ser, and Ala; K is selected from Thr, Leu, Hyp, Ser, Ala, and Ile; L is selected from Gly, Leu, Ala, and Ile; and M is selected from His and Pro; and wherein the portion is greater than twelve contiguous amino acids of the amino acid sequence, (b) a polypeptide comprising a first motif selected from (Xaa-Hyp)x (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein x is from 2 to 1000, (c) a polypeptide comprising a second motif selected from Xaa-Hyp-Hypn (SEQ ID NO:209) and Xaa-Pro-Hypn (SEQ ID NO:210), wherein n is from 1 to 500, and (d) a polypeptide comprising the first motif and the second motif, wherein Xaa is any amino acid other than hydroxyproline. In one embodiment, the first sequence is a green fluorescent protein amino acid sequence.
Also provided by the invention is an isolated polynucleotide sequence encoding at least a portion of an amino acid sequence selected from (a) A-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:136), wherein A is selected from Ser, Thr, and Ala; B is selected from Hyp, Pro, Leu, and Ile; C is selected from Pro and Hyp; D is selected from Hyp, Pro, Ser, Thr, and Ala; E is selected from Leu and Ile; F is selected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr, Ala, and Ile; H is selected from Hyp, Pro, Leu, and Ile; I is selected from Thr, Ala, and Ser; J is selected from Thr, Ser, and Ala; K is selected from Thr, Leu, Hyp, Ser, Ala, and Ile; L is selected from Gly, Leu, Ala, and Ile; and M is selected from His and Pro; and wherein the portion is greater than twelve contiguous amino acids of the amino acid sequence, (b) a polypeptide comprising a first motif selected from (Xaa-Hyp)x (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein x is from 2 to 1000, (c) a polypeptide comprising a second motif selected from Xaa-Hyp-Hypn (SEQ ID NO:209) and Xaa-Pro-Hypn (SEQ ID NO:210), wherein n is from 1 to 500, and (d) a polypeptide comprising the first motif and the second motif, wherein Xaa is any amino acid other than hydroxyproline.
The invention further provides a recombinant expression vector comprising a polynucleotide sequence encoding a portion of an amino acid sequence selected from (a) A-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:136), wherein A is selected from Ser, Thr, and Ala; B is selected from Hyp, Pro, Leu, and Ile; C is selected from Pro and Hyp; D is selected from Hyp, Pro, Ser, Thr, and Ala; E is selected from Leu and Ile; F is selected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr, Ala, and Ile; H is selected from Hyp, Pro, Leu, and Ile; I is selected from Thr, Ala, and Ser; J is selected from Thr, Ser, and Ala; K is selected from Thr, Leu, Hyp, Ser, Ala, and Ile; L is selected from Gly, Leu, Ala, and Ile; and M is selected from His and Pro; and wherein the portion is greater than twelve contiguous amino acids of the amino acid sequence, (b) a polypeptide comprising a first motif selected from (Xaa-Hyp)x (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein x is from 2 to 1000, (c) a polypeptide comprising a second motif selected from Xaa-Hyp-Hypn (SEQ ID NO:209) and Xaa-Pro-Hypn (SEQ ID NO:210), wherein n is from 1 to 500, and (d) a polypeptide comprising the first motif and the second motif, wherein Xaa is any amino acid other than hydroxyproline. In one embodiment, the expression vector further comprises a promoter operably linked to the polynucleotide sequence. In a preferred embodiment, the promoter is a viral promoter. In a more preferred embodiment, the viral promoter is selected from the group consisting of the 35S and 19S RNA promoters of cauliflower mosaic virus. In an alternative preferred embodiment, the expression vector further comprises a signal sequence selected from extensin signal sequence (SEQ ID NO:14), and tomato arabinogalactan-protein signal sequence (SEQ ID NO:215). In a more preferred embodiment, the expression vector further comprises a reporter gene. 
In a yet more preferred embodiment, the reporter gene is the green fluorescent protein gene. In another embodiment, the vector is contained within a host cell. In a preferred embodiment, the host cell is a plant cell. In a more preferred embodiment, the plant cell expresses a glycoprotein comprising the portion.
Also provided herein is a method for producing at least a portion of a glycoprotein, comprising: a) providing: i) a recombinant expression vector comprising a polynucleotide sequence encoding at least a portion of an amino acid sequence selected from (a) A-Hyp-B-C-D-E-F-Hyp-G-H-I-Hyp-J-Hyp-Hyp-K-L-Pro-M (SEQ ID NO:136), wherein A is selected from Ser, Thr, and Ala; B is selected from Hyp, Pro, Leu, and Ile; C is selected from Pro and Hyp; D is selected from Hyp, Pro, Ser, Thr, and Ala; E is selected from Leu and Ile; F is selected from Ser, Thr, and Ala; G is selected from Ser, Leu, Hyp, Thr, Ala, and Ile; H is selected from Hyp, Pro, Leu, and Ile; I is selected from Thr, Ala, and Ser; J is selected from Thr, Ser, and Ala; K is selected from Thr, Leu, Hyp, Ser, Ala, and Ile; L is selected from Gly, Leu, Ala, and Ile; and M is selected from His and Pro; and wherein the portion is greater than twelve contiguous amino acids of the amino acid sequence, (b) a polypeptide comprising a first motif selected from (Xaa-Hyp)x (SEQ ID NO:182) and Xaa-Hyp-Xaa-Xaa-Hyp-Xaa (SEQ ID NO:183), wherein x is from 2 to 1000, (c) a polypeptide comprising a second motif selected from Xaa-Hyp-Hypn (SEQ ID NO:209) and Xaa-Pro-Hypn (SEQ ID NO:210), wherein n is from 1 to 500, and (d) a polypeptide comprising the first motif and the second motif, wherein Xaa is any amino acid other than hydroxyproline; and ii) a host cell; and b) introducing the vector into the host cell under conditions such that the portion is expressed. In one embodiment, the host cell is growing in culture. In a preferred embodiment, the method further comprises the step of c) recovering the portion from the host cell culture. In an alternative embodiment, the host cell is a plant cell. In a more preferred embodiment, the plant cell is derived from a plant selected from the family Leguminosae.