The estimated 50,000-100,000 genes scattered along the human chromosomes offer tremendous promise for the understanding, diagnosis, and treatment of human diseases. In addition, probes capable of specifically hybridizing to loci distributed throughout the human genome find applications in the construction of high resolution chromosome maps and in the identification of individuals.
In the past, the characterization of even a single human gene was a painstaking process, requiring years of effort. Recent developments in the areas of cloning vectors, DNA sequencing, and computer technology have merged to greatly accelerate the rate at which human genes can be isolated, sequenced, mapped, and characterized. Cloning vectors such as yeast artificial chromosomes (YACs) and bacterial artificial chromosomes (BACs) are able to accept DNA inserts ranging from 300 to 1000 kilobases (kb) or 100-400 kb in length, respectively, thereby facilitating the manipulation and ordering of DNA sequences distributed over great distances on the human chromosomes. Automated DNA sequencing machines permit the rapid sequencing of human genes. Bioinformatics software enables the comparison of nucleic acid and protein sequences, thereby assisting in the characterization of human gene products.
Currently, two different approaches are being pursued for identifying and characterizing the genes distributed along the human genome. In one approach, large fragments of genomic DNA are isolated, cloned, and sequenced. Potential open reading frames in these genomic sequences are identified using bioinformatics software. However, this approach entails sequencing large stretches of human DNA which do not encode proteins in order to find the protein encoding sequences scattered throughout the genome. In addition to requiring extensive sequencing, this approach relies on bioinformatics software which may mischaracterize the genomic sequences obtained. Thus, the software may produce false positives in which non-coding DNA is mischaracterized as coding DNA or false negatives in which coding DNA is mislabeled as non-coding DNA.
An alternative approach takes a more direct route to identifying and characterizing human genes. In this approach, complementary DNAs (cDNAs) are synthesized from isolated messenger RNAs (mRNAs) which encode human proteins. Using this approach, sequencing is only performed on DNA which is derived from protein coding portions of the genome. Often, only short stretches of the cDNAs are sequenced to obtain sequences called expressed sequence tags (ESTs). The ESTs may then be used to isolate or purify extended cDNAs which include sequences adjacent to the EST sequences. The extended cDNAs may contain all of the sequence of the EST which was used to obtain them or only a portion of the sequence of the EST which was used to obtain them. In addition, the extended cDNAs may contain the full coding sequence of the gene from which the EST was derived or, alternatively, the extended cDNAs may include portions of the coding sequence of the gene from which the EST was derived. It will be appreciated that there may be several extended cDNAs which include the EST sequence as a result of alternate splicing or the activity of alternative promoters. Alternatively, ESTs having partially overlapping sequences may be identified and contigs comprising the consensus sequences of the overlapping ESTs may be identified.
In the past, these short EST sequences were often obtained from oligo-dT primed cDNA libraries. Accordingly, they mainly corresponded to the 3′ untranslated region of the mRNA. In part, the prevalence of EST sequences derived from the 3′ end of the mRNA is a result of the fact that typical techniques for obtaining cDNAs are not well suited for isolating cDNA sequences derived from the 5′ ends of mRNAs (Adams et al., Nature 377:3-174, 1996; Hillier et al., Genome Res. 6:807-828, 1996).
In addition, in those reported instances where longer cDNA sequences have been obtained, the reported sequences typically correspond to coding sequences and do not include the full 5′ untranslated region (5′ UTR) of the mRNA from which the cDNA is derived. 5′ UTRs are often involved in the regulation of gene expression, by affecting either the stability or translation of mRNAs. Indeed, 5′ UTRs may contain several features known to affect the initiation of translation: (i) the distance between the cap structure and the initiation codon, (ii) the presence of cis-acting elements which may be either linear sequences such as polypyrimidine tracts (Kaspar et al., J. Biol. Chem. 267:508-514, 1992; Severson et al., Eur J Biochem 229:426-32, 1995) or secondary structures such as IREs (Rouault and Klausner, Curr Top Cell Regul 35:1-19, 1997), and (iii) upstream open reading frames or uORFs (Geballe and Morris, Trends Biochem Sci 19:159-64, 1994). Thus, regulation of gene expression may be achieved through the use of alternative 5′ UTRs. For instance, the translation of the tissue inhibitor of metalloprotease mRNA is enhanced in mitogenically activated cells through modification of the start codon of a uORF in its 5′ UTR using an alternative promoter (Waterhouse et al., J Biol Chem 265:5585-9, 1990). Furthermore, modification of the 5′ UTR through mutation, insertion or translocation events may even be implicated in pathogenesis. For instance, the fragile X syndrome, the most common cause of inherited mental retardation, is partly due to an insertion of multiple CGG trinucleotides in the 5′ UTR of the fragile X mRNA resulting in the inhibition of protein synthesis via ribosome stalling (Feng et al., Science 268:731-4, 1995).
An aberrant mutation in regions of the 5′ UTR known to inhibit translation of the proto-oncogene c-myc was shown to result in upregulation of c-myc protein levels in cells derived from patients with multiple myelomas (Willis et al., Curr Top Microbiol Immunol 224:269-76, 1997). However, the use of oligo-dT primed cDNA libraries does not allow the isolation of complete 5′ UTRs, since the incomplete sequences so obtained may not include the first exon of the mRNA, particularly in situations where the first exon is short. Furthermore, they may not include some exons, often short ones, which are located upstream of splicing sites. Thus, there is a need to obtain sequences derived from the 5′ ends of mRNAs.
While many sequences derived from human chromosomes have practical applications, approaches based on the identification and characterization of those chromosomal sequences which encode a protein product are particularly relevant to diagnostic and therapeutic uses. In some instances, the sequences used in such therapeutic or diagnostic techniques may be sequences which encode proteins which are secreted from the cell in which they are synthesized. Such sequences, as well as the secreted proteins themselves, are particularly valuable as potential therapeutic agents. Such proteins are often involved in cell to cell communication and may be responsible for producing a clinically relevant response in their target cells. In fact, several secretory proteins, including tissue plasminogen activator, G-CSF, GM-CSF, erythropoietin, human growth hormone, insulin, interferon-α, interferon-β, interferon-γ, and interleukin-2, are currently in clinical use. These proteins are used to treat a wide range of conditions, including acute myocardial infarction, acute ischemic stroke, anemia, diabetes, growth hormone deficiency, hepatitis, kidney carcinoma, chemotherapy-induced neutropenia and multiple sclerosis. For these reasons, extended cDNAs encoding secreted proteins or portions thereof represent a valuable source of therapeutic agents. Thus, there is a need for the identification and characterization of secreted proteins and the nucleic acids encoding them.
In addition to being therapeutically useful themselves, secretory proteins include short peptides, called signal peptides, at their amino termini which direct their secretion. These signal peptides are encoded by the signal sequences located at the 5′ ends of the coding sequences of genes encoding secreted proteins. These signal peptides can be used to direct the extracellular secretion of any protein to which they are operably linked. In addition, portions of the signal peptides, called membrane-translocating sequences, may also be used to direct the intracellular import of a peptide or protein of interest. This may prove beneficial in gene therapy strategies in which it is desired to deliver a particular gene product to cells other than the cell in which it is produced. Signal sequences encoding signal peptides also find application in simplifying protein purification techniques. In such applications, the extracellular secretion of the desired protein greatly facilitates purification by reducing the number of undesired proteins from which the desired protein must be selected. Thus, there exists a need to identify and characterize the 5′ portions of the genes for secretory proteins which encode signal peptides.
Sequences coding for non-secreted proteins may also find application as therapeutics or diagnostics. In particular, such sequences may be used to determine whether an individual is likely to express a detectable phenotype, such as a disease, as a consequence of a mutation in the coding sequence for a non-secreted protein or for a secreted protein. In instances where the individual is at risk of suffering from a disease or other undesirable phenotype as a result of a mutation in such a coding sequence, the undesirable phenotype may be corrected by introducing a normal coding sequence using gene therapy. Alternatively, if the undesirable phenotype results from overexpression of the protein encoded by the coding sequence, expression of the protein may be reduced using antisense or triple helix based strategies.
The secreted or non-secreted human polypeptides encoded by the coding sequences may also be used as therapeutics by administering them directly to an individual having a condition, such as a disease, resulting from a mutation in the sequence encoding the polypeptide. In such an instance, the condition can be cured or ameliorated by administering the polypeptide to the individual.
In addition, the secreted or non-secreted human polypeptides or portions thereof may be used to generate antibodies useful in determining the tissue type or species of origin of a biological sample. The antibodies may also be used to determine the cellular localization of the secreted or non-secreted human polypeptides or the cellular localization of polypeptides which have been fused to the human polypeptides. In addition, the antibodies may also be used in immunoaffinity chromatography techniques to isolate, purify, or enrich the human polypeptide or a target polypeptide which has been fused to the human polypeptide.
Public information on the number of human genes for which the promoters and upstream regulatory regions have been identified and characterized is quite limited. In part, this may be due to the difficulty of isolating such regulatory sequences. Upstream regulatory sequences such as transcription factor binding sites are typically too short to be utilized as probes for isolating promoters from human genomic libraries. Recently, some approaches have been developed to isolate human promoters. One of them consists of making a CpG island library (Cross et al., Nature Genetics 6:236-244, 1994). The second consists of isolating human genomic DNA sequences containing SpeI binding sites by the use of SpeI binding protein (Mortlock et al., Genome Res. 6:327-335, 1996). Both of these approaches have their limits due to a lack of specificity or because they are not universally applicable, since only a limited number of promoters have either a CpG island or a SpeI recognition site and because SpeI binding sites are not specifically found in promoter regions. Thus, there exists a need to identify and systematically characterize the 5′ portions of the genes.
The present 5′ ESTs may be used to efficiently identify and isolate 5′ UTRs and upstream regulatory regions which control the location, developmental stage, rate, and quantity of protein synthesis, as well as the stability of the mRNA. Once identified and characterized, these regulatory regions may be utilized in gene therapy or protein purification schemes to obtain the desired amount and locations of protein synthesis or to inhibit, reduce, or prevent the synthesis of undesirable gene products.
In addition, ESTs containing the 5′ ends of protein genes may include sequences useful as probes for chromosome mapping and the identification of individuals. Thus, there is a need to identify and characterize the sequences upstream of the 5′ coding sequences of genes.
The present invention relates to purified, isolated, or enriched 5′ ESTs which include sequences derived from the authentic 5′ ends of their corresponding mRNAs. The term “corresponding mRNA” refers to the mRNA which was the template for the cDNA synthesis which produced the 5′ EST. These sequences will be referred to hereinafter as “5′ ESTs.” The present invention also includes purified, isolated or enriched nucleic acids comprising contigs assembled by determining a consensus sequence from a plurality of ESTs containing overlapping sequences. These contigs will be referred to herein as “consensus contigated ESTs.”
As used herein, the term “purified” does not require absolute purity; rather, it is intended as a relative definition. Individual 5′ EST clones isolated from a cDNA library have been conventionally purified to electrophoretic homogeneity. The sequences obtained from these clones could not be obtained directly either from the library or from total human DNA. The cDNA clones are not naturally occurring as such, but rather are obtained via manipulation of a partially purified naturally occurring substance (messenger RNA). The conversion of mRNA into a cDNA library involves the creation of a synthetic substance (cDNA) and pure individual cDNA clones can be isolated from the synthetic library by clonal selection. Thus, creating a cDNA library from messenger RNA and subsequently isolating individual clones from that library results in an approximately 10^4-10^6 fold purification of the native message. Purification of starting material or natural material to at least one order of magnitude, preferably two or three orders, and more preferably four or five orders of magnitude is expressly contemplated.
As used herein, the term “isolated” requires that the material be removed from its original environment (e.g., the natural environment if it is naturally occurring). For example, a naturally-occurring polynucleotide present in a living animal is not isolated, but the same polynucleotide, separated from some or all of the coexisting materials in the natural system, is isolated.
As used herein, the term “enriched” means that the 5′ EST is adjacent to “backbone” nucleic acid to which it is not adjacent in its natural environment. Additionally, to be “enriched” the 5′ ESTs will represent 5% or more of the number of nucleic acid inserts in a population of nucleic acid backbone molecules. Backbone molecules according to the present invention include nucleic acids such as expression vectors, self-replicating nucleic acids, viruses, integrating nucleic acids, and other vectors or nucleic acids used to maintain or manipulate a nucleic acid insert of interest. Preferably, the enriched 5′ ESTs represent 15% or more of the number of nucleic acid inserts in the population of recombinant backbone molecules. More preferably, the enriched 5′ ESTs represent 50% or more of the number of nucleic acid inserts in the population of recombinant backbone molecules. In a highly preferred embodiment, the enriched 5′ ESTs represent 90% or more of the number of nucleic acid inserts in the population of recombinant backbone molecules.
“Stringent”, “moderate,” and “low” hybridization conditions are as defined below.
The term “polypeptide” refers to a polymer of amino acids without regard to the length of the polymer; thus, peptides, oligopeptides, and proteins are included within the definition of polypeptide. This term also does not specify or exclude post-expression modifications of polypeptides; for example, polypeptides which include the covalent attachment of glycosyl groups, acetyl groups, phosphate groups, lipid groups and the like are expressly encompassed by the term polypeptide. Also included within the definition are polypeptides which contain one or more analogs of an amino acid (including, for example, non-naturally occurring amino acids, amino acids which only occur naturally in an unrelated biological system, modified amino acids from mammalian systems, etc.), polypeptides with substituted linkages, as well as other modifications known in the art, both naturally occurring and non-naturally occurring.
As used interchangeably herein, the terms “nucleic acids”, “oligonucleotides”, and “polynucleotides” include RNA, DNA, or RNA/DNA hybrid sequences of more than one nucleotide in either single chain or duplex form. The term “nucleotide” is used herein as an adjective to describe molecules comprising RNA, DNA, or RNA/DNA hybrid sequences of any length in single-stranded or duplex form. The term “nucleotide” is also used herein as a noun to refer to individual nucleotides or varieties of nucleotides, meaning a molecule, or individual unit in a larger nucleic acid molecule, comprising a purine or pyrimidine, a ribose or deoxyribose sugar moiety, and a phosphate group, or phosphodiester linkage in the case of nucleotides within an oligonucleotide or polynucleotide. The term “nucleotide” is also used herein to encompass “modified nucleotides” which comprise at least one of the following modifications: (a) an alternative linking group, (b) an analogous form of purine, (c) an analogous form of pyrimidine, or (d) an analogous sugar. For examples of analogous linking groups, purines, pyrimidines, and sugars, see, for example, PCT publication No. WO 95/04064. The polynucleotide sequences of the invention may be prepared by any known method, including synthetic, recombinant, ex vivo generation, or a combination thereof, as well as utilizing any purification methods known in the art.
The terms “base paired” and “Watson and Crick base paired” are used interchangeably herein to refer to nucleotides which can be hydrogen bonded to one another by virtue of their sequence identities in a manner like that found in double-helical DNA, with thymine or uracil residues linked to adenine residues by two hydrogen bonds and cytosine and guanine residues linked by three hydrogen bonds (See Stryer, L., Biochemistry, 4th edition, 1995).
The terms “complementary” or “complement thereof” are used herein to refer to the sequences of polynucleotides which are capable of forming Watson and Crick base pairing with another specified polynucleotide throughout the entirety of the complementary region. For the purpose of the present invention, a first polynucleotide is deemed to be complementary to a second polynucleotide when each base in the first polynucleotide is paired with its complementary base. Complementary bases are, generally, A and T (or A and U), or C and G. “Complement” is used herein as a synonym for “complementary polynucleotide”, “complementary nucleic acid” and “complementary nucleotide sequence”. These terms are applied to pairs of polynucleotides based solely upon their sequences and not any particular set of conditions under which the two polynucleotides would actually bind. Preferably, a “complementary” sequence is a sequence which has an A at each position where there is a T on the opposite strand, a T at each position where there is an A on the opposite strand, a G at each position where there is a C on the opposite strand and a C at each position where there is a G on the opposite strand.
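By way of illustration only, the sequence-based definition of complementarity given above might be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that the two strands are compared in antiparallel orientation (both written 5′ to 3′) and that only the unmodified bases A, C, G, T and U occur; neither assumption is stated in the definition itself.

```python
# Illustrative sketch only; all names here are hypothetical.
# Assumption: both sequences are written 5' to 3' and pair antiparallel.

_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def _as_dna(seq):
    """Upper-case the sequence and treat U as T so RNA and DNA compare alike."""
    return seq.upper().replace("U", "T")


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    dna = _as_dna(seq)
    # Complement each base, then reverse to restore 5'->3' orientation.
    return "".join(_PAIRS[base] for base in reversed(dna))


def is_complementary(first, second):
    """True when every base of `first` pairs with its partner in `second`."""
    return reverse_complement(first) == _as_dna(second)


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
    print(is_complementary("AUGC", "GCAT"))
```

Consistent with the definition, the check depends solely on the two sequences and not on any hybridization conditions.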
Thus, 5′ ESTs in cDNA libraries in which one or more 5′ ESTs make up 5% or more of the number of nucleic acid inserts in the backbone molecules are “enriched recombinant 5′ ESTs” as defined herein. Likewise, 5′ ESTs in a population of plasmids in which one or more 5′ ESTs of the present invention have been inserted such that they represent 5% or more of the number of inserts in the plasmid backbone are “enriched recombinant 5′ ESTs” as defined herein. However, 5′ ESTs in cDNA libraries in which 5′ ESTs constitute less than 5% of the number of nucleic acid inserts in the population of backbone molecules, such as libraries in which backbone molecules having a 5′ EST insert are extremely rare, are not “enriched recombinant 5′ ESTs.”
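As a non-limiting sketch, the percentage thresholds used in the definition of “enriched” (5%, 15%, 50% and 90% of the nucleic acid inserts) could be applied to insert counts as shown below. The function name and the returned labels are hypothetical shorthand and are not terms defined in this specification.

```python
# Illustrative sketch only; the function name and labels are hypothetical.

def enrichment_level(est_inserts, total_inserts):
    """Classify a population of backbone molecules by its share of 5' EST inserts.

    Integer arithmetic is used so that a share of exactly 5%, 15%, 50% or 90%
    falls on the "or more" side of each threshold.
    """
    if total_inserts <= 0 or not 0 <= est_inserts <= total_inserts:
        raise ValueError("counts must satisfy 0 <= est_inserts <= total_inserts, total > 0")
    percent_times_total = est_inserts * 100
    if percent_times_total >= 90 * total_inserts:
        return "highly preferred"
    if percent_times_total >= 50 * total_inserts:
        return "more preferred"
    if percent_times_total >= 15 * total_inserts:
        return "preferred"
    if percent_times_total >= 5 * total_inserts:
        return "enriched"
    return "not enriched"


if __name__ == "__main__":
    for count in (4, 5, 15, 50, 90):
        print(count, enrichment_level(count, 100))
```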
In some embodiments, the present invention relates to 5′ ESTs which are derived from genes encoding secreted proteins. As used herein, a “secreted” protein is one which, when expressed in a suitable host cell, is transported across or through a membrane, including transport as a result of signal peptides in its amino acid sequence. “Secreted” proteins include without limitation proteins secreted wholly (e.g. soluble proteins), or partially (e.g. receptors) from the cell in which they are expressed. “Secreted” proteins also include without limitation proteins which are transported across the membrane of the endoplasmic reticulum.
Such 5′ ESTs include nucleic acid sequences, called signal sequences, which encode signal peptides which direct the extracellular secretion of the proteins encoded by the genes from which the 5′ ESTs are derived. Generally, the signal peptides are located at the amino termini of secreted proteins.
Secreted proteins are translated by ribosomes associated with the “rough” endoplasmic reticulum. Generally, secreted proteins are co-translationally transferred to the membrane of the endoplasmic reticulum. Association of the ribosome with the endoplasmic reticulum during translation of secreted proteins is mediated by the signal peptide. The signal peptide is typically cleaved following its co-translational entry into the endoplasmic reticulum. After delivery to the endoplasmic reticulum, secreted proteins may proceed through the Golgi apparatus. In the Golgi apparatus, the proteins may undergo post-translational modification before entering secretory vesicles which transport them across the cell membrane.
The 5′ ESTs of the present invention have several important applications. For example, they may be used to obtain and express cDNA clones which include the full protein coding sequences of the corresponding gene products, including the authentic translation start sites derived from the 5′ ends of the coding sequences of the mRNAs from which the 5′ ESTs are derived. These cDNAs will be referred to hereinafter as “full-length cDNAs.” These cDNAs may also include DNA derived from mRNA sequences upstream of the translation start site. The full-length cDNA sequences may be used to express the proteins corresponding to the 5′ ESTs. As discussed above, secreted proteins and non-secreted proteins may be therapeutically important. Thus, the proteins expressed from the cDNAs may be useful in treating or controlling a variety of human conditions. The 5′ ESTs may also be used to obtain the corresponding genomic DNA. The term “corresponding genomic DNA” refers to the genomic DNA which encodes the mRNA from which the 5′ EST was derived.
Alternatively, the 5′ ESTs may be used to obtain and express extended cDNAs encoding portions of the protein. In the case of secreted proteins, the portions may comprise the signal peptides of the secreted proteins or the mature proteins generated when the signal peptide is cleaved off.
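Purely as an illustration of the two portions just mentioned, the following Python sketch divides a precursor amino acid sequence into a signal peptide and a mature protein. It assumes the length of the signal peptide is already known; the function name is hypothetical and the example sequence is made up for demonstration and does not correspond to any sequence of the invention.

```python
# Illustrative sketch only; the function name and example sequence are hypothetical.

def split_precursor(precursor, signal_length):
    """Split a precursor polypeptide into (signal peptide, mature protein).

    `signal_length` is the number of amino-terminal residues removed on
    cleavage; it is assumed to be known in advance and is not predicted here.
    """
    if not 0 < signal_length < len(precursor):
        raise ValueError("signal_length must fall strictly inside the precursor")
    return precursor[:signal_length], precursor[signal_length:]


if __name__ == "__main__":
    signal, mature = split_precursor("MKTAAAQPLS", 4)  # made-up sequence
    print(signal, mature)
```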
The present invention includes isolated, purified, or enriched “EST-related nucleic acids.” The terms “isolated”, “purified” or “enriched” have the meanings provided above. As used herein, the term “EST-related nucleic acids” means the nucleic acids of SEQ ID NOs: 24-4100 and 8178-36681, extended cDNAs obtainable using the nucleic acids of SEQ ID NOs: 24-4100 and 8178-36681, full-length cDNAs obtainable using the nucleic acids of SEQ ID NOs: 24-4100 and 8178-36681 or genomic DNAs obtainable using the nucleic acids of SEQ ID NOs: 24-4100 and 8178-36681. The present invention also includes the sequences complementary to the EST-related nucleic acids.
The present invention also includes isolated, purified, or enriched “fragments of EST-related nucleic acids.” The terms “isolated”, “purified” and “enriched” have the meanings described above. As used herein the term “fragments of EST-related nucleic acids” means fragments comprising at least 10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500, or 1000 consecutive nucleotides of the EST-related nucleic acids to the extent that fragments of these lengths are consistent with the lengths of the particular EST-related nucleic acids being referred to. The present invention also includes the sequences complementary to the fragments of the EST-related nucleic acids.
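For illustration, the following Python sketch shows one way the fragment lengths listed above might be filtered against the length of a given sequence, and how every run of a chosen number of consecutive nucleotides could be enumerated. The function names are hypothetical, and the sliding-window enumeration is merely one reading of “consecutive nucleotides”.

```python
# Illustrative sketch only; function names are hypothetical.

FRAGMENT_LENGTHS = (10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40,
                    50, 75, 100, 200, 300, 500, 1000)


def applicable_lengths(seq):
    """Keep only the listed lengths that fit within the sequence."""
    return [n for n in FRAGMENT_LENGTHS if n <= len(seq)]


def consecutive_fragments(seq, n):
    """Yield every run of exactly n consecutive nucleotides of seq."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields nothing, because the range is empty.
    for start in range(len(seq) - n + 1):
        yield seq[start:start + n]


if __name__ == "__main__":
    example = "ACGTACGTACGTACGTACGTACGTACG"  # made-up 27-nucleotide sequence
    print(applicable_lengths(example))
    print(list(consecutive_fragments("ACGTA", 3)))
```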
The present invention also includes isolated, purified, or enriched “positional segments of EST-related nucleic acids.” The terms “isolated”, “purified”, or “enriched” have the meanings provided above. As used herein, the term “positional segments of EST-related nucleic acids” includes segments comprising nucleotides 1-25, 26-50, 51-75, 76-100, 101-125, 126-150, 151-175, 176-200, 201-225, 226-250, 251-300, 301-325, 326-350, 351-375, 376-400, 401-425, 426-450, 451-475, 476-500, 501-525, 526-550, 551-575, 576-600 and 601-the terminal nucleotide of the EST-related nucleic acids to the extent that such nucleotide positions are consistent with the lengths of the particular EST-related nucleic acids being referred to. The term “positional segments of EST-related nucleic acids” also includes segments comprising nucleotides 1-50, 51-100, 101-150, 151-200, 201-250, 251-300, 301-350, 351-400, 401-450, 451-500, 501-550, 551-600 or 601-the terminal nucleotide of the EST-related nucleic acids to the extent that such nucleotide positions are consistent with the lengths of the particular EST-related nucleic acids being referred to. The term “positional segments of EST-related nucleic acids” also includes segments comprising nucleotides 1-100, 101-200, 201-300, 301-400, 401-500, 501-600, or 601-the terminal nucleotide of the EST-related nucleic acids to the extent that such nucleotide positions are consistent with the lengths of the particular EST-related nucleic acids being referred to. In addition, the term “positional segments of EST-related nucleic acids” includes segments comprising nucleotides 1-200, 201-400, 401-600, or 601-the terminal nucleotide of the EST-related nucleic acids to the extent that such nucleotide positions are consistent with the lengths of the particular EST-related nucleic acids being referred to.
The present invention also includes the sequences complementary to the positional segments of EST-related nucleic acids.
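As an illustrative sketch only, positional segments of this kind might be computed in Python as shown below. The sketch assumes evenly spaced windows of a fixed width up to nucleotide 600, followed by a final segment running from nucleotide 601 to the terminal nucleotide, with positions counted from 1 and inclusive at both ends. The function names are hypothetical, and a window is dropped when the sequence is too short to contain it in full.

```python
# Illustrative sketch only; function names are hypothetical.
# Assumption: positions are 1-based and inclusive, and windows are evenly spaced.

def positional_segments(seq_len, width, last_fixed=600):
    """Return (start, end) pairs for fixed-width windows plus the tail segment."""
    if width <= 0 or last_fixed % width != 0:
        raise ValueError("width must be positive and divide last_fixed evenly")
    segments = []
    start = 1
    while start <= min(seq_len, last_fixed):
        end = start + width - 1
        if end > seq_len:
            break  # window not consistent with the length of this sequence
        segments.append((start, end))
        start = end + 1
    if seq_len > last_fixed:
        # Tail segment: from position 601 to the terminal nucleotide.
        segments.append((last_fixed + 1, seq_len))
    return segments


def extract_segment(seq, segment):
    """Slice a 1-based inclusive (start, end) segment out of seq."""
    start, end = segment
    return seq[start - 1:end]


if __name__ == "__main__":
    print(positional_segments(130, 50))
    print(positional_segments(650, 200))
    print(extract_segment("ACGTACGT", (3, 5)))
```

The same windowing applies to amino acid positions if a different `last_fixed` value is supplied.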
The present invention also includes isolated, purified, or enriched “fragments of positional segments of EST-related nucleic acids.” The terms “isolated”, “purified”, or “enriched” have the meanings provided above. As used herein, the term “fragments of positional segments of EST-related nucleic acids” refers to fragments comprising at least 10, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 150, or 200 consecutive nucleotides of the positional segments of EST-related nucleic acids. The present invention also includes the sequences complementary to the fragments of positional segments of EST-related nucleic acids.
The present invention also includes isolated or purified “EST-related polypeptides.” The terms “isolated” or “purified” have the meanings provided above. As used herein, the term “EST-related polypeptides” means the polypeptides encoded by the EST-related nucleic acids, including the polypeptides of SEQ ID NOs: 4101-8177.
The present invention also includes isolated or purified “fragments of EST-related polypeptides.” The terms “isolated” or “purified” have the meanings provided above. As used herein, the term “fragments of EST-related polypeptides” means fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino acids of an EST-related polypeptide to the extent that fragments of these lengths are consistent with the lengths of the particular EST-related polypeptides being referred to.
The present invention also includes isolated or purified xe2x80x9cpositional segments of EST-related polypeptides.xe2x80x9d As used herein, the term xe2x80x9cpositional segments of EST-related polypeptidesxe2x80x9d includes polypeptides comprising amino acid residues 1-25, 26-50, 51-75, 76-100, 101-125, 126-150, 151-175, 176-200, or 201-the C-terminal amino acid of the EST-related polypeptides to the extent that such amino acid residues are consistent with the lengths of the particular EST-related polypeptides being referred to. The term xe2x80x9cpositional segments of EST-related polypeptides also includes segments comprising amino acid residues 1-50, 51-100, 101-150, 151-200 or 201-the C-terminal amino acid of the EST-related polypeptides to the extent that such amino acid residues are consistent with the lengths of the particular EST-related polypeptides being referred to. The term xe2x80x9cpositional segments of EST-related polypeptidesxe2x80x9d also includes segments comprising amino acids 1-100 or 101-200 of the EST-related polypeptides to the extent that such amino acid residues are consistent with the lengths of particular EST-related polypeptides being referred to. In addition, the term xe2x80x9cpositional segments of EST-related polypeptidesxe2x80x9d includes segments comprising amino acid residues 1-200 or 201-the C-terminal amino acid of the EST-related polypeptides to the extent that amino acid residues are consistent with the lengths of the particular EST related polypeptides being referred to.
The present invention also includes isolated or purified xe2x80x9cfragments of positional segments of EST-related polypeptides.xe2x80x9d The terms xe2x80x9cisolatedxe2x80x9d or xe2x80x9cpurifiedxe2x80x9d have the meanings provided above. As used herein, the term xe2x80x9cfragments of positional segments of EST-related polypeptidesxe2x80x9d means fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino acids of positional segments of EST-related polypeptides to the extent that fragments of these lengths are consistent with the lengths of the particular EST-related polypeptides being referred to.
The present invention also includes antibodies which specifically recognize the EST-related polypeptides, fragments of EST-related polypeptides, positional segments of EST-related polypeptides, or fragments of positional segments of EST-related polypeptides. In the case of secreted proteins, such as those of SEQ ID NOs: 7798-7888, antibodies which specifically recognize the mature protein generated when the signal peptide is cleaved may also be obtained as described below. Similarly, antibodies which specifically recognize the signal peptides of SEQ ID NOs: 4101-4729 or 7798-7888 may also be obtained.
In some embodiments and in the case of secreted proteins, the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids include a signal sequence. In other embodiments, the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids may include the full coding sequence for the protein or, in the case of secreted proteins, the full coding sequence of the mature protein (i.e. the protein generated when the signal polypeptide is cleaved off). In addition, the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids may include regulatory regions upstream of the translation start site or downstream of the stop codon which control the amount, location, or developmental stage of gene expression.
As discussed above, both secreted and non-secreted human proteins may be therapeutically important. Thus, the proteins expressed from the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids may be useful in treating or controlling a variety of human conditions.
The EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids may be used in forensic procedures to identify individuals or in diagnostic procedures to identify individuals having genetic diseases resulting from abnormal gene expression. In addition, the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids are useful for constructing a high resolution map of the human chromosomes.
The present invention also relates to secretion vectors capable of directing the secretion of a protein of interest. Such vectors may be used in gene therapy strategies in which it is desired to produce a gene product in one cell which is to be delivered to another location in the body. Secretion vectors may also facilitate the purification of desired proteins.
The present invention also relates to expression vectors capable of directing the expression of an inserted gene in a desired spatial or temporal manner or at a desired level. Such vectors may include sequences upstream of the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids, such as promoters or upstream regulatory sequences.
The present invention also comprises fusion vectors for making chimeric polypeptides comprising a first polypeptide and a second polypeptide. Such vectors are useful for determining the cellular localization of the chimeric polypeptides or for isolating, purifying or enriching the chimeric polypeptides.
The EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids may also be used for gene therapy to control or treat genetic diseases. In the case of secreted proteins, signal peptides may be fused to heterologous proteins to direct their extracellular secretion.
Bacterial clones containing Bluescript plasmids having inserts containing the sequence of the non-clustered 5′ ESTs are presently stored at 80° C. in 4% (v/v) glycerol in the inventor's laboratories under the designations. The non-clustered 5′ ESTs are those which comprise a single EST from a single tissue in the listing of Table II. The inserts may be recovered from the stored materials by growing the appropriate clones on a suitable medium. The Bluescript DNA can then be isolated using plasmid isolation procedures familiar to those skilled in the art such as alkaline lysis minipreps or large scale alkaline lysis plasmid isolation procedures. If desired, the plasmid DNA may be further enriched by centrifugation on a cesium chloride gradient, size exclusion chromatography, or anion exchange chromatography. The plasmid DNA obtained using these procedures may then be manipulated using standard cloning techniques familiar to those skilled in the art. Alternatively, a PCR can be done with primers designed at both ends of the inserted EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids. The PCR product which corresponds to the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids can then be manipulated using standard cloning techniques familiar to those skilled in the art.
One embodiment of the present invention is a purified nucleic acid comprising a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and sequences complementary to the sequences of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681.
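As an illustration of what a sequence complementary to a listed sequence looks like, the following minimal sketch computes the complementary strand written 5′ to 3′ (the reverse complement). The function name `reverse_complement` is hypothetical. The sketch assumes that sequences contain only a, c, g, t and the ambiguity symbol n, which is left unchanged.

```python
# Translation table mapping each base to its complement; 'n' (any base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the complementary strand of seq, written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGCn"))
```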
Another embodiment of the present invention is a purified nucleic acid comprising at least 10 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and sequences complementary to the sequences of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681.
Another embodiment of the present invention is a purified nucleic acid comprising at least 15 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and sequences complementary to the sequences of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681.
A further embodiment of the present invention is a purified nucleic acid comprising the coding sequence of a sequence selected from the group consisting of SEQ ID NOs: 24-4100.
Yet another embodiment of the present invention is a purified nucleic acid comprising the full coding sequence of a sequence selected from the group consisting of SEQ ID NOs: 3721-3811 wherein the full coding sequence comprises the sequence encoding the signal peptide and the sequence encoding the mature protein.
Still another embodiment of the present invention is a purified nucleic acid comprising a contiguous span of a sequence selected from the group consisting of SEQ ID NOs: 3721-3811 which encodes the mature protein.
Another embodiment of the present invention is a purified nucleic acid comprising a contiguous span of a sequence selected from the group consisting of SEQ ID NOs: 24-652 and 3721-3811 which encodes the signal peptide.
Another embodiment of the present invention is a purified nucleic acid encoding a polypeptide comprising a sequence selected from the group consisting of the sequences of SEQ ID NOs: 4101-8177.
Another embodiment of the present invention is a purified nucleic acid encoding a polypeptide comprising a sequence selected from the group consisting of the sequences of SEQ ID NOs: 7798-7888.
Another embodiment of the present invention is a purified nucleic acid encoding a polypeptide comprising a mature protein included in a sequence selected from the group consisting of the sequences of SEQ ID NOs: 7798-7888.
Another embodiment of the present invention is a purified nucleic acid encoding a polypeptide comprising a signal peptide included in a sequence selected from the group consisting of the sequences of SEQ ID NOs: 4101-4729 and 7798-7888.
Another embodiment of the present invention is a purified nucleic acid at least 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500 or 1000 nucleotides in length which hybridizes under stringent conditions to a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and sequences complementary to the sequences of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681.
Another embodiment of the present invention is a purified or isolated polypeptide comprising a sequence selected from the group consisting of the sequences of SEQ ID NOs: 4101-8177.
Another embodiment of the present invention is a purified or isolated polypeptide comprising a sequence selected from the group consisting of SEQ ID NOs: 7798-7888.
Another embodiment of the present invention is a purified or isolated polypeptide comprising a mature protein of a polypeptide selected from the group consisting of SEQ ID NOs: 7798-7888.
Another embodiment of the present invention is a purified or isolated polypeptide comprising a signal peptide of a sequence selected from the group consisting of the polypeptides of SEQ ID NOs: 4101-4729 and 7798-7888.
Another embodiment of the present invention is a purified or isolated polypeptide comprising at least 10 consecutive amino acids of a sequence selected from the group consisting of the sequences of SEQ ID NOs: 4101-8177.
Another embodiment of the present invention is a method of making a cDNA comprising the steps of contacting a collection of mRNA molecules from human cells with a primer comprising at least 15 consecutive nucleotides of a sequence selected from the group consisting of the sequences complementary to SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681, hybridizing said primer to an mRNA in said collection that encodes said protein, reverse transcribing said hybridized primer to make a first cDNA strand from said mRNA, making a second cDNA strand complementary to said first cDNA strand, and isolating the resulting cDNA encoding said protein comprising said first cDNA strand and said second cDNA strand.
Another embodiment of the present invention is a purified cDNA obtainable by the method of the preceding paragraph.
In one aspect of this embodiment, the cDNA encodes at least a portion of a human polypeptide.
Another embodiment of the present invention is a method of making a cDNA comprising the steps of obtaining a cDNA comprising a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681, contacting said cDNA with a detectable probe comprising at least 15 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and the sequences complementary to SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 under conditions which permit said probe to hybridize to said cDNA, identifying a cDNA which hybridizes to said detectable probe, and isolating said cDNA which hybridizes to said probe.
Another embodiment of the present invention is a purified cDNA obtainable by the method of the preceding paragraph.
In one aspect of this embodiment, the cDNA encodes at least a portion of a human polypeptide.
Another embodiment of the present invention is a method of making a cDNA comprising the steps of contacting a collection of mRNA molecules from human cells with a first primer capable of hybridizing to the polyA tail of said mRNA, hybridizing said first primer to said polyA tail, reverse transcribing said mRNA to make a first cDNA strand, making a second cDNA strand complementary to said first cDNA strand using at least one primer comprising at least 15 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681, and isolating the resulting cDNA comprising said first cDNA strand and said second cDNA strand.
Another embodiment of the present invention is a purified cDNA obtainable by the method of the preceding paragraph.
In one aspect of this embodiment, said cDNA encodes at least a portion of a human polypeptide.
In another aspect of the preceding method the second cDNA strand is made by contacting said first cDNA strand with a first pair of primers, said first pair of primers comprising a second primer comprising at least 15 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and a third primer having a sequence therein which is included within the sequence of said first primer, performing a first polymerase chain reaction with said first pair of primers to generate a first PCR product, contacting said first PCR product with a second pair of primers, said second pair of primers comprising a fourth primer, said fourth primer comprising at least 15 consecutive nucleotides of said sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681, and a fifth primer, wherein said fourth and fifth hybridize to sequences within said first PCR product, and performing a second polymerase chain reaction, thereby generating a second PCR product.
One aspect of this embodiment is a purified cDNA obtainable by the method of the preceding paragraph.
In another aspect of this embodiment, said cDNA encodes at least a portion of a human polypeptide.
Alternatively, the second cDNA strand may be made by contacting said first cDNA strand with a second primer comprising at least 15 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681, hybridizing said second primer to said first strand cDNA, and extending said hybridized second primer to generate said second cDNA strand.
One aspect of the above embodiment is a purified cDNA obtainable by the method of the preceding paragraph.
In a further aspect of this embodiment said cDNA encodes at least a portion of a human polypeptide.
Another embodiment of the present invention is a method of making a polypeptide comprising the steps of obtaining a cDNA which encodes a polypeptide encoded by a nucleic acid comprising a sequence selected from the group consisting of SEQ ID NOs: 24-4100 or a cDNA which encodes a polypeptide comprising at least 10 consecutive amino acids of a polypeptide encoded by a sequence selected from the group consisting of SEQ ID NOs: 24-4100, inserting said cDNA in an expression vector such that said cDNA is operably linked to a promoter, introducing said expression vector into a host cell whereby said host cell produces the protein encoded by said cDNA, and isolating said protein.
Another aspect of this embodiment is an isolated protein obtainable by the method of the preceding paragraph.
Another embodiment of the present invention is a method of obtaining a promoter DNA comprising the steps of obtaining genomic DNA located upstream of a nucleic acid comprising a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and the sequences complementary to the sequences of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681, screening said genomic DNA to identify a promoter capable of directing transcription initiation, and isolating said DNA comprising said identified promoter.
In one aspect of this embodiment, said obtaining step comprises walking from genomic DNA comprising a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and the sequences complementary to SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681. In another aspect of this embodiment, said screening step comprises inserting genomic DNA located upstream of a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and the sequences complementary to SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 into a promoter reporter vector. For example, said screening step may comprise identifying motifs in genomic DNA located upstream of a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and the sequences complementary to SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 which are transcription factor binding sites or transcription start sites.
Another embodiment of the present invention is an isolated promoter obtainable by the method of the paragraph above.
Another embodiment of the present invention is the inclusion of at least one sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681, the sequences complementary to the sequences of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and fragments comprising at least 15 consecutive nucleotides of said sequence in an array of discrete ESTs or fragments thereof of at least 15 nucleotides in length. In some aspects of this embodiment, the array includes at least two sequences selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681, the sequences complementary to the sequences of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681, and fragments comprising at least 15 consecutive nucleotides of said sequences. In another aspect of this embodiment the array includes at least five sequences selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681, the sequences complementary to the sequences of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and fragments comprising at least 15 consecutive nucleotides of said sequences.
Another embodiment of the present invention is an enriched population of recombinant nucleic acids, said recombinant nucleic acids comprising an insert nucleic acid and a backbone nucleic acid, wherein at least 5% of said insert nucleic acids in said population comprise a sequence selected from the group consisting of SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681 and the sequences complementary to SEQ ID NOs: 24-4100 and SEQ ID NOs: 8178-36681.
Another embodiment of the present invention is a purified or isolated antibody capable of specifically binding to a polypeptide comprising a sequence selected from the group consisting of SEQ ID NOs: 4101-8177.
Another embodiment of the present invention is a purified or isolated antibody capable of specifically binding to a polypeptide comprising at least 10 consecutive amino acids of a sequence selected from the group consisting of SEQ ID NOs: 4101-8177.
Another embodiment of the present invention is an antibody composition capable of selectively binding to an epitope-containing fragment of a polypeptide comprising a contiguous span of at least 8 amino acids of any of SEQ ID NOs: 4101-8177, wherein said antibody is polyclonal or monoclonal.
Another embodiment of the present invention is a computer readable medium having stored thereon a sequence selected from the group consisting of a nucleic acid code of SEQ ID NOs: 24-4100 and 8178-36681 and a polypeptide code of SEQ ID NOs: 4101-8177.
Another embodiment of the present invention is a computer system comprising a processor and a data storage device wherein said data storage device has stored thereon a sequence selected from the group consisting of a nucleic acid code of SEQ ID NOs: 24-4100 and 8178-36681 and a polypeptide code of SEQ ID NOs: 4101-8177. In one aspect of this embodiment the computer system further comprises a sequence comparer and a data storage device having reference sequences stored thereon. For example, the sequence comparer may comprise a computer program which indicates polymorphisms. In another aspect of this embodiment, the computer system further comprises an identifier which identifies features in said sequence.
Another embodiment of the present invention is a method for comparing a first sequence to a reference sequence wherein said first sequence is selected from the group consisting of a nucleic acid code of SEQ ID NOs: 24-4100 and 8178-36681 and a polypeptide code of SEQ ID NOs: 4101-8177 comprising the steps of reading said first sequence and said reference sequence through use of a computer program which compares sequences and determining differences between said first sequence and said reference sequence with said computer program. In some aspects of this embodiment, said step of determining differences between the first sequence and the reference sequence comprises identifying polymorphisms.
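The comparison step can be sketched as a position-by-position check. This is only an illustration, not the sequence comparer of the invention: the function name `find_differences` is hypothetical, and the sketch assumes that the two sequences are already aligned and of equal length. A practical sequence comparer would first align the sequences.

```python
def find_differences(first, reference):
    """List positions where two pre-aligned, equal-length sequences differ.

    Returns (position, reference_symbol, first_symbol) tuples with
    1-based positions; comparison is case-insensitive.
    """
    if len(first) != len(reference):
        raise ValueError("this sketch assumes pre-aligned sequences of equal length")
    differences = []
    for index, (f, r) in enumerate(zip(first.upper(), reference.upper()), start=1):
        if f != r:
            differences.append((index, r, f))
    return differences


# A single-base difference at position 4 (reference A, first sequence C).
print(find_differences("ATGCCT", "ATGACT"))
```

Each reported position is a candidate polymorphism. Whether a difference reflects a true polymorphism or a sequencing error cannot be decided from the comparison alone.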
Another embodiment of the present invention is a method for identifying a feature in a sequence selected from the group consisting of a nucleic acid code of SEQ ID NOs: 24-4100 and 8178-36681 and a polypeptide code of SEQ ID NOs: 4101-8177 comprising the steps of reading said sequence through the use of a computer program which identifies features in sequences and identifying features in said sequence with said computer program.
Another embodiment of the present invention is a vector comprising a nucleic acid according to any one of the nucleic acids described above.
Another embodiment of the present invention is a host cell containing the above vector.
Another embodiment of the present invention is a method of making any of the nucleic acids described above comprising the steps of introducing said nucleic acid into a host cell such that said nucleic acid is present in multiple copies in each host cell and isolating said nucleic acid from said host cell.
Another embodiment of the present invention is a method of making a nucleic acid of any of the nucleic acids described above comprising the step of sequentially linking together the nucleotides in said nucleic acids.
Another embodiment of the present invention is a method of making any of the polypeptides described above wherein said polypeptide is 150 amino acids in length or less comprising the step of sequentially linking together the amino acids in said polypeptide.
Another embodiment of the present invention is a method of making any of the polypeptides described above wherein said polypeptide is 120 amino acids in length or less comprising the step of sequentially linking together the amino acids in said polypeptide.
SEQ ID NOs: 1, 3, 5, 7, 9, 11, and 13 are full-length cDNAs prepared using the methods described herein.
SEQ ID NOs: 2, 4, 6, 8, 10, 12, and 14 are the polypeptides encoded by the nucleic acids of SEQ ID NOs: 1, 3, 5, 7, 9, 11, and 13.
SEQ ID NOs: 15, 16, 18, 19, 21 and 22 are primers whose use is described in the specification.
SEQ ID NOs: 17, 20, and 23 are the sequences of nucleic acids containing transcription factor binding sites which were obtained as described below.
SEQ ID NOs: 24-652 are nucleic acids having an incomplete ORF which encodes a signal peptide. As used herein, an “incomplete ORF” is an open reading frame in which a start codon has been identified but no stop codon has been identified. The locations of the incomplete ORFs and sequences encoding signal peptides are listed in the accompanying Sequence Listing. In addition, the von Heijne score of the signal peptide computed as described below is listed as the “score” in the accompanying Sequence Listing. The sequence of the signal peptide is listed as “seq” in the accompanying Sequence Listing. The “/” in the signal peptide sequence indicates the location where proteolytic cleavage of the signal peptide occurs to generate a mature protein.
SEQ ID NOs: 653-3720 are nucleic acids having an incomplete ORF in which no sequence encoding a signal peptide has been identified to date. However, it remains possible that subsequent analysis will identify a sequence encoding a signal peptide in these nucleic acids. The locations of the incomplete ORFs are listed in the accompanying Sequence Listing.
SEQ ID NOs: 3721-3811 are nucleic acids having a complete ORF which encodes a signal peptide. As used herein, a “complete ORF” is an open reading frame in which a start codon and a stop codon have been identified. The locations of the complete ORFs and sequences encoding signal peptides are listed in the accompanying Sequence Listing. In addition, the von Heijne score of the signal peptide computed as described below is listed as the “score” in the accompanying Sequence Listing. The sequence of the signal peptide is listed as “seq” in the accompanying Sequence Listing. The “/” in the signal peptide sequence indicates the location where proteolytic cleavage of the signal peptide occurs to generate a mature protein.
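The distinction between a complete ORF and an incomplete ORF can be illustrated as follows. This is a deliberately simplified sketch and not the procedure used to annotate the Sequence Listing. The function name `classify_orf` is hypothetical, and the sketch assumes that the ORF begins at the first ATG in the sequence and uses the standard stop codons TAA, TAG and TGA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_orf(seq):
    """Classify the reading frame that starts at the first ATG.

    Returns (kind, start, end) with 1-based inclusive coordinates, where
    kind is "complete" if an in-frame stop codon follows the start codon
    and "incomplete" if the sequence ends before a stop codon is reached.
    Returns None when no ATG is present.
    """
    seq = seq.upper()
    start = seq.find("ATG")
    if start == -1:
        return None
    # Walk the frame codon by codon, beginning at the start codon.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return ("complete", start + 1, i + 3)
    return ("incomplete", start + 1, len(seq))


print(classify_orf("GGATGAAATTTTAGCC"))  # start and stop codon both present
print(classify_orf("CCATGAAATTT"))       # start codon only
```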
SEQ ID NOs: 3812-4100 are nucleic acids having a complete ORF in which no sequence encoding a signal peptide has been identified to date. However, it remains possible that subsequent analysis will identify a sequence encoding a signal peptide in these nucleic acids. The locations of the complete ORFs are listed in the accompanying Sequence Listing.
SEQ ID NOs: 4101-4729 are “incomplete polypeptide sequences” which include a signal peptide. “Incomplete polypeptide sequences” are polypeptide sequences encoded by nucleic acids in which a start codon has been identified but no stop codon has been identified. These polypeptides are encoded by the nucleic acids of SEQ ID NOs: 24-652. The location of the signal peptide is listed in the accompanying Sequence Listing. In addition, the von Heijne score of the signal peptide computed as described below is listed as the “score” in the accompanying Sequence Listing. The sequence of the signal peptide is listed as “seq” in the accompanying Sequence Listing. The “/” in the signal peptide sequence indicates the location where proteolytic cleavage of the signal peptide occurs to generate a mature protein.
SEQ ID NOs: 4730-7797 are incomplete polypeptide sequences in which no signal peptide has been identified to date. However, it remains possible that subsequent analysis will identify a signal peptide in these polypeptides. These polypeptides are encoded by the nucleic acids of SEQ ID NOs: 653-3720.
SEQ ID NOs: 7798-7888 are “complete polypeptide sequences” which include a signal peptide. “Complete polypeptide sequences” are polypeptide sequences encoded by nucleic acids in which a start codon and a stop codon have been identified. These polypeptides are encoded by the nucleic acids of SEQ ID NOs: 3721-3811. The location of the signal peptide is listed in the accompanying Sequence Listing. In addition, the von Heijne score of the signal peptide computed as described below is listed as the “score” in the accompanying Sequence Listing. The sequence of the signal peptide is listed as “seq” in the accompanying Sequence Listing. The “/” in the signal peptide sequence indicates the location where proteolytic cleavage of the signal peptide occurs to generate a mature protein.
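A “seq” entry that marks the cleavage site with “/” can be split into its two parts as sketched below. The function name `split_signal_peptide` and the example peptide are hypothetical and are not taken from the Sequence Listing. The sketch assumes exactly one “/” per entry.

```python
def split_signal_peptide(seq_field):
    """Split a signal peptide entry at the '/' cleavage marker.

    Returns (signal_peptide, mature_start, cleavage_position), where
    cleavage_position is the number of residues removed by cleavage.
    """
    signal, marker, mature_start = seq_field.partition("/")
    if not marker or "/" in mature_start:
        raise ValueError("expected exactly one '/' cleavage marker")
    return signal, mature_start, len(signal)


# Hypothetical entry: 14-residue signal peptide, mature protein begins with QE.
print(split_signal_peptide("MKTLLLAVLALCSA/QE"))
```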
SEQ ID NOs: 7889-8177 are complete polypeptide sequences in which no signal peptide has been identified to date. However, it remains possible that subsequent analysis will identify a signal peptide in these polypeptides. These polypeptides are encoded by the nucleic acids of SEQ ID NOs: 3812-4100.
SEQ ID NOs: 8178-36681 are nucleic acid sequences in which no open reading frame has been conclusively identified to date. However, it remains possible that subsequent analysis will identify an open reading frame in these nucleic acids.
In the accompanying Sequence Listing, all instances of the symbol “n” in the nucleic acid sequences mean that the nucleotide can be adenine, guanine, cytosine or thymine. In some instances the polypeptide sequences in the Sequence Listing contain the symbol “Xaa.” These “Xaa” symbols indicate either (1) a residue which cannot be identified because of nucleotide sequence ambiguity or (2) a stop codon in the determined sequence where applicants believe one should not exist (if the sequence were determined more accurately). In some instances, several possible identities of the unknown amino acids may be suggested by the genetic code.
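Because “n” can stand for any of the four nucleotides, software that compares a listed sequence with a fully determined sequence has to treat “n” as a wildcard. The following minimal sketch shows one way to do so. The function name `is_compatible` is hypothetical, and the sketch assumes that both sequences have the same length and that only the listed sequence contains “n”.

```python
def is_compatible(listed, candidate):
    """Return True if candidate is consistent with listed.

    listed    -- sequence from the listing; may contain 'n' (any base)
    candidate -- fully determined sequence of the same length
    """
    if len(listed) != len(candidate):
        return False
    # 'n' matches any base; every other position must match exactly.
    return all(
        l == "N" or l == c
        for l, c in zip(listed.upper(), candidate.upper())
    )


print(is_compatible("ATnGC", "ATTGC"))  # the 'n' position accepts T
print(is_compatible("ATnGC", "ATTGA"))  # mismatch at a determined position
```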