The estimated 50,000-100,000 genes scattered along the human chromosomes offer tremendous promise for the understanding, diagnosis, and treatment of human diseases. In addition, probes capable of specifically hybridizing to loci distributed throughout the human genome find applications in the construction of high resolution chromosome maps and in the identification of individuals.
In the past, the characterization of even a single human gene was a painstaking process, requiring years of effort. Recent developments in the areas of cloning vectors, DNA sequencing, and computer technology have merged to greatly accelerate the rate at which human genes can be isolated, sequenced, mapped, and characterized.
Currently, two different approaches are being pursued for identifying and characterizing the genes distributed along the human genome. In one approach, large fragments of genomic DNA are isolated, cloned, and sequenced. Potential open reading frames in these genomic sequences are identified using bioinformatics software. However, this approach entails sequencing large stretches of human DNA which do not encode proteins in order to find the protein encoding sequences scattered throughout the genome. In addition to requiring extensive sequencing, the bioinformatics software may mischaracterize the genomic sequences obtained, i.e., labeling non-coding DNA as coding DNA and vice versa.
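By way of illustration only, the open reading frame search mentioned above can be sketched as a naive scan for ATG-to-stop runs in the three forward reading frames. The function name `find_orfs` and the `min_codons` cutoff are assumptions made for this sketch and do not describe any particular bioinformatics package. Because every sufficiently long ATG-to-stop run is reported, whether or not it is actually translated, the sketch also shows how non-coding DNA can be labeled as coding.

```python
# Naive forward-strand ORF scan; names and the length cutoff are illustrative assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) indices of ATG...stop runs in the three forward frames.

    `end` is exclusive and includes the stop codon. An ORF is kept only if it
    holds at least `min_codons` codons before the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # resume searching after the stop codon
    return orfs


example = "CC" + "ATG" + "GCT" * 3 + "TAA" + "GG"
short_orfs = find_orfs(example, min_codons=2)
```

A scan of this kind has no notion of introns, splicing, or promoter context, which is one reason predictions made from genomic sequence alone may be unreliable.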
An alternative approach takes a more direct route to identifying and characterizing human genes. In this approach, complementary DNAs (cDNAs) are synthesized from isolated messenger RNAs (mRNAs) which encode human proteins. Using this approach, sequencing is only performed on DNA which is derived from protein coding portions of the genome. Often, only short stretches of the cDNAs are sequenced to obtain sequences called expressed sequence tags (ESTs). The ESTs may then be used to isolate or purify extended cDNAs which include sequences adjacent to the EST sequences. The extended cDNAs may contain all of the sequence of the EST which was used to obtain them or only a portion of the sequence of the EST which was used to obtain them. In addition, the extended cDNAs may contain the full coding sequence of the gene from which the EST was derived or, alternatively, the extended cDNAs may include portions of the coding sequence of the gene from which the EST was derived. It will be appreciated that there may be several extended cDNAs which include the EST sequence as a result of alternate splicing or the activity of alternative promoters. Alternatively, ESTs having partially overlapping sequences may be identified and contigs comprising the consensus sequences of the overlapping ESTs may be identified.
In the past, these short EST sequences were often obtained from oligo-dT primed cDNA libraries. Accordingly, they mainly corresponded to the 3′ untranslated region of the mRNA. In part, the prevalence of EST sequences derived from the 3′ end of the mRNA is a result of the fact that typical techniques for obtaining cDNAs are not well suited for isolating cDNA sequences derived from the 5′ ends of mRNAs (Adams et al., Nature 377:3-174, 1996; Hillier et al., Genome Res. 6:807-828, 1996).
In addition, in those reported instances where longer cDNA sequences have been obtained, the reported sequences typically correspond to coding sequences and do not include the full 5′ untranslated region (5′ UTR) of the mRNA from which the cDNA is derived. Yet 5′ UTRs have been shown to affect either the stability or translation of mRNAs. Thus, regulation of gene expression may be achieved through the use of alternative 5′ UTRs as shown, for instance, for the translation of the tissue inhibitor of metalloprotease mRNA in mitogenically activated cells (Waterhouse et al., J Biol Chem. 265:5585-9, 1990). Furthermore, modification of the 5′ UTR through mutation, insertion or translocation events may even be implicated in pathogenesis. For instance, fragile X syndrome, the most common cause of inherited mental retardation, is partly due to an insertion of multiple CGG trinucleotides in the 5′ UTR of the fragile X mRNA resulting in the inhibition of protein synthesis via ribosome stalling (Feng et al., Science 268:731-4, 1995). An aberrant mutation in regions of the 5′ UTR known to inhibit translation of the proto-oncogene c-myc was shown to result in upregulation of c-myc protein levels in cells derived from patients with multiple myelomas (Willis et al., Curr Top Microbiol Immunol 224:269-76, 1997). In addition, the use of oligo-dT primed cDNA libraries does not allow the isolation of complete 5′ UTRs since the incomplete sequences obtained by this process may not include the first exon of the mRNA, particularly in situations where the first exon is short. Furthermore, they may not include some exons, often short ones, which are located upstream of splicing sites. Thus, there is a need to obtain sequences derived from the 5′ ends of mRNAs.
While many sequences derived from human chromosomes have practical applications, approaches based on the identification and characterization of those chromosomal sequences which encode a protein product are particularly relevant to diagnostic and therapeutic uses. In some instances, the sequences used in such therapeutic or diagnostic techniques may be sequences which encode proteins which are secreted from the cell in which they are synthesized. Those sequences encoding secreted proteins, as well as the secreted proteins themselves, are particularly valuable as potential therapeutic agents. Such proteins are often involved in cell to cell communication and may be responsible for producing a clinically relevant response in their target cells. In fact, several secretory proteins, including tissue plasminogen activator, G-CSF, GM-CSF, erythropoietin, human growth hormone, insulin, interferon-α, interferon-β, interferon-γ, and interleukin-2, are currently in clinical use. These proteins are used to treat a wide range of conditions, including acute myocardial infarction, acute ischemic stroke, anemia, diabetes, growth hormone deficiency, hepatitis, kidney carcinoma, chemotherapy-induced neutropenia and multiple sclerosis. For these reasons, extended cDNAs encoding secreted proteins or portions thereof represent a valuable source of therapeutic agents. Thus, there is a need for the identification and characterization of secreted proteins and the nucleic acids encoding them.
In addition to being therapeutically useful themselves, secretory proteins include short peptides, called signal peptides, at their amino termini which direct their secretion. These signal peptides are encoded by the signal sequences located at the 5′ ends of the coding sequences of genes encoding secreted proteins. These signal peptides can be used to direct the extracellular secretion of any protein to which they are operably linked. In addition, portions of the signal peptides, called membrane-translocating sequences, may also be used to direct the intracellular import of a peptide or protein of interest. This may prove beneficial in gene therapy strategies in which it is desired to deliver a particular gene product to cells other than the cells in which it is produced. Signal sequences encoding signal peptides also find application in simplifying protein purification techniques. In such applications, the extracellular secretion of the desired protein greatly facilitates purification by reducing the number of undesired proteins from which the desired protein must be selected. Thus, there exists a need to identify and characterize the 5′ portions of the genes for secretory proteins which encode signal peptides.
Sequences coding for non-secreted proteins may also find application as therapeutics or diagnostics. In particular, such sequences may be used to determine whether an individual is likely to express a detectable phenotype, such as a disease, as a consequence of a mutation in the coding sequence of a protein. In instances where the individual is at risk of suffering from a disease or other undesirable phenotype as a result of a mutation in such a coding sequence, the undesirable phenotype may be corrected by introducing a normal coding sequence using gene therapy. Alternatively, if the undesirable phenotype results from overexpression of the protein encoded by the coding sequence, expression of the protein may be reduced using antisense or triple helix based strategies.
The secreted or non-secreted human polypeptides encoded by the coding sequences may also be used as therapeutics by administering them directly to an individual having a condition, such as a disease, resulting from a mutation in the sequence encoding the polypeptide. In such an instance, the condition can be cured or ameliorated by administering the polypeptide to the individual.
In addition, the secreted or non-secreted human polypeptides or portions thereof may be used to generate antibodies useful in determining the tissue type or species of origin of a biological sample. For example, the antibodies may be used to distinguish between human and non-human cells and tissues or to distinguish between human tissues that do and do not express the polypeptides. The antibodies may also be used to determine the cellular localization of the secreted or non-secreted human polypeptides or the cellular localization of polypeptides which have been fused to the human polypeptides. In addition, the antibodies may also be used in immunoaffinity chromatography techniques to isolate, purify, or enrich the human polypeptide or a target polypeptide which has been fused to the human polypeptide.
Public information on the number of human genes for which the promoters and upstream regulatory regions have been identified and characterized is quite limited. In part, this may be due to the difficulty of isolating such regulatory sequences. Upstream regulatory sequences such as transcription factor binding sites are typically too short to be utilized as probes for isolating promoters from human genomic libraries. Recently, some approaches have been developed to isolate human promoters. One of them consists of making a CpG island library (Cross et al., Nature Genetics 6: 236-244, 1994). The second consists of isolating human genomic DNA sequences containing SpeI binding sites by the use of SpeI binding protein (Mortlock et al., Genome Res. 6:327-335, 1996). Both of these approaches have their limits due to a lack of specificity and of comprehensiveness. Thus, there exists a need to identify and systematically characterize the 5′ portions of the genes.
The present 5′ ESTs may be used to efficiently identify and isolate 5′ UTRs and upstream regulatory regions which control the location, developmental stage, rate, and quantity of protein synthesis, as well as the stability of the mRNA. Once identified and characterized, these regulatory regions may be utilized in gene therapy or protein purification schemes to obtain the desired amount and locations of protein synthesis or to inhibit, reduce, or prevent the synthesis of undesirable gene products. The regulatory regions may also be used for expressing polypeptides in cell types from which the 5′ ESTs of the present invention were isolated.
In addition, ESTs containing the 5′ ends of protein genes may include sequences useful as probes for chromosome mapping and the identification of individuals. Thus, there is a need to identify and characterize the sequences upstream of the 5′ coding sequences of genes.
The present invention relates to purified, isolated, or enriched 5′ ESTs which include sequences derived from the authentic 5′ ends of their corresponding mRNAs. The term “corresponding mRNA” refers to the mRNA which was the template for the cDNA synthesis which produced the 5′ EST. These sequences will be referred to hereinafter as “5′ ESTs”. The present invention also includes purified, isolated or enriched nucleic acids comprising contigs assembled by determining a consensus sequence from a plurality of ESTs containing overlapping sequences. These contigs will be referred to herein as “consensus contigated ESTs.”
As used herein, the term “purified” does not require absolute purity; rather, it is intended as a relative definition. Individual 5′ EST clones isolated from a cDNA library have been conventionally purified to electrophoretic homogeneity. The sequences obtained from these clones could not be obtained directly either from the library or from total human DNA. The cDNA clones are not naturally occurring as such, but rather are obtained via manipulation of a partially purified naturally occurring substance (messenger RNA). The conversion of mRNA into a cDNA library involves the creation of a synthetic substance (cDNA) and pure individual cDNA clones can be isolated from the synthetic library by clonal selection. Thus, creating a cDNA library from messenger RNA and subsequently isolating individual clones from that library results in an approximately 10^4-10^6 fold purification of the native message. Purification of starting material or natural material to at least one order of magnitude, preferably two or three orders, and more preferably four or five orders of magnitude is expressly contemplated. Alternatively, purification may be expressed as “at least” a percent purity relative to heterologous polynucleotides (DNA, RNA or both). As a preferred embodiment, the polynucleotides of the present invention are at least 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%, 96%, 97%, 98%, 99%, or 100% pure relative to heterologous polynucleotides. As a further preferred embodiment the polynucleotides have an “at least” purity ranging from any number, to the thousandth position, between 90% and 100% (e.g., 5′ EST at least 99.995% pure) relative to heterologous polynucleotides. Additionally, purity of the polynucleotides may be expressed as a percentage (as described above) relative to all materials and compounds other than the carrier solution. Each number, to the thousandth position, may be claimed as an individual species of purity.
As used herein, the term “isolated” requires that the material be removed from its original environment (e.g., the natural environment if it is naturally occurring). For example, a naturally-occurring polynucleotide present in a living animal is not isolated, but the same polynucleotide, separated from some or all of the coexisting materials in the natural system, is isolated. Specifically excluded from the definition of “isolated” are: naturally occurring chromosomes (e.g., chromosome spreads), artificial chromosome libraries, genomic libraries, and cDNA libraries that exist either as an in vitro nucleic acid preparation or as a transfected/transformed host cell preparation, wherein the host cells are either an in vitro heterogeneous preparation or plated as a heterogeneous population of single colonies. Also specifically excluded are the above libraries wherein the 5′ EST makes up less than 5% of the number of nucleic acid inserts in the vector molecules. Further specifically excluded are whole cell genomic DNA or whole cell RNA preparations (including said whole cell preparations which are mechanically sheared or enzymatically digested). Further specifically excluded are the above whole cell preparations as either an in vitro preparation or as a heterogeneous mixture separated by electrophoresis (including blot transfers of the same) wherein the polynucleotides of the invention have not been further separated from the heterologous polynucleotides in the electrophoresis medium (e.g., further separating by excising a single band from a heterogeneous band population in an agarose gel or nylon blot).
As used herein, the term “recombinant” means that the 5′ EST is adjacent to “backbone” nucleic acid to which it is not adjacent in its natural environment. Additionally, to be “enriched” the 5′ ESTs will represent 5% or more of the number of nucleic acid inserts in a population of nucleic acid backbone molecules. Backbone molecules according to the present invention include nucleic acids such as expression vectors, self-replicating nucleic acids, viruses, integrating nucleic acids, and other vectors or nucleic acids used to maintain or manipulate a nucleic acid insert of interest. Preferably, the enriched 5′ ESTs represent 15% or more of the number of nucleic acid inserts in the population of recombinant backbone molecules. More preferably, the enriched 5′ ESTs represent 50% or more of the number of nucleic acid inserts in the population of recombinant backbone molecules. In a highly preferred embodiment, the enriched 5′ ESTs represent 90% or more (including any number between 90 and 100%, to the thousandth position, e.g., 99.5%) of the number of nucleic acid inserts in the population of recombinant backbone molecules.
“Stringent”, “moderate,” and “low” hybridization conditions are as defined below.
The term “polypeptide” refers to a polymer of amino acids without regard to the length of the polymer; thus, peptides, oligopeptides, and proteins are included within the definition of polypeptide. This term also does not specify or exclude chemical or post-expression modifications of the polypeptides of the invention, although chemical or post-expression modifications of these polypeptides may be included or excluded as specific embodiments. Therefore, for example, modifications to polypeptides which include the covalent attachment of glycosyl groups, acetyl groups, phosphate groups, lipid groups and the like are expressly encompassed by the term polypeptide. Further, polypeptides with these modifications may be specified as individual species to be included or excluded from the present invention. The natural or other chemical modifications, such as those listed in the example above, can occur anywhere in a polypeptide, including the peptide backbone, the amino acid side-chains and the amino or carboxyl termini. It will be appreciated that the same type of modification may be present in the same or varying degrees at several sites in a given polypeptide. Also, a given polypeptide may contain many types of modifications. Polypeptides may be branched, for example, as a result of ubiquitination, and they may be cyclic, with or without branching.
Modifications include acetylation, acylation, ADP-ribosylation, amidation, covalent attachment of flavin, covalent attachment of a heme moiety, covalent attachment of a nucleotide or nucleotide derivative, covalent attachment of a lipid or lipid derivative, covalent attachment of phosphatidylinositol, cross-linking, cyclization, disulfide bond formation, demethylation, formation of covalent cross-links, formation of cysteine, formation of pyroglutamate, formylation, gamma-carboxylation, glycosylation, GPI anchor formation, hydroxylation, iodination, methylation, myristoylation, oxidation, pegylation, proteolytic processing, phosphorylation, prenylation, racemization, selenoylation, sulfation, transfer-RNA mediated addition of amino acids to proteins such as arginylation, and ubiquitination. (See, for instance, PROTEINS - STRUCTURE AND MOLECULAR PROPERTIES, 2nd Ed., T. E. Creighton, W. H. Freeman and Company, New York (1993); POSTTRANSLATIONAL COVALENT MODIFICATION OF PROTEINS, B. C. Johnson, Ed., Academic Press, New York, pgs. 1-12 (1983); Seifter et al., Meth Enzymol 182:626-646 (1990); Rattan et al., Ann NY Acad Sci 663:48-62 (1992).) Also included within the definition are polypeptides which contain one or more analogs of an amino acid (including, for example, non-naturally occurring amino acids, amino acids which only occur naturally in an unrelated biological system, modified amino acids from mammalian systems etc.), polypeptides with substituted linkages, as well as other modifications known in the art, both naturally occurring and non-naturally occurring.
As used interchangeably herein, the terms “nucleic acids”, “oligonucleotides”, and “polynucleotides” include RNA or DNA (either double or single stranded, coding or antisense), or RNA/DNA hybrid sequences of more than one nucleotide in either single chain or duplex form (although each of the above species may be particularly specified). The term “nucleotide” is used herein as an adjective to describe molecules comprising RNA, DNA, or RNA/DNA hybrid sequences of any length in single-stranded or duplex form. The term “nucleotide” is also used herein as a noun to refer to individual nucleotides or varieties of nucleotides, meaning a molecule, or individual unit in a larger nucleic acid molecule, comprising a purine or pyrimidine, a ribose or deoxyribose sugar moiety, and a phosphate group, or phosphodiester linkage in the case of nucleotides within an oligonucleotide or polynucleotide. The term “nucleotide” is also used herein to encompass “modified nucleotides” which comprise at least one of the following modifications: (a) an alternative linking group, (b) an analogous form of purine, (c) an analogous form of pyrimidine, or (d) an analogous sugar. For examples of analogous linking groups, purines, pyrimidines, and sugars, see, for example, PCT publication No. WO 95/04064.
Preferred modifications of the present invention include, but are not limited to, 5-fluorouracil, 5-bromouracil, 5-chlorouracil, 5-iodouracil, hypoxanthine, xanthine, 4-acetylcytosine, 5-(carboxyhydroxylmethyl)uracil, 5-carboxymethylaminomethyl-2-thiouridine, 5-carboxymethylaminomethyluracil, dihydrouracil, beta-D-galactosylqueosine, inosine, N6-isopentenyladenine, 1-methylguanine, 1-methylinosine, 2,2-dimethylguanine, 2-methyladenine, 2-methylguanine, 3-methylcytosine, 5-methylcytosine, N6-adenine, 7-methylguanine, 5-methylaminomethyluracil, 5-methoxyaminomethyl-2-thiouracil, beta-D-mannosylqueosine, 5′-methoxycarboxymethyluracil, 5-methoxyuracil, 2-methylthio-N6-isopentenyladenine, uracil-5-oxyacetic acid (v), wybutoxosine, pseudouracil, queosine, 2-thiocytosine, 5-methyl-2-thiouracil, 2-thiouracil, 4-thiouracil, 5-methyluracil, uracil-5-oxyacetic acid methylester, uracil-5-oxyacetic acid, 5-methyl-2-thiouracil, 3-(3-amino-3-N-2-carboxypropyl)uracil, and 2,6-diaminopurine. Methylenemethylimino linked oligonucleosides, as well as mixed backbone compounds, may be prepared as described in U.S. Pat. Nos. 5,378,825; 5,386,023; 5,489,677; 5,602,240; and 5,610,289. Formacetal and thioformacetal linked oligonucleosides may be prepared as described in U.S. Pat. Nos. 5,264,562 and 5,264,564. Ethylene oxide linked oligonucleosides may be prepared as described in U.S. Pat. No. 5,223,618. Phosphinate oligonucleotides may be prepared as described in U.S. Pat. No. 5,508,270. Alkyl phosphonate oligonucleotides may be prepared as described in U.S. Pat. No. 4,469,863. 3′-Deoxy-3′-methylene phosphonate oligonucleotides may be prepared as described in U.S. Pat. Nos. 5,610,289 or 5,625,050. Phosphoramidite oligonucleotides may be prepared as described in U.S. Pat. No. 5,256,775 or U.S. Pat. No. 5,366,878. Alkylphosphonothioate oligonucleotides may be prepared as described in published PCT applications WO 94/17093 and WO 94/02499.
3′-Deoxy-3′-amino phosphoramidate oligonucleotides may be prepared as described in U.S. Pat. No. 5,476,925. Phosphotriester oligonucleotides may be prepared as described in U.S. Pat. No. 5,023,243. Borano phosphate oligonucleotides may be prepared as described in U.S. Pat. Nos. 5,130,302 and 5,177,198.
The polynucleotide sequences of the invention may be prepared by any known method, including synthetic, recombinant, ex vivo generation, or a combination thereof, as well as utilizing any purification methods known in the art.
In specific embodiments, the polynucleotides of the invention are at least 15, at least 30, at least 50, at least 100, at least 125, at least 500, or at least 1000 continuous nucleotides but are less than or equal to 300 kb, 200 kb, 100 kb, 50 kb, 10 kb, 7.5 kb, 5 kb, 2.5 kb, 2 kb, 1.5 kb, or 1 kb in length. In a further embodiment, polynucleotides of the invention comprise a portion of the coding sequences, as disclosed herein, but do not comprise all or a portion of any intron. In another embodiment, the polynucleotides comprising coding sequences do not contain coding sequences of a genomic flanking gene (i.e., 5′ or 3′ to the gene of interest in the genome). In other embodiments, the polynucleotides of the invention do not contain the coding sequence of more than 1000, 500, 250, 100, 75, 50, 25, 20, 15, 10, 5, 4, 3, 2, or 1 genomic flanking gene(s).
The terms “base paired” and “Watson and Crick base paired” are used interchangeably herein to refer to nucleotides which can be hydrogen bonded to one another by virtue of their sequence identities in a manner like that found in double-helical DNA with thymine or uracil residues linked to adenine residues by two hydrogen bonds and cytosine and guanine residues linked by three hydrogen bonds (See Stryer, L., Biochemistry, 4th edition, 1995).
The terms “complementary” or “complement thereof” are used herein to refer to a polynucleotide sequence which is capable of forming Watson and Crick base pairing with another specified polynucleotide throughout the entirety of the complementary region. For the purpose of the present invention, a first polynucleotide is deemed to be complementary to a second polynucleotide when each base in the first polynucleotide is paired with its complementary base. Complementary bases are, generally, A and T (or A and U), or C and G. “Complement” is used herein as a synonym for “complementary polynucleotide”, “complementary nucleic acid” and “complementary nucleotide sequence”. These terms are applied to pairs of polynucleotides based solely upon their sequences and not any particular set of conditions under which the two polynucleotides would actually bind. Preferably, a “complementary” sequence is a sequence which has an A at each position where there is a T on the opposite strand, a T at each position where there is an A on the opposite strand, a G at each position where there is a C on the opposite strand and a C at each position where there is a G on the opposite strand.
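By way of illustration only, the sequence-based definition of complementarity given above can be sketched as follows. The function names are assumptions made for this sketch, only the four standard bases (with U treated as T) are handled, and the strands are assumed to be written 5′ to 3′ and paired in antiparallel orientation. Consistent with the definition, the check depends solely on the two sequences and not on any hybridization conditions.

```python
# Illustrative sketch; function names are hypothetical and only A, C, G, T/U are handled.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def _normalize(seq: str) -> str:
    """Upper-case the sequence and treat RNA uracil as thymine for pairing purposes."""
    return seq.upper().replace("U", "T")


def reverse_complement(seq: str) -> str:
    """Return the complement of `seq`, read 5' to 3' on the opposite strand."""
    return "".join(PAIRS[base] for base in reversed(_normalize(seq)))


def is_complementary(first: str, second: str) -> bool:
    """True when every base of `first` pairs with the opposing base of `second`."""
    return reverse_complement(first) == _normalize(second)


rc = reverse_complement("ATGC")
```

A sequence containing any other character raises a `KeyError` in this sketch; handling of ambiguity codes or modified bases is deliberately left out.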
The terms “vertebrate nucleic acid” and “vertebrate polypeptide” are used herein to refer to any nucleic acid or polypeptide respectively which are derived from a vertebrate species including birds and more usually mammals, preferably primates such as humans, farm animals such as swine, goats, sheep, donkeys, and horses, rabbits or rodents, more preferably rats or mice. As used herein, the term “vertebrate” is used to refer to any vertebrate, preferably a mammal. The term “vertebrate” expressly embraces human subjects unless preceded with the term “non-human”.
Thus, 5′ ESTs in cDNA libraries in which one or more 5′ ESTs make up 5% or more of the number of nucleic acid inserts in the backbone molecules are “enriched recombinant 5′ ESTs” as defined herein. Likewise, 5′ ESTs in a population of plasmids in which one or more 5′ ESTs of the present invention have been inserted such that they represent 5% or more of the number of inserts in the plasmid backbone are “enriched recombinant 5′ ESTs” as defined herein. However, 5′ ESTs in cDNA libraries in which 5′ ESTs constitute less than 5% of the number of nucleic acid inserts in the population of backbone molecules, such as libraries in which backbone molecules having a 5′ EST insert are extremely rare, are not “enriched recombinant 5′ ESTs.”
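By way of illustration only, the enrichment thresholds recited herein (5%, 15%, 50%, and 90% of the number of nucleic acid inserts) can be expressed as a simple classification. The function name and the returned labels are assumptions made for this sketch. Integer arithmetic is used so that a library sitting exactly on a threshold is classified without floating-point rounding.

```python
# Illustrative sketch; the function name and labels are hypothetical.
def enrichment_level(est_inserts: int, total_inserts: int) -> str:
    """Classify a population of backbone molecules by its share of 5' EST inserts."""
    if total_inserts <= 0 or not 0 <= est_inserts <= total_inserts:
        raise ValueError("counts must satisfy 0 <= est_inserts <= total_inserts, total > 0")
    # Compare est_inserts / total_inserts against each percentage without dividing.
    percent_scaled = est_inserts * 100
    if percent_scaled >= 90 * total_inserts:
        return "enriched (highly preferred, 90% or more)"
    if percent_scaled >= 50 * total_inserts:
        return "enriched (more preferred, 50% or more)"
    if percent_scaled >= 15 * total_inserts:
        return "enriched (preferred, 15% or more)"
    if percent_scaled >= 5 * total_inserts:
        return "enriched (5% or more)"
    return "not enriched (less than 5%)"


level = enrichment_level(5, 100)
```

For example, a library in which 4 of 100 inserts are 5′ ESTs falls below the 5% threshold and is not enriched under this classification.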
The term “capable of hybridizing to the polyA tail of said mRNA” refers to and embraces all primers containing stretches of thymidine residues, so-called oligo(dT) primers, that hybridize to the 3′ end of eukaryotic poly(A)+ mRNAs to prime the synthesis of a first cDNA strand. Techniques for generating said oligo(dT) primers and hybridizing them to mRNA to subsequently prime the reverse transcription of said hybridized mRNA to generate a first cDNA strand are well known to those skilled in the art and are described in Current Protocols in Molecular Biology, John Wiley and Sons, Inc. 1997 and Sambrook et al., Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor Laboratory Press, 1989, the entire disclosures of which are incorporated herein by reference. Preferably, said oligo(dT) primers are present in a large excess in order to allow the hybridization of all mRNA 3′ ends to at least one oligo(dT) molecule. The priming and reverse transcription steps are preferably performed between 37° C. and 55° C. depending on the type of reverse transcriptase used.
Preferred oligo(dT) primers for priming reverse transcription of mRNAs are oligonucleotides containing a stretch of thymidine residues of sufficient length to hybridize specifically to the polyA tail of mRNAs, preferably of 12 to 18 thymidine residues in length. More preferably, such oligo(dT) primers comprise an additional sequence upstream of the poly(dT) stretch in order to allow the addition of a given sequence to the 5′ end of all first cDNA strands which may then be used to facilitate subsequent manipulation of the cDNA. Preferably, this added sequence is 8 to 60 residues in length. For instance, the addition of a restriction site at the 5′ end of cDNAs facilitates subcloning of the obtained cDNA. Alternatively, such an added 5′ end may also be used to design PCR primers to specifically amplify cDNA clones of interest.
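By way of illustration only, a primer meeting the preferred ranges given above (12 to 18 thymidine residues, optionally preceded by an added sequence of 8 to 60 residues) can be assembled and validated as sketched below. The function name is an assumption made for this sketch, and the NotI recognition site GCGGCCGC is used purely as an example of an added restriction site; it is not a sequence taken from this disclosure.

```python
# Illustrative sketch; the function name and the example adapter are assumptions.
def build_oligo_dt_primer(t_length: int = 18, adapter: str = "") -> str:
    """Return a primer written 5' to 3': optional added sequence, then the poly(dT) stretch."""
    if not 12 <= t_length <= 18:
        raise ValueError("poly(dT) stretch should be 12 to 18 residues")
    adapter = adapter.upper()
    if adapter:
        if not 8 <= len(adapter) <= 60:
            raise ValueError("added 5' sequence should be 8 to 60 residues")
        if set(adapter) - set("ACGT"):
            raise ValueError("added 5' sequence must contain only A, C, G, T")
    # The added sequence sits upstream (5') of the thymidine stretch.
    return adapter + "T" * t_length


NOTI_SITE = "GCGGCCGC"  # example restriction site, chosen for illustration only
primer = build_oligo_dt_primer(15, NOTI_SITE)
```

Every first cDNA strand primed with such an oligonucleotide carries the added sequence at its 5′ end, which is what makes the later subcloning or PCR amplification step possible.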
In some embodiments, the present invention relates to 5′ ESTs which are derived from genes encoding secreted proteins. As used herein, a “secreted” protein is one which, when expressed in a suitable host cell, is transported across or through a membrane, including transport as a result of signal peptides in its amino acid sequence. “Secreted” proteins include without limitation proteins secreted wholly (e.g. soluble proteins), or partially (e.g. receptors) from the cell in which they are expressed. “Secreted” proteins also include without limitation proteins which are transported across the membrane of the endoplasmic reticulum.
Such 5′ ESTs include nucleic acid sequences, called signal sequences, which encode signal peptides which direct the extracellular secretion of the proteins encoded by the genes from which the 5′ ESTs are derived. Generally, the signal peptides are located at the amino termini of secreted proteins. Polypeptides comprising these signal peptides (as delineated in the sequence listing), and polynucleotides encoding the same, are preferred embodiments of the present invention.
Secreted proteins are translated by ribosomes associated with the “rough” endoplasmic reticulum. Generally, secreted proteins are co-translationally transferred to the membrane of the endoplasmic reticulum. Association of the ribosome with the endoplasmic reticulum during translation of secreted proteins is mediated by the signal peptide. The signal peptide is typically cleaved following its co-translational entry into the endoplasmic reticulum. After delivery to the endoplasmic reticulum, secreted proteins may proceed through the Golgi apparatus. In the Golgi apparatus, the proteins may undergo post-translational modification before entering secretory vesicles which transport them across the cell membrane.
The 5′ ESTs of the present invention have several important applications. For example, the 5′ EST sequences of the sequence listing, and fragments thereof, may be used to distinguish human tissues or cells from non-human tissues or cells and to distinguish between human tissues or cells that do and do not express polynucleotides comprising the 5′ EST sequences of the present invention. By knowing the tissue expression pattern of the 5′ EST sequences, either through routine experimentation or by using the Tables herein, the polynucleotides of the present invention may be used in methods of determining the identity of an unknown tissue or cell sample. For example, if a 5′ EST is expressed in a particular tissue or cell type, as shown in the Tables below, and the unknown tissue or cell sample does not express the 5′ EST, it may be inferred that the unknown tissue or cells are either not human or not the same human tissue or cell type as that which expresses the 5′ EST. Conversely, if a 5′ EST is not expressed in a particular tissue or cell type, as shown in the Tables below, and the unknown tissue or cell sample does express the 5′ EST, it may be inferred that the unknown tissue or cells are either not human or not the same human tissue or cell type as that which does not express the 5′ EST. The above procedure may be used for either homogeneous tissue or cell samples or heterogeneous tissue or cell samples since one may only want to narrow the identity to human or non-human or to a tissue type. Further assays may be used in conjunction with the above methods to narrow or confirm the identification process. These methods of determining tissue or cell identity are based on methods which detect the presence or absence of the 5′ EST sequences in a tissue or cell sample using methods well known in the art (e.g., hybridization or PCR methods).
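The inference described above can be sketched as a simple filter: a candidate tissue remains consistent with a sample only if every tested 5′ EST is detected exactly when the expression table says it is expressed in that tissue. The tissue names, EST labels, and table contents below are hypothetical placeholders, not data from the Tables of this specification.

```python
from typing import Dict, List, Set


def consistent_tissues(expression: Dict[str, Set[str]],
                       observed: Dict[str, bool]) -> List[str]:
    """Return tissues whose tabulated expression agrees with every observation.

    expression: tissue name -> set of ESTs tabulated as expressed there.
    observed:   EST name -> True if detected in the unknown sample.
    """
    matches = []
    for tissue, expressed_ests in expression.items():
        # A single disagreement (expected but absent, or unexpected but
        # present) rules the tissue out.
        if all((est in expressed_ests) == detected
               for est, detected in observed.items()):
            matches.append(tissue)
    return matches


# Hypothetical expression table and assay result.
table = {
    "liver": {"EST_A", "EST_B"},
    "brain": {"EST_B", "EST_C"},
    "muscle": {"EST_A"},
}
sample = {"EST_A": True, "EST_C": False}
candidates = consistent_tissues(table, sample)
```

An empty result would suggest, under these assumptions, that the sample is either not human or not one of the tabulated tissue types; a result with several entries indicates that further assays are needed to narrow the identification.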
In other useful applications, fragments of the 5′ EST sequences encoding signal peptides, as well as degenerate polynucleotides encoding the same, may be ligated to sequences encoding either the polypeptide from the same gene or a heterologous polypeptide to facilitate secretion. The 5′ EST sequences, and fragments thereof, may also be used to obtain and express cDNA clones which include the full protein coding sequences of the corresponding gene products, including the authentic translation start sites derived from the 5′ ends of the coding sequences of the mRNAs from which the 5′ ESTs are derived. These cDNAs will be referred to hereinafter as “full-length cDNAs.” These cDNAs may also include DNA derived from mRNA sequences upstream of the translation start site. The full-length cDNA sequences may be used to express the proteins corresponding to the 5′ ESTs. As discussed above, secreted proteins and non-secreted proteins may be therapeutically important. Thus, the proteins expressed from the cDNAs may be useful in treating or controlling a variety of human conditions.
The 5′ ESTs may also be used to obtain the corresponding genomic DNA. The term “corresponding genomic DNA” refers to the genomic DNA which encodes the mRNA from which the 5′ EST was derived.
Another use of the polynucleotides of the present invention is to map and clone promoter regions and open reading frames from a genomic sequence. For example, the 5′ ESTs can be used in combination with the sequence information from genome sequencing projects, such as the U.S. Human Genome Project or other public and private genome sequencing projects, to map and clone regions of the genome that comprise promoters and expressed open reading frames. The polynucleotides of the present invention are particularly useful for mapping and identifying coding regions (regions containing expressed open reading frames) from a genomic sequence since the vast majority of the human genome does not encode expressed genes and because of the difficulty in identifying authentic open reading frames (open reading frames that encode expressed genes). The 5′ EST sequences of the present invention can be used in conjunction with various algorithms to identify promoter or entire ORF sequences.
Alternatively, the 5′ ESTs may be used to obtain and express extended cDNAs encoding portions of the protein. In the case of secreted proteins, the portions may comprise the signal peptides of the secreted proteins or the mature proteins generated when the signal peptide is cleaved off.
The present invention includes isolated, purified, or enriched “EST-related nucleic acids.” The terms “isolated”, “purified” or “enriched” have the meanings provided above.
As used herein, the term “EST-related nucleic acids” means the nucleic acids of SEQ ID NOs. 24-3883 and 7744-19335, extended cDNAs obtainable using the nucleic acids of SEQ ID NOs. 24-3883 and 7744-19335, full-length cDNAs obtainable using the nucleic acids of SEQ ID NOs. 24-3883 and 7744-19335 or genomic DNAs obtainable using the nucleic acids of SEQ ID NOs. 24-3883 and 7744-19335. The present invention also includes the sequences complementary to, or allelic variants of, the EST-related nucleic acids.
The present invention also includes isolated, purified, or enriched “fragments of EST-related nucleic acids.” The terms “isolated”, “purified” and “enriched” have the meanings described above. As used herein the term “fragments of EST-related nucleic acids” means fragments comprising at least 8, 10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500, or 1000 consecutive nucleotides of the EST-related nucleic acids to the extent that fragments of these lengths are consistent with the lengths of the particular EST-related nucleic acids being referred to. The present invention also includes the sequences complementary to the fragments of the EST-related nucleic acids. In particular, fragments of EST-related nucleic acids refer to the polynucleotides described in Tables IVa and IVb, and polynucleotides described in Tables IVa and IVb updated as defined below.
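To make the "at least k consecutive nucleotides" criterion concrete, the sketch below checks whether a query sequence contains a run of k consecutive nucleotides taken from a reference sequence or from its complementary strand. The function names and the toy sequences are assumptions for illustration; the reference here is not one of the sequences of the sequence listing, and exact string matching stands in for any laboratory comparison.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5'->3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def shares_fragment(query: str, reference: str, k: int = 8) -> bool:
    """True if query contains k consecutive nucleotides of reference
    or of the strand complementary to reference."""
    ref = reference.upper()
    rc = reverse_complement(ref)
    q = query.upper()
    # Collect every k-long window from both strands of the reference.
    windows = {ref[i:i + k] for i in range(len(ref) - k + 1)}
    windows |= {rc[i:i + k] for i in range(len(rc) - k + 1)}
    return any(q[i:i + k] in windows for i in range(len(q) - k + 1))


# Hypothetical 16-nucleotide reference.
reference = "ATGGCCAAGTTCCTGA"
hit_sense = shares_fragment("CCGCCAAGTTCC", reference, k=8)
hit_antisense = shares_fragment("TTTTCAGGAACTTTT", reference, k=8)
miss = shares_fragment("AAAAAAAAAA", reference, k=8)
```

If the reference is shorter than k, no windows exist and the function returns False, which mirrors the proviso that fragment lengths must be consistent with the length of the particular sequence.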
The present invention also includes isolated, purified, or enriched “positional segments of EST-related nucleic acids.” The terms “isolated”, “purified”, or “enriched” have the meanings provided above. As used herein, the term “positional segments of EST-related nucleic acids” includes segments comprising nucleotides 1-25, 26-50, 51-75, 76-100, 101-125, 126-150, 151-175, 176-200, 201-225, 226-250, 251-300, 301-325, 326-350, 351-375, 376-400, 401-425, 426-450, 451-475, 476-500, 501-525, 526-550, 551-575, 576-600 and 601-the terminal nucleotide of the EST-related nucleic acids to the extent that such nucleotide positions are consistent with the lengths of the particular EST-related nucleic acids being referred to, and wherein position “1” is defined as the 5′ most position defined in the sequence listing or Tables below. The term “positional segments of EST-related nucleic acids” also includes segments comprising nucleotides 1-50, 51-100, 101-150, 151-200, 201-250, 251-300, 301-350, 351-400, 401-450, 451-500, 501-550, 551-600 or 601-the terminal nucleotide of the EST-related nucleic acids to the extent that such nucleotide positions are consistent with the lengths of the particular EST-related nucleic acids being referred to. The term “positional segments of EST-related nucleic acids” also includes segments comprising nucleotides 1-100, 101-200, 201-300, 301-400, 401-500, 501-600, or 601-the terminal nucleotide of the EST-related nucleic acids to the extent that such nucleotide positions are consistent with the lengths of the particular EST-related nucleic acids being referred to.
In addition, the term “positional segments of EST-related nucleic acids” includes segments comprising nucleotides 1-200, 201-400, 401-600, or 601-the terminal nucleotide of the EST-related nucleic acids to the extent that such nucleotide positions are consistent with the lengths of the particular EST-related nucleic acids being referred to. The present invention also includes the sequences complementary to the positional segments of EST-related nucleic acids.
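The positional segments defined above follow a regular grid over positions 1 to 600, plus one open-ended segment starting at position 601. The following sketch generates such a grid for a sequence of a given length. It rests on two assumptions that the text does not state explicitly: that a segment is "consistent with the length" of a sequence only when it fits entirely within it, and that every segment in a series has the same width. The function name and parameters are illustrative.

```python
from typing import List, Tuple


def positional_segments(seq_len: int, width: int,
                        open_from: int = 601) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) segments of the given width
    covering positions 1..open_from-1, followed by open_from..seq_len."""
    segments = []
    for start in range(1, open_from, width):
        end = start + width - 1
        if end > seq_len:
            # Assumption: a segment that runs past the terminal
            # nucleotide is not consistent with this sequence's length.
            break
        segments.append((start, end))
    if seq_len >= open_from:
        # Final segment: position 601 through the terminal nucleotide.
        segments.append((open_from, seq_len))
    return segments


short_est = positional_segments(120, 25)
long_est = positional_segments(650, 200)
```

With 1-based inclusive coordinates, the subsequence for a segment `(start, end)` of a Python string `seq` is `seq[start - 1:end]`.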
The present invention also includes isolated, purified, or enriched “fragments of positional segments of EST-related nucleic acids.” The terms “isolated”, “purified”, or “enriched” have the meanings provided above. As used herein, the term “fragments of positional segments of EST-related nucleic acids” refers to fragments comprising at least 8, 10, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 150, or 200 consecutive nucleotides of the positional segments of EST-related nucleic acids. The present invention also includes the sequences complementary to the fragments of positional segments of EST-related nucleic acids.
In addition to the above “positional segments of EST-related nucleic acids” and “fragments of positional segments of EST-related nucleic acids”, for the nucleic acids of SEQ ID NOs. 24-3883 and 7744-19335, further preferred nucleic acids comprise at least 8 nucleotides, wherein “at least 8” is defined as any integer between 8 and the integer representing the 3′ most nucleotide position in the sequence listing or Tables below. Further included are nucleic acid fragments at least 8 nucleotides in length, as described above, that are further specified in terms of their 5′ and 3′ positions. The 5′ and 3′ positions are represented by the position numbers set forth in the sequence listing below. Therefore, every combination of a 5′ and 3′ nucleotide position that a fragment at least 8 contiguous nucleotides in length could occupy is included in the invention as an individual species. The polynucleotide fragments specified by 5′ and 3′ positions can be immediately envisaged and are therefore not individually listed, solely for the purpose of not unnecessarily lengthening the specification. It is noted that the above species of polynucleotide fragments of the present invention may alternatively be described by the formula “a to b”; where “a” equals the 5′ nucleotide position and “b” equals the 3′ nucleotide position of the polynucleotide fragment; and further where “a” equals an integer between 1 and the number of nucleotides of the polynucleotide sequence of the present invention minus 8, and where “b” equals an integer between 9 and the number of nucleotides of the polynucleotide sequence of the present invention; and where “a” is an integer smaller than “b” by at least 8.
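The "a to b" formula above can be enumerated mechanically. The sketch below yields every (a, b) pair permitted by the stated rule for a sequence of n nucleotides: a runs from 1 to n minus 8, b runs from 9 to n, and a is smaller than b by at least 8. The function name is an illustrative assumption, and the sketch follows the rule literally as written.

```python
from typing import Iterator, Tuple


def fragment_positions(n: int) -> Iterator[Tuple[int, int]]:
    """Yield every (a, b) with 1 <= a <= n - 8, 9 <= b <= n and b - a >= 8."""
    for a in range(1, n - 8 + 1):
        # b starts at a + 8, which is always at least 9 because a >= 1.
        for b in range(a + 8, n + 1):
            yield a, b


pairs_for_10 = list(fragment_positions(10))
count_for_100 = sum(1 for _ in fragment_positions(100))
```

For a sequence of n nucleotides the rule admits (n − 8)(n − 7)/2 pairs, so a 100-nucleotide sequence already yields several thousand individual species, which illustrates why they are not listed one by one.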
The present invention also provides for the exclusion of any polynucleotide fragments specified by 5′ and 3′ positions or by size in nucleotides as described above. Any number of fragments specified by 5′ and 3′ positions or by size in nucleotides, as described above, may be excluded.
The present invention also includes isolated or purified “EST-related polypeptides.” The terms “isolated” or “purified” have the meanings provided above. As used herein, the term “EST-related polypeptides” means the polypeptides encoded by the EST-related nucleic acids, including the polypeptides of SEQ ID NOs. 3884-7743.
The present invention also includes isolated or purified “fragments of EST-related polypeptides.” The terms “isolated” or “purified” have the meanings provided above. As used herein, the term “fragments of EST-related polypeptides” means fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino acids of an EST-related polypeptide to the extent that fragments of these lengths are consistent with the lengths of the particular EST-related polypeptides being referred to. In particular, fragments of EST-related polypeptides refer to polypeptides encoded by polynucleotides described in Tables IVa and IVb, and polynucleotides described in Tables IVa and IVb updated.
The present invention also includes isolated or purified “positional segments of EST-related polypeptides.” As used herein, the term “positional segments of EST-related polypeptides” includes polypeptides comprising amino acid residues 1-25, 26-50, 51-75, 76-100, 101-125, 126-150, 151-175, 176-200, or 201-the C-terminal amino acid of the EST-related polypeptides to the extent that such amino acid residues are consistent with the lengths of the particular EST-related polypeptides being referred to. The term “positional segments of EST-related polypeptides” also includes segments comprising amino acid residues 1-50, 51-100, 101-150, 151-200 or 201-the C-terminal amino acid of the EST-related polypeptides to the extent that such amino acid residues are consistent with the lengths of the particular EST-related polypeptides being referred to. The term “positional segments of EST-related polypeptides” also includes segments comprising amino acids 1-100 or 101-200 of the EST-related polypeptides to the extent that such amino acid residues are consistent with the lengths of the particular EST-related polypeptides being referred to. In addition, the term “positional segments of EST-related polypeptides” includes segments comprising amino acid residues 1-200 or 201-the C-terminal amino acid of the EST-related polypeptides to the extent that such amino acid residues are consistent with the lengths of the particular EST-related polypeptides being referred to.
The present invention also includes isolated or purified “fragments of positional segments of EST-related polypeptides.” The terms “isolated” or “purified” have the meanings provided above. As used herein, the term “fragments of positional segments of EST-related polypeptides” means fragments comprising at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino acids of positional segments of EST-related polypeptides to the extent that fragments of these lengths are consistent with the lengths of the particular EST-related polypeptides being referred to.
In addition to the above “positional segments of EST-related polypeptides” and “fragments of positional segments of EST-related polypeptides”, for the polypeptides of the present invention, further preferred polypeptides comprise at least 8 amino acids, wherein “at least 8” is defined as any integer between 8 and the integer representing the C-terminal amino acid of the polypeptide of the present invention including the polypeptide sequences of the sequence listing below. Further included are polypeptide fragments at least 8 amino acids in length, as described above, that are further specified in terms of their N-terminal and C-terminal positions. Preferred polypeptide fragment species specified by their N-terminal and C-terminal positions include the signal peptides delineated in the sequence listing below. However, included in the present invention as individual species are all polypeptide fragments, at least 5 amino acids in length, as described above, which may be particularly specified by an N-terminal and C-terminal position.
The present invention also provides for the exclusion of any fragments specified by N-terminal and C-terminal positions or by size in amino acid residues as described above. Any number of fragment species specified by N-terminal and C-terminal positions or sub-genus of fragments specified by size in amino acid residues as described above may be excluded from the present invention.
The polypeptide fragments of the present invention can be immediately envisaged using the above description and are therefore not individually listed, solely for the purpose of not unnecessarily lengthening the specification. The above fragments need not be active since they would be useful, for example, in immunoassays, in epitope mapping, in epitope tagging, as vaccines, to raise antibodies, to stimulate an immune response in a heterologous species, and as molecular weight markers. The above fragments may also be used to generate antibodies to a particular portion of the polypeptide. These antibodies can then be used in immunoassays well known in the art to distinguish between human and non-human cells and tissues or to determine whether cells or tissues in a biological sample are or are not of the same type which express the polypeptide of the present invention. Further preferred polypeptide fragments of the present invention comprise the signal peptides as delineated in the sequence listing. These signal peptides may be used to facilitate secretion of either the polypeptide of the same gene or a heterologous polypeptide.
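At the sequence level, the relationship between a precursor, its signal peptide, and the mature protein, and the reuse of a signal peptide on a heterologous polypeptide, can be sketched as simple string operations. The amino acid sequences, the cleavage position, and the function names below are hypothetical; in practice the cleavage position would come from the delineation in the sequence listing, and a real fusion would be made at the nucleic acid level.

```python
from typing import Tuple


def split_precursor(precursor: str, signal_length: int) -> Tuple[str, str]:
    """Split a precursor into (signal peptide, mature protein) at a known
    cleavage position, given as the length of the N-terminal signal peptide."""
    if not 0 < signal_length < len(precursor):
        raise ValueError("cleavage position must fall inside the precursor")
    return precursor[:signal_length], precursor[signal_length:]


def fuse_signal(signal_peptide: str, cargo: str) -> str:
    """Place a signal peptide at the amino terminus of another polypeptide."""
    return signal_peptide + cargo


# Hypothetical precursor: 15-residue signal peptide plus a short mature part.
precursor = "MKLLVLALLAGSAAA" + "DEVKQTPLW"
signal, mature = split_precursor(precursor, 15)

# Hypothetical heterologous polypeptide to be directed for secretion.
chimera = fuse_signal(signal, "GSHHHHHH")
```

The signal peptide is placed at the amino terminus of the chimera because, as noted earlier in this description, signal peptides are generally located at the amino termini of secreted proteins.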
The present invention also includes antibodies which specifically recognize the EST-related polypeptides, fragments of EST-related polypeptides, positional segments of EST-related polypeptides, or fragments of positional segments of EST-related polypeptides. In the case of secreted proteins, such as those of SEQ ID NOs. 5199-5919, antibodies which specifically recognize the mature protein generated when the signal peptide is cleaved may also be obtained as described below. Similarly, antibodies which specifically recognize the signal peptides of SEQ ID NOs. 3884-4243 or 5199-5919 may also be obtained.
In some embodiments and in the case of secreted proteins, the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids include a signal sequence. In other embodiments, the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids may include the full coding sequence for the protein or, in the case of secreted proteins, the full coding sequence of the mature protein (i.e. the protein generated when the signal polypeptide is cleaved off). In addition, the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids may include regulatory regions upstream of the translation start site or downstream of the stop codon which control the amount, location, or developmental stage of gene expression.
As discussed above, both secreted and non-secreted human proteins may be therapeutically important. Thus, the proteins expressed from the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids may be useful in treating or controlling a variety of human conditions.
The EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids may be used in forensic procedures to identify individuals or in diagnostic procedures to identify individuals having genetic diseases resulting from abnormal gene expression. In addition, the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids are useful for constructing a high resolution map of the human chromosomes.
The present invention also relates to secretion vectors capable of directing the secretion of a protein of interest. Such vectors may be used in gene therapy strategies in which it is desired to produce a gene product in one cell which is to be delivered to another location in the body. Secretion vectors may also facilitate the purification of desired proteins. The secretion vectors may also be used to express a desired protein, such as a heterologous protein, such that the protein is secreted into the culture medium, thereby facilitating purification.
The present invention also relates to expression vectors capable of directing the expression of an inserted gene in a desired spatial or temporal manner or at a desired level. Such vectors may include sequences upstream of the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids, such as promoters or upstream regulatory sequences. Preferred chimeric polypeptides, and vectors encoding the same, comprise a signal peptide set forth in the sequence listing below.
The present invention also comprises fusion vectors for making chimeric polypeptides comprising a first polypeptide and a second polypeptide. Such vectors are useful for determining the cellular localization of the chimeric polypeptides or for isolating, purifying or enriching the chimeric polypeptides.
The EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids may also be used for gene therapy to control or treat genetic diseases. In the case of secreted proteins, signal peptides may be fused to heterologous proteins to direct their extracellular secretion.
Bacterial clones containing Bluescript plasmids having inserts containing the sequence of the non-aligned 5′ ESTs, also referred to as singletons, and sequences of the 5′ ESTs which were aligned to yield consensus contigated 5′ ESTs are presently stored at −80° C. in 4% (v/v) glycerol in the inventor's laboratories under internal designations. The non-aligned 5′ ESTs of the invention are those sequences which are present in the sequence listing but whose identification number either corresponds to a single EST from a single tissue in the second column of Table V or is absent from the first column of Table V. The inserts may be recovered from the stored materials by growing the appropriate clones on a suitable medium. The Bluescript DNA can then be isolated using plasmid isolation procedures familiar to those skilled in the art, such as alkaline lysis minipreps or large scale alkaline lysis plasmid isolation procedures. If desired, the plasmid DNA may be further enriched by centrifugation on a cesium chloride gradient, size exclusion chromatography, or anion exchange chromatography. The plasmid DNA obtained using these procedures may then be manipulated using standard cloning techniques familiar to those skilled in the art. Alternatively, a PCR can be performed with primers designed at both ends of the inserted EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids. The PCR product which corresponds to the EST-related nucleic acids, fragments of EST-related nucleic acids, positional segments of EST-related nucleic acids, or fragments of positional segments of nucleic acids can then be manipulated using standard cloning techniques familiar to those skilled in the art.
One embodiment of the present invention is a purified nucleic acid comprising a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335.
Another embodiment of the present invention is a purified nucleic acid comprising at least 8, 10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500, or 1000 consecutive nucleotides, to the extent that fragments of these lengths are consistent with the specific sequence, of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335.
A further aspect of this embodiment is a purified vertebrate nucleic acid comprising at least 8, 10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500, or 1000 consecutive nucleotides, to the extent that fragments of these lengths are consistent with the specific sequence, of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335.
A further aspect of this embodiment is a purified human nucleic acid comprising at least 8, 10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500, or 1000 consecutive nucleotides, to the extent that fragments of these lengths are consistent with the specific sequence, of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335.
Another embodiment of the present invention is a purified nucleic acid comprising at least 8, 10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500, or 1000 consecutive nucleotides, to the extent that fragments of these lengths are consistent with the specific sequence, of a sequence selected from the group consisting of the preferred polynucleotides described in Tables IVa and IVb and sequences complementary to the sequences of the preferred polynucleotides described in Tables IVa and IVb.
Another embodiment of the present invention is a purified nucleic acid comprising at least 8, 10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500, or 1000 consecutive nucleotides, to the extent that fragments of these lengths are consistent with the specific sequence, of a sequence selected from the group consisting of the preferred polynucleotides described in Tables IVa and IVb updated and sequences complementary to the sequences of the preferred polynucleotides described in Tables IVa and IVb updated.
Another embodiment of the present invention is a purified nucleic acid comprising at least 15 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335.
A further aspect of this embodiment is a purified vertebrate nucleic acid comprising at least 15 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335.
A further aspect of this embodiment is a purified human nucleic acid comprising at least 15 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335.
Another embodiment of the present invention is a purified nucleic acid comprising at least 15 consecutive nucleotides of a sequence selected from the group consisting of the preferred polynucleotides described in Tables IVa and IVb and sequences complementary to the sequences of the preferred polynucleotides described in Tables IVa and IVb.
A further embodiment of the present invention is a purified nucleic acid comprising the coding sequence of a sequence selected from the group consisting of SEQ ID NOs. 24-3883.
Yet another embodiment of the present invention is a purified nucleic acid comprising the full coding sequences of a sequence selected from the group consisting of SEQ ID NOs. 1339-2059 wherein the full coding sequence comprises the sequence encoding the signal peptide and the sequence encoding the mature protein.
Still another embodiment of the present invention is a purified nucleic acid comprising a contiguous span of a sequence selected from the group consisting of SEQ ID NOs. 1339-2059 which encodes the mature protein.
Another embodiment of the present invention is a purified nucleic acid comprising a contiguous span of a sequence selected from the group consisting of SEQ ID NOs. 24-383 and 1339-2059 which encodes the signal peptide.
Another embodiment of the present invention is a purified nucleic acid encoding a polypeptide comprising a sequence selected from the group consisting of the sequences of SEQ ID NOs. 3884-7743.
Another embodiment of the present invention is a purified nucleic acid encoding a polypeptide comprising a sequence selected from the group consisting of the sequences of SEQ ID NOs. 5199-5919.
Another embodiment of the present invention is a purified nucleic acid encoding a polypeptide comprising a mature protein included in a sequence selected from the group consisting of the sequences of SEQ ID NOs. 5199-5919.
Another embodiment of the present invention is a purified nucleic acid encoding a polypeptide comprising a signal peptide included in a sequence selected from the group consisting of the sequences of SEQ ID NOs. 3884-4243 and 5199-5919.
Another embodiment of the present invention is a purified nucleic acid of at least 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500 or 1000 nucleotides in length which hybridizes under stringent conditions to a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335.
Another embodiment of the present invention is a vertebrate purified nucleic acid of at least 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500 or 1000 nucleotides in length which hybridizes under stringent conditions to a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335.
Another embodiment of the present invention is a human purified nucleic acid of at least 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500 or 1000 nucleotides in length which hybridizes under stringent conditions to a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335.
Another embodiment of the present invention is a purified or isolated polypeptide comprising a sequence selected from the group consisting of the sequences of SEQ ID NOs. 3884-7743.
Another embodiment of the present invention is a purified or isolated polypeptide comprising a sequence selected from the group consisting of SEQ ID NOs. 5199-5919.
Another embodiment of the present invention is a purified or isolated polypeptide comprising a mature protein of a polypeptide selected from the group consisting of SEQ ID NOs. 5199-5919.
Another embodiment of the present invention is a purified or isolated polypeptide comprising a signal peptide of a sequence selected from the group consisting of the polypeptides of SEQ ID NOs. 3884-4243 and 5199-5919.
Another embodiment of the present invention is a purified or isolated polypeptide comprising at least 5, 8, 10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, 75, 100, 200, 300, 500, or 1000 consecutive amino acids, to the extent that fragments of these lengths are consistent with the specific sequence, of a sequence selected from the group consisting of the sequences of SEQ ID NOs. 3884-7743.
Another embodiment of the present invention is a method of making a cDNA comprising the steps of contacting a collection of mRNA molecules from human cells with a primer comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of a sequence selected from the group consisting of the sequences complementary to SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335, hybridizing said primer to an mRNA in said collection that encodes said protein, reverse transcribing said hybridized primer to make a first cDNA strand from said mRNA, making a second cDNA strand complementary to said first cDNA strand and isolating the resulting cDNA encoding said protein comprising said first cDNA strand and said second cDNA strand.
Another embodiment of the present invention is a purified cDNA obtainable by the method of the preceding paragraph. In one aspect of this embodiment, the cDNA encodes at least a portion of a human polypeptide.
Another embodiment of the present invention is a purified cDNA obtained by a method of making a cDNA of the invention. In one aspect of this embodiment, the cDNA encodes at least a portion of a human polypeptide.
Another embodiment of the present invention is a method of making a cDNA comprising the steps of contacting a cDNA collection with a detectable probe comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and the sequences complementary to SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 under conditions which permit said probe to hybridize to cDNA, identifying a cDNA which hybridizes to said detectable probe, and isolating said cDNA which hybridizes to said probe.
Another embodiment of the present invention is a purified cDNA obtainable by the method of the preceding paragraph. In one aspect of this embodiment, the cDNA encodes at least a portion of a human polypeptide.
Another embodiment of the present invention is a method of making a cDNA comprising the steps of contacting a collection of mRNA molecules from human cells with a first primer capable of hybridizing to the polyA tail of said mRNA, hybridizing said first primer to said polyA tail, reverse transcribing said mRNA to make a first cDNA strand, making a second cDNA strand complementary to said first cDNA strand using at least one primer comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335, and isolating the resulting cDNA comprising said first cDNA strand and said second cDNA strand. In another aspect of this method the second cDNA strand is made by contacting said first cDNA strand with a second primer comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and a third primer which sequence is fully included within the sequence of said first primer, performing a first polymerase chain reaction with said second and third primers to generate a first PCR product, contacting said first PCR product with a fourth primer, comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of said sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335, and a fifth primer, which sequence is fully included within the sequence of said third primer, wherein said fourth and fifth primers hybridize to sequences within said first PCR product, and performing a second polymerase chain reaction, thereby generating a second PCR product.
Alternatively, the second cDNA strand may be made by contacting said first cDNA strand with a second primer comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and a third primer which sequence is fully included within the sequence of said first primer, performing a polymerase chain reaction with said second and third primers to generate said second cDNA strand. Alternatively, the second cDNA strand may be made by contacting said first cDNA strand with a second primer comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive nucleotides of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335, hybridizing said second primer to said first strand cDNA, and extending said hybridized second primer to generate said second cDNA strand.
Another embodiment of the present invention is a purified cDNA obtainable by a method of making a cDNA of the invention. In one aspect of this embodiment, said cDNA encodes at least a portion of a human polypeptide.
Another embodiment of the present invention is a method of making a polypeptide comprising the steps of obtaining a cDNA which encodes a polypeptide encoded by a nucleic acid comprising a sequence selected from the group consisting of SEQ ID NOs. 24-3883 or a cDNA which encodes a polypeptide comprising at least 6, 8, 10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive amino acids of a polypeptide encoded by a sequence selected from the group consisting of SEQ ID NOs. 24-3883, inserting said cDNA in an expression vector such that said cDNA is operably linked to a promoter, introducing said expression vector into a host cell whereby said host cell produces the protein encoded by said cDNA, and isolating said protein.
Another embodiment of the present invention is a method of obtaining a promoter DNA comprising the steps of obtaining genomic DNA located upstream of a nucleic acid comprising a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and the sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335, screening said genomic DNA to identify a promoter capable of directing transcription initiation, and isolating said DNA comprising said identified promoter.
In one aspect of this embodiment, said obtaining step comprises walking from genomic DNA comprising a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and the sequences complementary to SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335. In another aspect of this embodiment, said screening step comprises inserting genomic DNA located upstream of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and the sequences complementary to SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 into a promoter reporter vector. For example, said screening step may comprise identifying motifs in genomic DNA located upstream of a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and the sequences complementary to SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 which are transcription factor binding sites or transcription start sites.
Another embodiment of the present invention is an isolated promoter obtainable by the methods of the above paragraphs.
Another embodiment of the present invention is an isolated promoter obtained by the methods described in the above paragraphs.
Another embodiment of the present invention is the inclusion of at least one sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335, the sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and fragments comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, or 100 consecutive nucleotides of said sequence in an array of discrete ESTs or fragments thereof of at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, or 100 nucleotides in length. In some aspects of this embodiment, the array includes at least two sequences selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335, the sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335, and fragments comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, or 100 consecutive nucleotides of said sequences. In another aspect of this embodiment, the array includes at least one, three, five, ten, fifteen, or twenty sequences selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335, the sequences complementary to the sequences of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and fragments comprising at least 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, 50, or 100 consecutive nucleotides of said sequences.
Another embodiment of the present invention is an enriched population of recombinant nucleic acids, said recombinant nucleic acids comprising an insert nucleic acid and a backbone nucleic acid, wherein at least 0.01%, 0.05%, 0.1%, 0.5%, 1%, 2%, 5%, 10%, or 20% of said insert nucleic acids in said population comprise a sequence selected from the group consisting of SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335 and the sequences complementary to SEQ ID NOs. 24-3883 and SEQ ID NOs. 7744-19335.
Another embodiment of the present invention is a purified or isolated antibody capable of specifically binding to a polypeptide comprising a sequence selected from the group consisting of SEQ ID NOs. 3884-7743.
Another embodiment of the present invention is a purified or isolated antibody capable of specifically binding to a polypeptide comprising at least 6, 8, 10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 consecutive amino acids of a sequence selected from the group consisting of SEQ ID NOs. 3884-7743.
Yet another embodiment of the present invention is an antibody composition capable of selectively binding to an epitope-containing fragment of a polypeptide comprising a contiguous span of at least 8, 10, 12, 15, 18, 20, 23, 25, 28, 30, 35, 40, or 50 amino acids of any of SEQ ID NOs. 3884-7743, wherein said antibody is polyclonal or monoclonal.
Another embodiment of the present invention is a computer readable medium having stored thereon a sequence selected from the group consisting of a nucleic acid code of SEQ ID NOs. 24-3883 and 7744-19335 and a polypeptide code of SEQ ID NOs. 3884-7743.
Another embodiment of the present invention is a computer system comprising a processor and a data storage device wherein said data storage device has stored thereon a sequence selected from the group consisting of a nucleic acid code of SEQ ID NOs. 24-3883 and 7744-19335 and a polypeptide code of SEQ ID NOs. 3884-7743. In one aspect of this embodiment the computer system further comprises a sequence comparer and a data storage device having reference sequences stored thereon. For example, the sequence comparer may comprise a computer program which indicates polymorphisms. In another aspect of this embodiment, the computer system further comprises an identifier which identifies features in said sequence.
Another embodiment of the present invention is a method for comparing a first sequence to a reference sequence wherein said first sequence is selected from the group consisting of a nucleic acid code of SEQ ID NOs. 24-3883 and 7744-19335 and a polypeptide code of SEQ ID NOs. 3884-7743 comprising the steps of reading said first sequence and said reference sequence through use of a computer program which compares sequences and determining differences between said first sequence and said reference sequence with said computer program. In some aspects of this embodiment, said step of determining differences between the first sequence and the reference sequence comprises identifying polymorphisms.
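The comparison step of this embodiment can be illustrated with a minimal Python sketch. It is not the computer program of the invention: the function name `find_differences` is hypothetical, and the sketch assumes the two sequences have already been aligned and have equal length, whereas a practical sequence comparer would first perform an alignment. Positions at which either sequence carries the ambiguity symbol "n" are skipped, which is also an assumption made only for this illustration.

```python
def find_differences(first, reference):
    """Return (position, reference_symbol, first_symbol) for each mismatch.

    Positions are 1-based. Assumes pre-aligned sequences of equal length
    (an assumption of this sketch, not a requirement stated in the text).
    """
    if len(first) != len(reference):
        raise ValueError("sketch assumes pre-aligned sequences of equal length")
    return [
        (i + 1, r, f)
        for i, (f, r) in enumerate(zip(first.lower(), reference.lower()))
        # An ambiguous "n" cannot be called a difference, so it is skipped.
        if f != r and "n" not in (f, r)
    ]


if __name__ == "__main__":
    # Candidate polymorphic positions between a first and a reference sequence.
    print(find_differences("acgtn", "aagta"))
```

Each reported tuple marks a position where the first sequence departs from the reference, which is the kind of difference that may be examined further as a candidate polymorphism.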
Another embodiment of the present invention is a method for identifying a feature in a sequence selected from the group consisting of a nucleic acid code of SEQ ID NOs. 24-3883 and 7744-19335 and a polypeptide code of SEQ ID NOs. 3884-7743 comprising the steps of reading said sequence through the use of a computer program which identifies features in sequences and identifying features in said sequence with said computer program.
Another embodiment of the present invention is a vector comprising a nucleic acid according to any one of the nucleic acids described above.
Another embodiment of the present invention is a host cell containing the above vector.
Another embodiment of the present invention is a method of making any of the nucleic acids described above comprising the steps of introducing said nucleic acid into a host cell such that said nucleic acid is present in multiple copies in each host cell and isolating said nucleic acid from said host cell.
Another embodiment of the present invention is a method of making a nucleic acid of any of the nucleic acids described above comprising the step of sequentially linking together the nucleotides in said nucleic acids.
Another embodiment of the present invention is a method of making any of the polypeptides described above wherein said polypeptide is 150 amino acids in length or less comprising the step of sequentially linking together the amino acids in said polypeptide.
Another embodiment of the present invention is a method of making any of the polypeptides described above wherein said polypeptide is 120 amino acids in length or less comprising the step of sequentially linking together the amino acids in said polypeptide.
SEQ ID NOs. 1, 3, 5, 7, 9, 11, and 13 are full-length cDNAs prepared using the methods described herein.
SEQ ID NOs. 2, 4 and 6 are the signal peptides encoded by the nucleic acids of SEQ ID NOs. 1, 3 and 5 respectively.
SEQ ID NOs. 8, 10, 12, and 14 are the polypeptides encoded by the nucleic acids of SEQ ID NOs. 7, 9, 11, and 13 respectively.
SEQ ID NOs. 15, 16, 18, 19, 21 and 22 are primers whose use is described in the specification.
SEQ ID NOs. 17, 20, and 23 are the sequences of nucleic acids containing transcription factor binding sites which were obtained as described below.
SEQ ID NOs. 24-383 are nucleic acids having an incomplete ORF which encodes a signal peptide. As used herein, an “incomplete ORF” is an open reading frame in which a start codon has been identified but no stop codon has been identified. The locations of the incomplete ORFs and sequences encoding signal peptides are listed in the accompanying Sequence Listing. In addition, the von Heijne score of the signal peptide computed as described below is listed as the “score” in the accompanying Sequence Listing. The sequence of the signal-peptide is listed as “seq” in the accompanying Sequence Listing. The “/” in the signal peptide sequence indicates the location where proteolytic cleavage of the signal peptide occurs to generate a mature protein.
SEQ ID NOs. 384-1338 are nucleic acids having an incomplete ORF in which no sequence encoding a signal peptide has been identified to date. However, it remains possible that subsequent analysis will identify a sequence encoding a signal peptide in these nucleic acids. The locations of the incomplete ORFs are listed in the accompanying Sequence Listing.
SEQ ID NOs. 1339-2059 are nucleic acids having a complete ORF which encodes a signal peptide. As used herein, a “complete ORF” is an open reading frame in which a start codon and a stop codon have been identified. The locations of the complete ORFs and sequences encoding signal peptides are listed in the accompanying Sequence Listing. In addition, the von Heijne score of the signal peptide computed as described below is listed as the “score” in the accompanying Sequence Listing. The sequence of the signal-peptide is listed as “seq” in the accompanying Sequence Listing. The “/” in the signal peptide sequence indicates the location where proteolytic cleavage of the signal peptide occurs to generate a mature protein.
SEQ ID NOs. 2060-3883 are nucleic acids having a complete ORF in which no sequence encoding a signal peptide has been identified to date. However, it remains possible that subsequent analysis will identify a sequence encoding a signal peptide in these nucleic acids. The locations of the complete ORFs are listed in the accompanying Sequence Listing.
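The distinction between a complete ORF and an incomplete ORF, as those terms are defined above, can be illustrated with a short Python sketch. The function `classify_orf` is hypothetical and makes simplifying assumptions that are not taken from the specification: it treats the first "atg" in the sequence as the start codon, reads only the forward strand, and uses the three stop codons of the standard genetic code.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"taa", "tag", "tga"}


def classify_orf(seq):
    """Classify the ORF beginning at the first "atg" of a nucleotide sequence.

    Returns (kind, start, end) with 1-based inclusive coordinates, where kind
    is "complete" (start and in-frame stop codon found) or "incomplete" (start
    codon found, no in-frame stop codon). Returns None if no start codon exists.
    """
    seq = seq.lower()
    start = seq.find("atg")
    if start == -1:
        return None
    # Walk codon by codon, in frame with the start codon.
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return ("complete", start + 1, i + 3)
    return ("incomplete", start + 1, len(seq))


if __name__ == "__main__":
    print(classify_orf("ccatggcttaagg"))  # in-frame "taa" is present
    print(classify_orf("ccatggctgg"))     # sequence ends before any stop codon
```

An incomplete ORF in this sense typically arises when the sequenced fragment ends before the stop codon is reached, which is why later analysis of a longer sequence may convert it to a complete ORF.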
SEQ ID NOs. 3884-4243 are “incomplete polypeptide sequences” which include a signal peptide. “Incomplete polypeptide sequences” are polypeptide sequences encoded by nucleic acids in which a start codon has been identified but no stop codon has been identified. These polypeptides are encoded by the nucleic acids of SEQ ID NOs. 24-383. The location of the signal peptide is listed in the accompanying Sequence Listing.
SEQ ID NOs. 4244-5198 are incomplete polypeptide sequences in which no signal peptide has been identified to date. However, it remains possible that subsequent analysis will identify a signal peptide in these polypeptides. These polypeptides are encoded by the nucleic acids of SEQ ID NOs. 384-1338.
SEQ ID NOs. 5199-5919 are “complete polypeptide sequences” which include a signal peptide. “Complete polypeptide sequences” are polypeptide sequences encoded by nucleic acids in which a start codon and a stop codon have been identified. These polypeptides are encoded by the nucleic acids of SEQ ID NOs. 1339-2059. The location of the signal peptide is listed in the accompanying Sequence Listing.
SEQ ID NOs. 5920-7743 are complete polypeptide sequences in which no signal peptide has been identified to date. However, it remains possible that subsequent analysis will identify a signal peptide in these polypeptides. These polypeptides are encoded by the nucleic acids of SEQ ID NOs. 2060-3883.
SEQ ID NOs. 7744-19335 are nucleic acid sequences in which no open reading frame of at least 150 nucleotides has been conclusively identified to date. However, it remains possible that subsequent analysis will identify an open reading frame in these nucleic acids.
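The correspondence between the nucleic acid ranges and the polypeptide ranges described above can be summarized in a small lookup sketch. The range boundaries and category labels are taken from the preceding paragraphs. The constant offset of 3860 follows from those boundaries (for example, 3884 - 24 and 7743 - 3883), but the assumption that each nucleic acid maps to its polypeptide one-to-one in numerical order is an inference of this sketch and should be checked against the Sequence Listing. The name `polypeptide_for` is hypothetical.

```python
# (first nucleic acid SEQ ID NO, last nucleic acid SEQ ID NO, category)
CATEGORIES = [
    (24, 383, "incomplete ORF, signal peptide"),
    (384, 1338, "incomplete ORF, no signal peptide identified"),
    (1339, 2059, "complete ORF, signal peptide"),
    (2060, 3883, "complete ORF, no signal peptide identified"),
]

# Difference between corresponding range boundaries, e.g. 3884 - 24.
# Assumed (not stated in the text) to hold for every individual sequence.
OFFSET = 3860


def polypeptide_for(nucleic_id):
    """Return (polypeptide SEQ ID NO, category) for a nucleic acid SEQ ID NO.

    Returns None for identifiers outside SEQ ID NOs. 24-3883, such as
    SEQ ID NOs. 7744-19335, for which no ORF has been identified.
    """
    for low, high, label in CATEGORIES:
        if low <= nucleic_id <= high:
            return nucleic_id + OFFSET, label
    return None


if __name__ == "__main__":
    print(polypeptide_for(1339))
    print(polypeptide_for(7744))
```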
In the accompanying Sequence Listing, all instances of the symbol “n” in the nucleic acid sequences mean that the nucleotide can be adenine, guanine, cytosine or thymine. In some instances the polypeptide sequences in the Sequence Listing contain the symbol “Xaa.” These “Xaa” symbols indicate either (1) a residue which cannot be identified because of nucleotide sequence ambiguity or (2) a stop codon in the determined sequence where applicants believe one should not exist (if the sequence were determined more accurately). In some instances, several possible identities of the unknown amino acids may be suggested by the genetic code.
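The following Python sketch illustrates how the genetic code may suggest several possible identities for a residue whose codon contains the symbol “n”. It uses the standard genetic code with one-letter amino acid codes and "*" for a stop codon; the table and the function name `possible_residues` are assumptions of this illustration and do not describe how the “Xaa” symbols in the Sequence Listing were actually assigned.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order
# (ttt, ttc, tta, ttg, tct, ...); "*" denotes a stop codon.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def possible_residues(codon):
    """Return the set of residues a codon could encode.

    Each "n" is expanded to all four nucleotides, matching the meaning of
    "n" stated above (adenine, guanine, cytosine or thymine).
    """
    options = [BASES if base == "n" else base for base in codon.lower()]
    return {CODON_TABLE["".join(c)] for c in product(*options)}


if __name__ == "__main__":
    print(possible_residues("gcn"))  # one residue despite the ambiguity
    print(possible_residues("aan"))  # two candidate residues
    print(possible_residues("tan"))  # a residue or a stop codon
```

A codon such as "gcn" resolves to a single residue because the third position is degenerate, whereas "aan" leaves two candidates and "tan" cannot even distinguish a residue from a stop codon.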
In the case of secreted proteins, it should be noted that, in accordance with the regulations governing Sequence Listings, in the appended Sequence Listing, the encoded protein (i.e. the protein containing the signal peptide and the mature protein or part thereof) extends from an amino acid residue having a negative number through a positively numbered amino acid residue. Thus, the first amino acid of the mature protein resulting from cleavage of the signal peptide is designated as amino acid number 1, and the first amino acid of the signal peptide is designated with the appropriate negative number.
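The numbering convention described above can be sketched as follows, taking as input a sequence in which “/” marks the cleavage site, as in the “seq” entries of the Sequence Listing. The function name `number_residues` and the example sequence are hypothetical. The sketch assumes that the last residue of the signal peptide is numbered -1 and that no residue is numbered 0, which is consistent with the first residue of the mature protein being number 1 but is not stated explicitly in the text.

```python
def number_residues(seq_with_slash):
    """Number residues around a signal peptide cleavage site marked by "/".

    Signal peptide residues receive negative numbers ending at -1, and the
    mature protein starts at 1. No residue is numbered 0 (an assumption of
    this sketch).
    """
    signal, mature = seq_with_slash.split("/")
    numbered = [(i - len(signal), aa) for i, aa in enumerate(signal)]
    numbered += [(i + 1, aa) for i, aa in enumerate(mature)]
    return numbered


if __name__ == "__main__":
    # Made-up example: three-residue signal peptide, two mature residues.
    for number, residue in number_residues("MKA/QP"):
        print(number, residue)
```

For a signal peptide of k residues, the first residue of the encoded protein is therefore numbered -k under this convention.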