The swift pace of discovery of new gene products by genomics and proteomics efforts and the growing availability of vast repositories of genes necessitate a strategy for analyzing proteins in a high-throughput manner. The high-density array format lends itself well to ordered high-throughput experimentation and analysis and has therefore become an established and widely used format for high-throughput analysis of nucleic acids. Nucleic acid microarrays have enabled researchers to compare the expression of thousands of genes simultaneously. By making such comparisons, the expression patterns of clusters of genes in a particular biological context can be rapidly identified, which in turn can indicate groups of proteins that may act in concert in a specific pathway or process.
Reports of the analysis of protein function on a large scale are only just emerging. For example, a large-scale analysis of gene function in S. cerevisiae has been performed using a transposon-tagging strategy for the genome-wide characterization of disruption phenotypes, gene expression, and protein localization (Ross-Macdonald et al., Nature 402:413-418, 1999). In addition, complete two-hybrid analysis has been done using a large matrix of proteins for the interaction mapping of C. elegans proteins involved in vulval development (Walhout et al. (2000) Science 287:116-122) and the S. cerevisiae genome (Uetz et al. (2000) Nature 403:623-631; and Schwikowski (2000) Nature Biotech. 18:1257).
The concept of nonliving peptide and protein arrays has drawn considerable attention because this approach to high-throughput experimentation allows the direct analysis of discrete protein binding and enzymatic activities without the complications of adverse in vivo effects. For example, a low-density (96 well format) protein array has been developed in which proteins were spotted onto a nitrocellulose membrane and biomolecular interactions were visualized by autoradiography (Ge, H. (2000) Nucleic Acids Res. 28:e3, I-VII). In another example, a high-density protein array (100,000 samples within 222×222 mm) that was used for antibody screening was formed by spotting proteins onto polyvinylidene difluoride (PVDF) (Lueking et al. (1999) Anal. Biochem. 270:103-111). Proteins have been printed on a flat glass plate that contained wells formed by an enclosing hydrophobic Teflon mask, and the arrayed antigens were detected using enzyme-linked immunosorbent assay (ELISA) techniques (Mendoza et al. (1999) Biotechniques 27:778-788). A large-scale in vitro analysis of biochemical activity using affinity-purified yeast proteins has been performed in the context of an array of 6144 yeast strains, each bearing a plasmid expressing a different GST-ORF fusion (Martzen et al. (1999) Science 286:1153-1155). Proteins have been covalently linked to chemically derivatized flat glass slides in a high-density array (1600 spots per square centimeter), and protein-protein and protein-small molecule interactions were detected by fluorescence or radioactive decay (MacBeath and Schreiber (2000) Science 289:1760-1763). De Wildt et al. generated a high-density array of 18,342 bacterial clones, each expressing a different single-chain antibody, for screening antibody-antigen interactions (De Wildt et al. (2000) Nature Biotech. 18:989-994).
The inventors have discovered, among other things, that arrays of polypeptides can be generated by translation of nucleic acid sequences encoding the polypeptides at individual addresses on the array. This allows for the rapid and versatile development of a polypeptide microarray platform for analyzing and manipulating biological information.
In one aspect, the invention features an array including a substrate having a plurality of addresses. Each address of the plurality includes: (1) a nucleic acid (e.g., a DNA or an RNA) encoding a hybrid amino acid sequence which includes a test amino acid sequence and an affinity tag; and, optionally, (2) a binding agent that recognizes the affinity tag. Optionally, each address of the plurality also includes one or both of (i) an RNA polymerase; and (ii) a translation effector.
In a preferred embodiment, each test amino acid sequence in the plurality of addresses is unique. For example, a test amino acid sequence can differ from all other test amino acid sequences of the plurality by 1 or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, has about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the test amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other test amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the nucleic acid at each address of the plurality is the same as, or substantially identical to, all other affinity tags in the plurality of addresses. In another preferred embodiment, the nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
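By way of illustration only, the uniqueness criterion described above can be checked computationally. The following is a minimal sketch that assumes the test amino acid sequences are of equal length and are compared position by position; the function names count_differences and all_unique are hypothetical and are not part of the disclosure.

```python
from itertools import combinations


def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two equal-length amino acid sequences differ."""
    # Assumption of this sketch: sequences are pre-aligned and of equal length.
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares sequences of equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def all_unique(sequences, min_differences: int = 1) -> bool:
    """Return True if every pair of sequences differs at >= min_differences positions."""
    return all(
        count_differences(a, b) >= min_differences
        for a, b in combinations(sequences, 2)
    )
```

Sequences of unequal length would require an alignment step before comparison, which this sketch deliberately omits.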
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
The nucleic acid can be an RNA or a DNA (e.g., a single-stranded DNA or a double-stranded DNA). In a preferred embodiment, the nucleic acid includes a plasmid DNA or a fragment thereof; an amplification product (e.g., a product generated by RCA, PCR, or NASBA); or a synthetic DNA.
The nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed “open reading frames”), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like) or luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted, and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
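By way of illustration only, the selection criterion for an identifier can be sketched as a screening function. The sketch assumes DNA sequences written in the four-letter alphabet A, C, G, T, treats "complementary" as an exact reverse-complement match, and screens against the arrayed sequences as they exist before the identifier is inserted; partial or mismatched hybridization is not modeled. The names reverse_complement and identifier_is_acceptable are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string over the alphabet ACGT."""
    return dna.translate(_COMPLEMENT)[::-1]


def identifier_is_acceptable(candidate, other_identifiers, array_sequences) -> bool:
    """Screen a candidate identifier against other identifiers and arrayed sequences.

    The candidate is rejected if it, or its reverse complement, is identical to
    another identifier or occurs anywhere within an arrayed nucleic acid sequence.
    """
    rc = reverse_complement(candidate)
    for other in other_identifiers:
        if other == candidate or other == rc:
            return False
    for seq in array_sequences:
        if candidate in seq or rc in seq:
            return False
    return True
```

A stricter screen could additionally reject candidates sharing long partial matches with other sequences; the exact-match rule here is only the simplest reading of the criterion.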
The test amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The nucleic acids encoding the test amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The encoding nucleic acids can be nucleic acids (e.g., an mRNA or cDNA) expressed in a tissue, e.g., a normal or diseased tissue. The test polypeptides (i.e., test amino acid sequences) can be mutants or variants of a scaffold protein (e.g., an antibody, zinc-finger, polypeptide hormone etc.). In yet another embodiment, the test polypeptides are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches). The plurality of test amino acid sequences can include a plurality from a first source, and a plurality from a second source. For example, the test amino acid sequences on half the addresses of an array are from a diseased tissue or a first species, whereas the sequences on the remaining half are from a normal tissue or a second species.
In a preferred embodiment, each address of the plurality further includes one or more second nucleic acids, e.g., a plurality of unique nucleic acids. Hence, the plurality in toto can encode a plurality of test sequences. For example, each address of the plurality can encode a pool of test polypeptide sequences, e.g., a subset of a library or clone bank. A second array can be provided in which each address of the plurality of the second array includes a single or subset of members of the pool present at an address of the first array. The first and the second array can be used consecutively.
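By way of illustration only, the consecutive use of a first array of pools and a second array of pool members can be sketched as a layout computation. The sketch assumes that a pool is represented as a list of clone names and that addresses on the second array are numbered from zero; the function name split_pool and the clone names used in the example are hypothetical.

```python
def split_pool(members, subset_size: int = 1):
    """Map second-array address indices to single members or subsets of one pool.

    With subset_size=1 each address receives a single member of the pool;
    a larger subset_size places a smaller sub-pool at each address.
    """
    if subset_size < 1:
        raise ValueError("subset_size must be at least 1")
    return {
        index: list(members[start:start + subset_size])
        for index, start in enumerate(range(0, len(members), subset_size))
    }


# Example: a pool found at one address of the first array is spread across
# individual addresses of the second array.
second_array = split_pool(["clone_a", "clone_b", "clone_c"])
```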
In other preferred embodiments, each address of the plurality further includes a second nucleic acid encoding a second amino acid sequence.
In one preferred embodiment, each address of the plurality includes a first test amino acid sequence that is common to all addresses of the plurality, and a second test amino acid sequence that is unique among all the addresses of the plurality. For example, the second test amino acid sequences can be query sequences whereas the first test amino acid sequence can be a target sequence. In another preferred embodiment, each address of the plurality includes a first test amino acid sequence that is unique among all the addresses of the plurality, and a second test amino acid sequence that is common to all addresses of the plurality. For example, the first test amino acid sequences can be query sequences whereas the second test amino acid sequence can be a target sequence. The second nucleic acid encoding the second test amino acid sequence can include a sequence encoding a recognition tag and/or an affinity tag.
At at least one address of the plurality, the first and second amino acid sequences can be such that they interact with one another. In one preferred embodiment, they are capable of binding to each other. The second test amino acid sequence is optionally fused to a detectable amino acid sequence, e.g., an epitope tag, an enzyme, a fluorescent protein (e.g., GFP, BFP, or variants thereof). The second test amino acid sequence can be itself detectable (e.g., an antibody is available which specifically recognizes it). In another preferred embodiment, one is capable of modifying the other (e.g., making or breaking a bond, preferably a covalent bond, of the other). For example, the first amino acid sequence is a kinase capable of phosphorylating the second amino acid sequence; the first is a methylase capable of methylating the second; the first is a ubiquitin ligase capable of ubiquitinating the second; the first is a protease capable of cleaving the second; and so forth.
These embodiments can be used to identify an interaction or to identify a compound that modulates, e.g., inhibits or enhances, an interaction.
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate).
In yet another embodiment, an insoluble substrate (e.g., a bead or particle), is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
Also featured is a database, e.g., in computer memory or a computer readable medium. Each record of the database can include a field for the amino acid sequence encoded by the nucleic acid sequence and a descriptor or reference for the physical location of the nucleic acid sequence on the array. Optionally, the record also includes a field representing a result (e.g., a qualitative or quantitative result) of detecting the polypeptide encoded by the nucleic acid sequence. The database can include a record for each address of the plurality present on the array. The records can be clustered or have a reference to other records (e.g., including hierarchical groupings) based on the result.
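By way of illustration only, a record of such a database can be sketched as a simple data structure. The sketch assumes that the physical location is stored as a (row, column) pair and that the optional result is a single quantitative value; the names ArrayRecord and group_by_result, and the threshold-based grouping, are hypothetical choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ArrayRecord:
    """One record per address: location, encoded sequence, optional result."""
    address: Tuple[int, int]          # (row, column) on the array (assumed layout)
    amino_acid_sequence: str          # sequence encoded by the nucleic acid
    result: Optional[float] = None    # quantitative detection result, if any


def group_by_result(records: List[ArrayRecord], threshold: float) -> Dict[str, List[Tuple[int, int]]]:
    """Group record addresses by whether the result meets a threshold."""
    groups = defaultdict(list)
    for record in records:
        if record.result is None:
            key = "not_assayed"
        elif record.result >= threshold:
            key = "positive"
        else:
            key = "negative"
        groups[key].append(record.address)
    return dict(groups)
```

A qualitative result or a hierarchical grouping could be represented by replacing the float field or the threshold rule, respectively.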
In another aspect, the invention features an array including a substrate having a plurality of addresses. Each address of the plurality includes: (1) an RNA encoding a hybrid amino acid sequence comprising a test amino acid sequence and an affinity tag; and (2) a binding agent that recognizes the affinity tag. Optionally, each address of the plurality also includes one or both of (i) a transcription effector; and (ii) a translation effector.
In a preferred embodiment, each test amino acid sequence in the plurality of addresses is unique. For example, a test amino acid sequence can differ from all other test amino acid sequences of the plurality by 1 or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, has about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the test amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other test amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the nucleic acid at each address of the plurality is the same as, or substantially identical to, all other affinity tags in the plurality of addresses. In another preferred embodiment, the nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
The nucleic acid can further include one or more of: an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed “open reading frames”), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like) or luciferase.
In one embodiment, the nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted, and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
The test amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The nucleic acids encoding the test amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The encoding nucleic acids can be nucleic acids (e.g., an mRNA or cDNA) expressed in a tissue, e.g., a normal or diseased tissue. The test polypeptides (i.e., test amino acid sequences) can be mutants or variants of a scaffold protein (e.g., an antibody, zinc-finger, polypeptide hormone etc.). In yet another embodiment, the test polypeptides are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches). The plurality of test amino acid sequences can include a plurality from a first source, and a plurality from a second source. For example, the test amino acid sequences on half the addresses of an array are from a diseased tissue or a first species, whereas the sequences on the remaining half are from a normal tissue or a second species.
In a preferred embodiment, each address of the plurality further includes one or more second nucleic acids, e.g., a plurality of unique nucleic acids. Hence, the plurality in toto can encode a plurality of test sequences. For example, each address of the plurality can encode a pool of test polypeptide sequences, e.g., a subset of a library or clone bank. A second array can be provided in which each address of the plurality of the second array includes a single or subset of members of the pool present at an address of the first array. The first and the second array can be used consecutively.
In other preferred embodiments, each address of the plurality further includes a second nucleic acid encoding a second amino acid sequence.
In one preferred embodiment, each address of the plurality includes a first test amino acid sequence that is common to all addresses of the plurality, and a second test amino acid sequence that is unique among all the addresses of the plurality. For example, the second test amino acid sequences can be query sequences whereas the first test amino acid sequence can be a target sequence. In another preferred embodiment, each address of the plurality includes a first test amino acid sequence that is unique among all the addresses of the plurality, and a second test amino acid sequence that is common to all addresses of the plurality. For example, the first test amino acid sequences can be query sequences whereas the second test amino acid sequence can be a target sequence. The second nucleic acid encoding the second test amino acid sequence can include a sequence encoding a recognition tag and/or an affinity tag.
At at least one address of the plurality, the first and second amino acid sequences can be such that they interact with one another. In one preferred embodiment, they are capable of binding to each other. The second test amino acid sequence is optionally fused to a detectable amino acid sequence, e.g., an epitope tag, an enzyme, a fluorescent protein (e.g., GFP, BFP, or variants thereof). The second test amino acid sequence can be itself detectable (e.g., an antibody is available which specifically recognizes it). In another preferred embodiment, one is capable of modifying the other (e.g., making or breaking a bond, preferably a covalent bond, of the other). For example, the first amino acid sequence is a kinase capable of phosphorylating the second amino acid sequence; the first is a methylase capable of methylating the second; the first is a ubiquitin ligase capable of ubiquitinating the second; the first is a protease capable of cleaving the second; and so forth.
These embodiments can be used to identify an interaction or to identify a compound that modulates, e.g., inhibits or enhances, an interaction.
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate). In yet another embodiment, an insoluble substrate (e.g., a bead or particle) is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
In still another aspect, the invention features an array including a substrate having a plurality of addresses. Each address of the plurality includes: (1) a polypeptide comprising a test amino acid sequence and an affinity tag; and optionally (2) a binding agent. The binding agent is optimally capable of attaching to the affinity tag of the polypeptide. Optionally, each address of the plurality also includes a translation effector and/or a transcription effector.
In a preferred embodiment, each test amino acid sequence in the plurality of addresses is unique. For example, a test amino acid sequence can differ from all other test amino acid sequences of the plurality by 1 or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, has about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the test amino acid sequence of the polypeptide is identical to all other test amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag of the polypeptide at each address of the plurality is the same as, or substantially identical to, all other affinity tags in the plurality of addresses.
In a preferred embodiment, the polypeptide has more than one affinity tag. In another embodiment, the polypeptide of an address has an affinity tag that differs from at least one other affinity tag of a polypeptide in the plurality of addresses.
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
In another embodiment, each address of the plurality further includes a nucleic acid. The nucleic acid at each address of the plurality encodes the polypeptide. The nucleic acid can be an RNA or a DNA (e.g., a single-stranded DNA or a double-stranded DNA). In a preferred embodiment, the nucleic acid includes a plasmid DNA or a fragment thereof; an amplification product (e.g., a product generated by RCA, PCR, or NASBA); or a synthetic DNA.
The nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed “open reading frames”), e.g., the sequence is dicistronic or polycistronic.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
The nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted, and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
In another embodiment, the polypeptide further includes a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like) or luciferase.
In another embodiment, the polypeptide includes a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The polypeptide can also include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The test amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
A variety of test amino acid sequences can be disposed at different addresses of the plurality. For example, the test amino acid sequences can be polypeptides expressed in a tissue, e.g., a normal or diseased tissue. The test polypeptides can be mutants or variants of a scaffold protein (e.g., an antibody, zinc-finger, polypeptide hormone, etc.). In yet another embodiment, the test polypeptides are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches). The plurality of test amino acid sequences can include a plurality from a first source, and a plurality from a second source. For example, the test amino acid sequences on half the addresses of an array are from a diseased tissue or a first species, whereas the sequences on the remaining half are from a normal tissue or a second species.
In a preferred embodiment, each address of the plurality further includes one or more second polypeptides. Hence, the plurality, in toto, can encode a plurality of test polypeptides. For example, each address of the plurality can include a pool of test polypeptide sequences, e.g., a subset of polypeptides encoded by a library or clone bank. A second array can be provided in which each address of the plurality of the second array includes a single or subset of members of the pool present at an address of the first array. The first and the second array can be used consecutively.
In other preferred embodiments, each address of the plurality further includes a second polypeptide.
In one preferred embodiment, each address of the plurality includes a first test amino acid sequence that is common to all addresses of the plurality, and a second test amino acid sequence that is unique among all the addresses of the plurality. For example, the second test amino acid sequences can be query sequences whereas the first test amino acid sequence can be a target sequence. In another preferred embodiment, each address of the plurality includes a first test amino acid sequence that is unique among all the addresses of the plurality, and a second test amino acid sequence that is common to all addresses of the plurality. For example, the first test amino acid sequences can be query sequences whereas the second test amino acid sequence can be a target sequence. The second test amino acid sequence can include a recognition tag and/or an affinity tag.
At at least one address of the plurality, the first and second amino acid sequences can be such that they interact with one another. In one preferred embodiment, they are capable of binding to each other. The second test amino acid sequence is optionally fused to a detectable amino acid sequence, e.g., an epitope tag, an enzyme, a fluorescent protein (e.g., GFP, BFP, variants thereof). The second test amino acid sequence can be itself detectable (e.g., an antibody is available which specifically recognizes it). In another preferred embodiment, one is capable of modifying the other (e.g., making or breaking a bond, preferably a covalent bond, of the other). For example, the first amino acid sequence is a kinase capable of phosphorylating the second amino acid sequence; the first is a methylase capable of methylating the second; the first is a ubiquitin ligase capable of ubiquitinating the second; the first is a protease capable of cleaving the second; and so forth. These embodiments can be used to identify an interaction or to identify a compound that modulates, e.g., inhibits or enhances, an interaction.
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate). In yet another embodiment, an insoluble substrate (e.g., a bead or particle) is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
Also featured is a database, e.g., in computer memory or a computer readable medium. Each record of the database can include a field for the amino acid sequence of the polypeptide at an address and a descriptor or reference for the physical location of the address on the array. Optionally, the record also includes a field representing a result (e.g., a qualitative or quantitative result) of detecting the polypeptide. The database can include a record for each address of the plurality present on the array. The records can be clustered or have a reference to other records (e.g., including hierarchical groupings) based on the result.
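By way of illustration only, one possible layout for such a record, and a simple grouping of records by result, is sketched below. The class name, the row and column descriptor, the numeric result, and the threshold-based grouping are assumptions made for this sketch; the description above does not prescribe a particular schema or clustering method.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class AddressRecord:
    """One record per address of the array (hypothetical schema)."""
    row: int                         # physical location: row on the substrate
    column: int                      # physical location: column on the substrate
    amino_acid_sequence: str         # polypeptide at this address
    result: Optional[float] = None   # optional quantitative detection result


def group_by_result(records, threshold):
    """Group address locations into coarse classes based on the result field."""
    groups = defaultdict(list)
    for rec in records:
        if rec.result is None:
            label = "not_measured"
        elif rec.result >= threshold:
            label = "positive"
        else:
            label = "negative"
        groups[label].append((rec.row, rec.column))
    return dict(groups)


# Illustrative records; sequences and signal values are made up.
records = [
    AddressRecord(0, 0, "MAK", result=0.92),
    AddressRecord(0, 1, "MGS", result=0.10),
    AddressRecord(1, 0, "MTT"),
]
groups = group_by_result(records, threshold=0.5)
```

A hierarchical grouping could replace the two-class split shown here without changing the record layout.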
The invention also features a method of providing an array. The method includes: (1) providing a substrate with a plurality of addresses; and (2) providing at each address of the plurality at least (i) a nucleic acid encoding an amino acid sequence comprising a test amino acid sequence and an affinity tag, and optionally (ii) a binding agent that recognizes the affinity tag.
The method can further include contacting each address of the plurality with one or more of (i) a transcription effector, and (ii) a translation effector. Optionally, the substrate is maintained under conditions permissive for the amino acid sequence to bind the binding agent. One or more addresses can then be washed, e.g., to remove at least one of (i) the nucleic acid, (ii) the transcription effector, (iii) the translation effector, and/or (iv) an unwanted polypeptide, e.g., an unbound polypeptide or unfolded polypeptide. The array can optionally be contacted with a compound, e.g., a chaperone; a protease; a protein-modifying enzyme; a small molecule, e.g., a small organic compound (e.g., of molecular weight less than 5000, 3000, 1000, 700, 500, or 300 Daltons); nucleic acids; or other complex macromolecules e.g., complex sugars, lipids, or matrix molecules.
The array can be further processed, e.g., prepared for storage. It can be enclosed in a package, e.g., an air- or water-resistant package. The array can be desiccated, frozen, or contacted with a storage agent (e.g., a cryoprotectant, an anti-bacterial, an anti-fungal). For example, an array can be rapidly frozen after being optionally contacted with a cryoprotectant. This step can be done at any point in the process (e.g., before or after contacting the array with an RNA polymerase; before or after contacting the array with a translation effector; or before or after washing the array). The packaged product can be supplied to a user with or without additional contents, e.g., a transcription effector, a translation effector, a vector nucleic acid, an antibody, and so forth.
In a preferred embodiment, each test amino acid sequence in the plurality of addresses is unique. For example, a test amino acid sequence can differ from all other test amino acid sequences of the plurality by 1 or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, has about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the test amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other test amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the nucleic acid at each address of the plurality is the same, or substantially identical to all other affinity tags in the plurality of addresses. In another preferred embodiment, the nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
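By way of illustration only, the number of amino acid differences between test sequences can be checked as sketched below. The position-by-position comparison without alignment is a simplifying assumption of this sketch, as are the function names; an alignment-based comparison may be more appropriate for sequences of different lengths.

```python
from itertools import combinations


def count_differences(a: str, b: str) -> int:
    """Count position-by-position differences between two sequences.

    Unpaired trailing residues of the longer sequence each count as one
    difference. No alignment is performed (simplifying assumption).
    """
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches + abs(len(a) - len(b))


def all_unique(sequences, min_differences=1):
    """True if every pair of sequences differs by at least `min_differences`."""
    return all(
        count_differences(a, b) >= min_differences
        for a, b in combinations(sequences, 2)
    )


# Illustrative sequences only.
panel = ["MAKT", "MAKS", "MGKS"]
unique_by_one = all_unique(panel)
unique_by_two = all_unique(panel, min_differences=2)
```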
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
The nucleic acid can be an RNA, or a DNA (e.g., a single-stranded DNA, or a double-stranded DNA). In a preferred embodiment, the nucleic acid includes a plasmid DNA or a fragment thereof; an amplification product (e.g., a product generated by RCA, PCR, NASBA); or a synthetic DNA.
The nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed "open reading frames"), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like), and luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted, and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
The test amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The nucleic acid sequences encoding the test amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The test amino acid sequences can be encoded by genes expressed in a tissue, e.g., a normal or diseased tissue. The test polypeptides can be mutants or variants of a scaffold protein (e.g., an antibody, zinc-finger, polypeptide hormone, etc.). In yet another embodiment, the test polypeptides are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches). The plurality of test amino acid sequences can include a plurality from a first source, and a plurality from a second source. For example, the test amino acid sequences on half the addresses of an array are from a diseased tissue or a first species, whereas the sequences on the remaining half are from a normal tissue or a second species.
In a preferred embodiment, each address of the plurality further includes one or more second nucleic acids, e.g., a plurality of unique nucleic acids. Hence, the plurality in toto can encode a plurality of test sequences. For example, each address of the plurality can encode a pool of test polypeptide sequences, e.g., a subset of a library or clone bank. A second array can be provided in which each address of the plurality of the second array includes a single or subset of members of the pool present at an address of the first array. The first and the second array can be used consecutively.
In other preferred embodiments, each address of the plurality further includes a second nucleic acid encoding a second amino acid sequence.
In one preferred embodiment, each address of the plurality includes a first test amino acid sequence that is common to all addresses of the plurality, and a second test amino acid sequence that is unique among all the addresses of the plurality. For example, the second test amino acid sequences can be query sequences whereas the first test amino acid sequence can be a target sequence. In another preferred embodiment, each address of the plurality includes a first test amino acid sequence that is unique among all the addresses of the plurality, and a second test amino acid sequence that is common to all addresses of the plurality. For example, the first test amino acid sequences can be query sequences whereas the second test amino acid sequence can be a target sequence. The second nucleic acid encoding the second test amino acid sequence can include a sequence encoding a recognition tag and/or an affinity tag.
At at least one address of the plurality, the first and second amino acid sequences can be such that they interact with one another. In one preferred embodiment, they are capable of binding to each other. The second test amino acid sequence is optionally fused to a detectable amino acid sequence, e.g., an epitope tag, an enzyme, a fluorescent protein (e.g., GFP, BFP, variants thereof). The second test amino acid sequence can be itself detectable (e.g., an antibody is available which specifically recognizes it). The method can further include detecting the second test amino acid sequence at each address of the plurality, e.g., by detecting the detectable amino acid sequence (e.g., the epitope tag, enzyme or fluorescent protein).
In another preferred embodiment, one is capable of modifying the other (e.g., making or breaking a bond, preferably a covalent bond, of the other). For example, the first amino acid sequence is a kinase capable of phosphorylating the second amino acid sequence; the first is a methylase capable of methylating the second; the first is a ubiquitin ligase capable of ubiquitinating the second; the first is a protease capable of cleaving the second; and so forth. The method can further include detecting the modification at each address of the plurality.
These embodiments can be used to identify an interaction or to identify a compound that modulates, e.g., inhibits or enhances, an interaction.
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate).
In yet another embodiment, an insoluble substrate (e.g., a bead or particle) is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
The method can further include providing a database, e.g., in computer memory or a computer readable medium. Each record of the database can include a field for the amino acid sequence encoded by the nucleic acid sequence and a descriptor or reference for the physical location of the nucleic acid sequence on the array. The database can include a record for each address of the plurality present on the array. Optionally, the method includes entering into the record a field representing a result (e.g., a qualitative or quantitative result) of detecting the polypeptide encoded by the nucleic acid sequence. The method can also further include clustering or grouping the records based on the result.
The invention also features a method of providing an array to a user. The method includes providing the user with a substrate having a plurality of addresses and a vector nucleic acid. The vector nucleic acid can include one or more sites for insertion of a test amino acid sequence (e.g., a recombination site or a restriction site), and a sequence encoding an affinity tag. In a preferred embodiment, the vector nucleic acid has two sites for insertion, and a toxic gene inserted between the two sites. In another embodiment, the sites for insertion are homologous recombination or site-specific recombination sites, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, one or both recombination sites lack stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, one or both recombination sites include a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In a much preferred embodiment, the affinity tag is in frame with the translation frame of a nucleic acid sequence (e.g., a sequence to be inserted) encoding a test amino acid sequence. In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence. The cleavage site can be a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
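By way of illustration only, the requirement that the affinity tag, linker, and test amino acid sequence share one translation frame can be checked as sketched below. The hexahistidine tag, the Gly-Gly-Ser linker, the short test sequence, and the function names are assumptions chosen for this sketch; the standard genetic code is used for translation.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def build_fusion(tag: str, linker: str, test: str) -> str:
    """Assemble start codon, N-terminal tag, linker, test sequence, stop codon."""
    for name, part in (("tag", tag), ("linker", linker), ("test", test)):
        if len(part) % 3 != 0:
            raise ValueError(f"{name} length is not a multiple of three")
    return "ATG" + tag + linker + test + "TAA"


def is_in_frame(orf: str) -> bool:
    """True if the ORF is whole codons with a single stop codon at the end."""
    protein = translate(orf)
    return len(orf) % 3 == 0 and protein.endswith("*") and "*" not in protein[:-1]


# Illustrative parts only (hypothetical choices, not from the description).
HIS6 = "CAT" * 6          # encodes HHHHHH
LINKER = "GGTGGCTCT"      # encodes GGS
TEST = "GCTAAAGAA"        # encodes AKE

fusion = build_fusion(HIS6, LINKER, TEST)
protein = translate(fusion)
shifted = "ATG" + HIS6 + "G" + LINKER + TEST + "TAA"  # one extra base
```

The same check applies when the tag is carboxy-terminal; only the order of the parts in the assembly changes.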
In a preferred embodiment, the method includes providing the user with at least a second vector nucleic acid. The second vector nucleic acid can include one or more sites for insertion of a test amino acid sequence (e.g., a recombination site or a restriction site). In one embodiment, the second vector nucleic acid has a second test amino acid sequence inserted therein. Multiple nucleic acids can be provided, each having a unique test amino acid sequence, e.g., for disposal at a unique address of the substrate. The method can further include contacting each address with a transcription effector and/or a translation effector.
In a preferred embodiment, the second vector nucleic acid has a recognition tag, e.g., an epitope tag, an enzyme, a fluorescent protein (e.g., GFP, BFP, variants thereof).
In a preferred embodiment, each test amino acid sequence in the plurality of addresses is unique. For example, a test amino acid sequence can differ from all other test amino acid sequences of the plurality by 1 or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, has about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the test amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other test amino acid sequences in the plurality of addresses.
The first and/or second vector nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed "open reading frames"), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like), and luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter.
In a preferred embodiment, the method further includes contacting the vector nucleic acid, and optionally the second vector nucleic acid, with a test nucleic acid which includes a nucleic acid encoding a test amino acid sequence so as to insert the test amino acid sequence into the vector nucleic acid. The test nucleic acid can be flanked, e.g., on both ends by a site, e.g., a site compatible with the vector nucleic acid (e.g., having sequence for recombination with a sequence in the vector; or having a restriction site which leaves an overhang or blunt end such that the overhang or blunt end can be ligated into the vector nucleic acid (e.g., the restricted vector nucleic acid)). The contact step can include contacting the vector nucleic acid with a recombinase, a ligase, and/or a restriction endonuclease. For example, the recombinase can mediate recombination, e.g., site-specific recombination or homologous recombination, between a recombination site on the test nucleic acid and a recombination sequence on the vector nucleic acid.
In a preferred embodiment, each address of the plurality has a binding agent capable of recognizing the affinity tag. The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate).
In yet another embodiment, an insoluble substrate (e.g., a bead or particle) is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
In a preferred embodiment, the method further includes disposing at an address of the plurality a vector nucleic acid that includes a nucleic acid encoding a test amino acid sequence. This step can be repeated until a vector nucleic acid is disposed at each address of the plurality. In embodiments using a second vector nucleic acid in addition to the first, the method can include disposing at each address of the plurality a second vector nucleic acid encoding a different test amino acid sequence from the first vector nucleic acid.
In another preferred embodiment, the method further includes disposing at an address of the plurality a vector nucleic acid that does not include a nucleic acid encoding a test amino acid sequence and concurrently or separately disposing a nucleic acid encoding a test amino acid sequence. This step can be repeated until a vector nucleic acid is disposed at each address of the plurality. The method can also further include contacting each address of the plurality with a recombinase or a ligase.
The first or second vector nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The first or second vector nucleic acid sequence can further include a sequence encoding a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The nucleic acids encoding the test amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The encoding nucleic acids can be nucleic acids (e.g., an mRNA or cDNA) expressed in a tissue, e.g., a normal or diseased tissue. The test polypeptides (i.e., test amino acid sequences) can be mutants or variants of a scaffold protein (e.g., an antibody, zinc-finger, polypeptide hormone, etc.). In yet another embodiment, the test polypeptides are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches). The plurality of test amino acid sequences can include a plurality from a first source, and a plurality from a second source. For example, the test amino acid sequences on half the addresses of an array are from a diseased tissue or a first species, whereas the sequences on the remaining half are from a normal tissue or a second species.
The method can further include detecting the first or the second test amino acid sequence at each address of the plurality.
In another preferred embodiment using a first and a second vector nucleic acid, one test amino acid sequence is capable of modifying the other (e.g., making or breaking a bond, preferably a covalent bond, of the other). For example, the first amino acid sequence is a kinase capable of phosphorylating the second amino acid sequence; the first is a methylase capable of methylating the second; the first is a ubiquitin ligase capable of ubiquitinating the second; the first is a protease capable of cleaving the second; and so forth. The method can further include detecting the modification at each address of the plurality.
These embodiments can be used to identify an interaction or to identify a compound that modulates, e.g., inhibits or enhances, an interaction.
In another aspect, the invention features a method of providing an array of polypeptides. The method includes: (1) providing or obtaining a substrate with a plurality of addresses, each address of the plurality including (i) a nucleic acid encoding an amino acid sequence comprising a test amino acid sequence and an affinity tag, and (ii) a binding agent that recognizes the affinity tag; (2) contacting each address of the plurality with a translation effector to thereby translate the hybrid amino acid sequence; and (3) maintaining the substrate under conditions permissive for the amino acid sequence to bind the binding agent.
In one embodiment, the nucleic acid provided on the substrate is synthesized in situ, e.g., by light-directed chemistry. In another embodiment, each address of the plurality is provided with a nucleic acid, e.g., by pipetting, spotting, printing (e.g., with pins), piezoelectric delivery, or, e.g., other means of mechanical delivery. In a preferred embodiment, the provided nucleic acid is a template nucleic acid, and the method further includes amplifying the template, e.g., by PCR, NASBA, or RCA. The method can further include transcribing the nucleic acid to produce one or more RNA molecules encoding the test amino acid sequence.
The method can further include washing the substrate, e.g., after sufficient contact with a translation effector. The wash step can be repeated, e.g., one or more times, e.g., until a translation effector or translation effector component is removed. The wash step can remove unbound proteins. The stringency of the wash step can vary, e.g., the salt, pH, and buffer composition of the wash buffer can vary. For example, if the translated test polypeptide is covalently captured, or captured by an interaction resistant to chaotropes (e.g., binding of a 6-histidine motif to Ni2+ NTA), the substrate can be washed with a chaotrope (e.g., guanidinium hydrochloride or urea). In a subsequent step, the chaotrope can itself be washed from the array, and the polypeptides renatured.
In one embodiment, the nucleic acid sequence also encodes a cleavage site, e.g., a protease site, e.g., between the test amino acid sequence and the affinity tag. The method can further include contacting an address of the array with a protease that specifically recognizes the site.
The method can further include contacting the substrate with a second substrate. For example, in an embodiment wherein the substrate is a gel, the gel can be contacted with a second gel, and the contents of one gel can be transferred to another (e.g., by diffusion or electrophoresis). The method can include disrupting the binding between the affinity tag and the binding agent or between the binding agent and the substrate prior to transfer.
The method can further include contacting the substrate with living cells, and detecting an address wherein a parameter of the cell is altered relative to another address.
In a preferred embodiment, each test amino acid sequence in the plurality of addresses is unique. For example, a test amino acid sequence can differ from all other test amino acid sequences of the plurality by one or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the test amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other test amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the nucleic acid at each address of the plurality is the same as, or substantially identical to, all other affinity tags in the plurality of addresses. In another preferred embodiment, the nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
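By way of illustration only, the uniqueness condition described above can be checked computationally before an array is prepared. The following minimal Python sketch counts amino acid differences between the test sequences assigned to each pair of addresses. The address labels, the sequences, and the function names are hypothetical, and the position-by-position comparison (no alignment) is a simplifying assumption rather than a required method.

```python
from itertools import combinations


def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions at which two amino acid sequences differ.

    Sequences are compared position by position without alignment; each
    extra residue in the longer sequence counts as one difference.
    """
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches + abs(len(seq_a) - len(seq_b))


def min_pairwise_differences(sequences: dict[str, str]) -> int:
    """Return the smallest difference count over all pairs of addresses."""
    return min(
        count_differences(sequences[a], sequences[b])
        for a, b in combinations(sequences, 2)
    )


# Hypothetical addresses and test sequences, for illustration only.
array_sequences = {
    "A1": "MKTAYIAK",
    "A2": "MKTAYIAR",
    "A3": "MKSAYLAK",
}

# Every test sequence is unique when the minimum pairwise count is at least 1.
all_unique = min_pairwise_differences(array_sequences) >= 1
```

A larger threshold can be substituted for 1 where a greater minimum number of differences between addresses is desired.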
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid sequence by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
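The tag and linker arrangements described above can be sketched at the level of the encoded amino acid sequence. The following Python example is a hedged illustration: the function name `build_fusion`, the example test sequence, and the particular glycine/serine linker are assumptions chosen for demonstration, and details such as the initiator methionine of an amino-terminal tag are deliberately ignored.

```python
def build_fusion(test_seq: str, affinity_tag: str, linker: str = "",
                 tag_position: str = "N") -> str:
    """Assemble a hybrid amino acid sequence from its parts.

    tag_position "N" places the tag (and linker) amino-terminal to the test
    sequence; "C" places them carboxy-terminal. An empty linker gives a
    direct fusion.
    """
    if tag_position == "N":
        return affinity_tag + linker + test_seq
    if tag_position == "C":
        return test_seq + linker + affinity_tag
    raise ValueError("tag_position must be 'N' or 'C'")


# Hypothetical parts, for illustration only.
test_sequence = "MKTAYIAK"   # stand-in test amino acid sequence
his_tag = "HHHHHH"           # 6-histidine affinity tag
flexible_linker = "GGSGGS"   # six flexible residues (glycine and serine)

n_terminal_fusion = build_fusion(test_sequence, his_tag, flexible_linker, "N")
c_terminal_direct = build_fusion(test_sequence, his_tag, tag_position="C")
```

Here `n_terminal_fusion` carries the tag and linker ahead of the test sequence, while `c_terminal_direct` is a direct carboxy-terminal fusion with no linker.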
The nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed “open reading frames”), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like) or luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted, and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
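One way such identifier sequences might be chosen is by rejection sampling, sketched below in Python. This is an illustrative assumption, not a prescribed procedure: the function names, the 12-nucleotide length, and the example sequences are hypothetical, and the check covers only exact identity and exact reverse complementarity, not partial complementarity.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def pick_identifiers(sequences, length=12, seed=0, max_tries=10000):
    """Pick one identifier per nucleic acid sequence.

    A random candidate is rejected if it, or its reverse complement, matches
    an identifier already chosen or occurs within any of the sequences.
    """
    rng = random.Random(seed)
    chosen = []
    for _sequence in sequences:
        for _attempt in range(max_tries):
            candidate = "".join(rng.choice("ACGT") for _ in range(length))
            rc = reverse_complement(candidate)
            clashes_with_identifier = any(c in (candidate, rc) for c in chosen)
            clashes_with_sequence = any(
                candidate in s or rc in s for s in sequences
            )
            if not clashes_with_identifier and not clashes_with_sequence:
                chosen.append(candidate)
                break
        else:
            raise RuntimeError("no acceptable identifier found")
    return chosen


# Hypothetical nucleic acid sequences, for illustration only.
example_sequences = [
    "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTT",
    "ATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTC",
]
identifiers = pick_identifiers(example_sequences, length=12, seed=1)
```

Fixing the seed makes the selection reproducible; longer identifiers reduce the chance of a clash as the number of addresses grows.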
The test amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The nucleic acid sequences encoding the test amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The test amino acid sequences can be encoded by genes expressed in a tissue, e.g., a normal or diseased tissue. The test polypeptides can be mutants or variants of a scaffold protein (e.g., an antibody, zinc-finger, polypeptide hormone, etc.). In yet another embodiment, the test polypeptides are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches). The plurality of test amino acid sequences can include a plurality from a first source and a plurality from a second source. For example, the test amino acid sequences on half the addresses of an array are from a diseased tissue or a first species, whereas the sequences on the remaining half are from a normal tissue or a second species.
In a preferred embodiment, each address of the plurality further includes one or more second nucleic acids, e.g., a plurality of unique nucleic acids. Hence, the plurality in toto can encode a plurality of test sequences. For example, each address of the plurality can encode a pool of test polypeptide sequences, e.g., a subset of a library or clone bank. A second array can be provided in which each address of the plurality of the second array includes a single or subset of members of the pool present at an address of the first array. The first and the second array can be used consecutively.
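The consecutive use of a pooled first array and a second array of individual pool members can be sketched as follows. The Python example is a minimal illustration under stated assumptions: the clone names, the address labels (`P1`, `S1`, and so on), and both function names are hypothetical.

```python
def make_pooled_array(clones, pool_size):
    """Lay out a first array in which each address holds a pool of clones."""
    return {
        f"P{start // pool_size + 1}": clones[start:start + pool_size]
        for start in range(0, len(clones), pool_size)
    }


def make_second_array(pool):
    """Lay out a second array with one member of the pool per address."""
    return {f"S{i + 1}": [clone] for i, clone in enumerate(pool)}


# Hypothetical clone bank, for illustration only.
clone_bank = [f"clone{n:02d}" for n in range(1, 7)]

first_array = make_pooled_array(clone_bank, pool_size=3)

# Suppose address P2 of the first array gave a signal of interest; its pool
# members are then arrayed individually on the second array.
second_array = make_second_array(first_array["P2"])
```

In this sketch a hit at one pooled address is resolved to a single clone by reading the second array, so that fewer addresses are needed overall than if every clone were arrayed individually from the start.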
In other preferred embodiments, each address of the plurality further includes a second nucleic acid encoding a second amino acid sequence.
In one preferred embodiment, each address of the plurality includes a first test amino acid sequence that is common to all addresses of the plurality, and a second test amino acid sequence that is unique among all the addresses of the plurality. For example, the second test amino acid sequences can be query sequences whereas the first test amino acid sequence can be a target sequence. In another preferred embodiment, each address of the plurality includes a first test amino acid sequence that is unique among all the addresses of the plurality, and a second test amino acid sequence that is common to all addresses of the plurality. For example, the first test amino acid sequences can be query sequences whereas the second test amino acid sequence can be a target sequence. The second nucleic acid encoding the second test amino acid sequence can include a sequence encoding a recognition tag and/or an affinity tag.
At at least one address of the plurality, the first and second amino acid sequences can be such that they interact with one another. In one preferred embodiment, they are capable of binding to each other. The second test amino acid sequence is optionally fused to a detectable amino acid sequence, e.g., an epitope tag, an enzyme, a fluorescent protein (e.g., GFP, BFP, variants thereof). The second test amino acid sequence can be itself detectable (e.g., an antibody is available which specifically recognizes it). The method can further include detecting the second test amino acid sequence at each address of the plurality, e.g., by detecting the detectable amino acid sequence (e.g., the epitope tag, enzyme or fluorescent protein).
In another preferred embodiment, one amino acid sequence is capable of modifying the other (e.g., making or breaking a bond, preferably a covalent bond, of the other). For example, the first amino acid sequence is a kinase capable of phosphorylating the second amino acid sequence; the first is a methylase capable of methylating the second; the first is a ubiquitin ligase capable of ubiquitinating the second; the first is a protease capable of cleaving the second; and so forth. The method can further include detecting the modification at each address of the plurality.
These embodiments can be used to identify an interaction or to identify a compound that modulates, e.g., inhibits or enhances, an interaction.
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate). In yet another embodiment, an insoluble substrate (e.g., a bead or particle) is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
In another aspect, the invention features a method of evaluating, e.g., identifying, a polypeptide-polypeptide interaction. The method includes: (1) providing or obtaining a substrate with a plurality of addresses, each address of the plurality comprising (i) a first nucleic acid encoding an amino acid sequence comprising a first amino acid sequence and an affinity tag, (ii) a binding agent that recognizes the affinity tag, and (iii) a second nucleic acid encoding a second amino acid sequence; (2) contacting each address of the plurality with a translation effector to thereby translate the first nucleic acid and the second nucleic acid to synthesize the first and second amino acid sequences; and optionally (3) maintaining the substrate under conditions permissive for the hybrid amino acid sequence to bind the binding agent.
In one preferred embodiment, the first amino acid sequence is common to all addresses of the plurality, and the second amino acid sequence is unique among all the addresses of the plurality. For example, the second test amino acid sequences can be query sequences whereas the first test amino acid sequence can be a target sequence. In another preferred embodiment, the first amino acid sequence is unique among all the addresses of the plurality, and the second amino acid sequence is common to all addresses of the plurality. For example, the first test amino acid sequences can be query sequences whereas the second test amino acid sequence can be a target sequence. The second nucleic acid encoding the second test amino acid sequence can include a sequence encoding a recognition tag and/or an affinity tag.
The method can further include detecting the presence of the second amino acid sequence at each of the plurality of addresses.
In one preferred embodiment, the second nucleic acid sequence also encodes a polypeptide tag. The polypeptide tag can be an epitope (e.g., recognized by a monoclonal antibody), or a binding agent (e.g., avidin or streptavidin, GST, or chitin binding protein). The detection of the second amino acid sequence can entail contacting each address of the plurality with a binding agent, e.g., a labeled biotin moiety, labeled glutathione, labeled chitin, a labeled antibody, etc. In another embodiment, each address of the plurality is contacted with an antibody specific to the second amino acid sequence.
In another preferred embodiment, the second nucleic acid sequence includes a recognition tag. The recognition tag can be an epitope tag, enzyme or fluorescent protein. Examples of enzymes include horseradish peroxidase, alkaline phosphatase, luciferase, or cephalosporinase. The method can further include contacting each address of the plurality with an appropriate cofactor and/or substrate for the enzyme. Examples of fluorescent proteins include green fluorescent protein (GFP), and variants thereof, e.g., enhanced GFP, blue fluorescent protein (BFP), cyan FP, etc. The detection of the second amino acid sequence can entail monitoring fluorescence, assessing enzyme activity, measuring an added binding agent, e.g., a labeled biotin moiety, a labeled antibody, etc.
In another preferred embodiment, one amino acid sequence is capable of modifying the other (e.g., making or breaking a bond, preferably a covalent bond, of the other). For example, the first amino acid sequence is a kinase capable of phosphorylating the second amino acid sequence; the first is a methylase capable of methylating the second; the first is a ubiquitin ligase capable of ubiquitinating the second; the first is a protease capable of cleaving the second; and so forth. The method can further include detecting the modification at each address of the plurality.
These embodiments can be used to identify an interaction or to identify a compound that modulates, e.g., inhibits or enhances, an interaction. For example, the method can further include contacting each address of the plurality with a compound, e.g., a small organic molecule, a polypeptide, or a nucleic acid, to thereby determine if the compound alters the interaction between the first and second amino acid sequences.
In one preferred embodiment, the first amino acid sequence is a drug candidate, e.g., a random peptide, a randomized or mutated scaffold protein, or a secreted protein (e.g., a cell surface protein, an ectodomain of a transmembrane protein, an antibody, or a polypeptide hormone); and the second amino acid sequence is a drug target. A first amino acid sequence at an address where an interaction between the first amino acid sequence and the second amino acid sequence is detected can be used as a candidate amino acid sequence for additional refinement or as a drug. The first amino acid sequence can be administered to a subject. A nucleic acid encoding the first amino acid sequence can be administered to a subject. In a related preferred embodiment, the first amino acid sequence is the drug target, and the second amino acid sequence is the drug candidate.
In a preferred embodiment, each first amino acid sequence in the plurality of addresses is unique. For example, a first amino acid sequence can differ from all other first amino acid sequences of the plurality by one or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the first amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other first amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the first nucleic acid at each address of the plurality is the same as, or substantially identical to, all other affinity tags in the plurality of addresses. In another preferred embodiment, the first nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the first nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid sequence by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
The first and/or second nucleic acid can be an RNA or a DNA (e.g., a single-stranded DNA or a double-stranded DNA). In a preferred embodiment, the first and/or second nucleic acid includes a plasmid DNA or a fragment thereof; an amplification product (e.g., a product generated by RCA, PCR, NASBA); or a synthetic DNA.
The first and/or second nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed “open reading frames”), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like) or luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the first and/or second nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the first and/or second nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The first nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The first and/or second nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
The first and/or second amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The first and/or second nucleic acid sequences encoding the first and/or second amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The first and/or second nucleic acid sequences can be nucleic acids expressed in a tissue, e.g., a normal or diseased tissue. The first and/or second amino acid sequences can be mutants or variants of a scaffold protein (e.g., an antibody, zinc-finger, polypeptide hormone, etc.). In yet another embodiment, they are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches).
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate).
In yet another embodiment, an insoluble substrate (e.g., a bead or particle) is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
In another aspect, the invention features a method of evaluating, e.g., identifying a polypeptide-polypeptide interaction. The method includes: (1) providing or obtaining an array made by the following process: (A) providing or obtaining a substrate with a plurality of addresses, each address having a binding agent that recognizes an affinity tag; (B) disposing in or on each address of the plurality (i) a first nucleic acid encoding an amino acid sequence comprising a first amino acid sequence and the affinity tag, and (ii) a second nucleic acid encoding a second amino acid sequence; and, optionally, (C) contacting each address of the plurality with a translation effector to thereby translate the first and second nucleic acid.
The method can further include maintaining the substrate under conditions permissive for the hybrid amino acid sequence to bind the binding agent. The method can further include detecting the presence of the second amino acid sequence at each of the plurality of addresses.
In one preferred embodiment, the first amino acid sequence is common to all addresses of the plurality, and the second amino acid sequence is unique among all the addresses of the plurality. For example, the second test amino acid sequences can be query sequences whereas the first test amino acid sequence can be a target sequence. In another preferred embodiment, the first amino acid sequence is unique among all the addresses of the plurality, and the second amino acid sequence is common to all addresses of the plurality. For example, the first test amino acid sequences can be query sequences whereas the second test amino acid sequence can be a target sequence. The second nucleic acid encoding the second test amino acid sequence can include a sequence encoding a recognition tag and/or an affinity tag.
The method can further include detecting the presence of the second amino acid sequence at each of the plurality of addresses.
In one preferred embodiment, the second nucleic acid sequence also encodes a polypeptide tag. The polypeptide tag can be an epitope (e.g., recognized by a monoclonal antibody), or a binding agent (e.g., avidin or streptavidin, GST, or chitin binding protein). The detection of the second amino acid sequence can entail contacting each address of the plurality with a binding agent, e.g., a labeled biotin moiety, labeled glutathione, labeled chitin, a labeled antibody, etc. In another embodiment, each address of the plurality is contacted with an antibody specific to the second amino acid sequence.
In another preferred embodiment, the second nucleic acid sequence includes a recognition tag. The recognition tag can be an epitope tag, enzyme or fluorescent protein. Examples of enzymes include horseradish peroxidase, alkaline phosphatase, luciferase, or cephalosporinase. The method can further include contacting each address of the plurality with an appropriate cofactor and/or substrate for the enzyme. Examples of fluorescent proteins include green fluorescent protein (GFP), and variants thereof, e.g., enhanced GFP, blue fluorescent protein (BFP), cyan FP, etc. The detection of the second amino acid sequence can entail monitoring fluorescence, assessing enzyme activity, measuring an added binding agent, e.g., a labeled biotin moiety, a labeled antibody, etc.
In another preferred embodiment, one amino acid sequence is capable of modifying the other (e.g., making or breaking a bond, preferably a covalent bond, of the other). For example, the first amino acid sequence is a kinase capable of phosphorylating the second amino acid sequence; the first is a methylase capable of methylating the second; the first is a ubiquitin ligase capable of ubiquitinating the second; the first is a protease capable of cleaving the second; and so forth. The method can further include detecting the modification at each address of the plurality.
These embodiments can be used to identify an interaction or to identify a compound that modulates, e.g., inhibits or enhances, an interaction. For example, the method can further include contacting each address of the plurality with a compound, e.g., a small organic molecule, a polypeptide, or a nucleic acid to thereby determine if the compound alters the interaction between the first and second amino acid.
In one preferred embodiment, the first amino acid sequence is a drug candidate, e.g. a random peptide, a randomized or mutated scaffold protein, or a secreted protein (e.g., a cell surface protein, an ectodomain of a transmembrane protein, an antibody, or a polypeptide hormone); and the second amino acid sequence is a drug target. A first amino acid sequence at an address where an interaction between the first amino acid sequence and the second amino acid is detected can be used as a candidate amino acid sequence for additional refinement or as a drug. The first amino acid sequence can be administered to a subject. A nucleic acid encoding the first amino acid sequence can be administered to a subject. In a related preferred embodiment, the first amino acid sequence is the drug target, and the second amino acid sequence is the drug candidate.
In a preferred embodiment, each first amino acid sequence in the plurality of addresses is unique. For example, a first amino acid sequence can differ from all other test amino acid sequences of the plurality by one or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the first amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other first amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the first nucleic acid at each address of the plurality is the same as, or substantially identical to, all other affinity tags in the plurality of addresses. In another preferred embodiment, the first nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the first nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
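By way of illustration only, the uniqueness condition described above can be checked computationally. The following Python sketch counts position-wise amino acid differences between test sequences and reports the smallest number of differences between any pair. The function names and the restriction to sequences of equal length are assumptions made for this example and are not part of the method.

```python
from itertools import combinations


def count_differences(seq_a: str, seq_b: str) -> int:
    """Count position-wise amino acid differences between two sequences.

    Assumption for this sketch: both sequences have the same length, so
    insertions and deletions are not considered.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares sequences of equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def min_pairwise_differences(sequences) -> int:
    """Return the smallest difference count over all pairs of sequences.

    Requires at least two sequences.
    """
    return min(count_differences(a, b) for a, b in combinations(sequences, 2))


def all_unique(sequences) -> bool:
    """True if every sequence differs from every other by at least one residue."""
    return min_pairwise_differences(sequences) >= 1


# Hypothetical one-letter-code test sequences, one per address.
test_sequences = ["MKTAY", "MKTAV", "MRTAV"]
print(min_pairwise_differences(test_sequences))
print(all_unique(test_sequences))
```

A set of sequences for which `all_unique` is true corresponds to the embodiment in which each first amino acid sequence is unique; a larger minimum difference count corresponds to a more divergent set.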
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
The first and/or second nucleic acid can be an RNA or a DNA (e.g., a single-stranded DNA or a double-stranded DNA). In a preferred embodiment, the first and/or second nucleic acid includes a plasmid DNA or a fragment thereof; an amplification product (e.g., a product generated by RCA, PCR, or NASBA); or a synthetic DNA.
The first and/or second nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed “open reading frames”), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like), or luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the first and/or second nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the first and/or second nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The first nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The first and/or second nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
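By way of illustration only, the selection criterion for identifier sequences can be sketched in Python as follows. The sketch checks that no identifier is identical or reverse-complementary to another identifier, and that no identifier (or its reverse complement) occurs within any of the supplied nucleic acid sequences. The function names are hypothetical, and it is assumed here that the supplied sequences exclude the identifier itself and contain only the bases A, C, G, and T.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def identifiers_are_distinct(identifiers, sequences=()) -> bool:
    """Check a candidate set of identifier sequences.

    Returns False if any identifier is identical or reverse-complementary
    to another identifier, or if any identifier (or its reverse complement)
    occurs as a region of one of the supplied sequences.
    """
    seen = set()
    for ident in (i.upper() for i in identifiers):
        rc = reverse_complement(ident)
        if ident in seen or rc in seen:
            return False
        seen.add(ident)
        for seq in (s.upper() for s in sequences):
            if ident in seq or rc in seq:
                return False
    return True


# Hypothetical identifiers and a hypothetical encoded sequence.
print(identifiers_are_distinct(["ACGTAC", "TTGACA"], ["ATGGCCAAGGGC"]))
```

Exact matching is used here for simplicity; a practical selection scheme might also reject near-matches, which this sketch does not attempt.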
The first and/or second amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The first and/or second nucleic acid sequences encoding the first and/or second amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The first and/or second nucleic acid sequences can be nucleic acids expressed in a tissue, e.g., a normal or diseased tissue. The first and/or second amino acid sequences can be mutants or variants of a scaffold protein (e.g., an antibody, a zinc finger, a polypeptide hormone, etc.). In yet another embodiment, they are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches).
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate).
In yet another embodiment, an insoluble substrate (e.g., a bead or particle) is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
In another aspect, the invention features a method of evaluating, e.g., identifying, a polypeptide-polypeptide interaction. The method includes: (1) providing or obtaining an array made by the following production method: (A) providing or obtaining a substrate with a plurality of addresses, each address of the plurality comprising (i) a first nucleic acid encoding a hybrid amino acid sequence comprising a first amino acid sequence and an affinity tag, (ii) a binding agent that recognizes the affinity tag, and (iii) a second nucleic acid encoding a second amino acid sequence; and (B) contacting each address of the plurality with a translation effector to thereby translate the first and second nucleic acid sequences. The evaluation method further includes: (2) at each of the plurality of addresses, detecting at least one parameter selected from the group consisting of: (i) the proximity of the second amino acid sequence to the first amino acid sequence; (ii) the proximity of the second amino acid sequence to the substrate or a compound bound thereto; (iii) the rotational freedom of the second amino acid sequence; and (iv) the refractive index of the substrate. The evaluation method can optionally include, e.g., prior to the detecting step, (3) maintaining the substrate under conditions permissive for the hybrid amino acid sequence to bind the binding agent.
The method can further include washing the substrate prior to the detection step. The stringency of the wash step can be adjusted in order to remove the translation effector and non-specifically bound proteins.
In one preferred embodiment, the first amino acid sequence is common to all addresses of the plurality, and a second test amino acid sequence is unique among all the addresses of the plurality. For example, the second test amino acid sequences can be query sequences whereas the first test amino acid sequence can be a target sequence. In another preferred embodiment, the first amino acid sequence is unique among all the addresses of the plurality, and the second amino acid sequence is common to all addresses of the plurality. For example, the first test amino acid sequences can be query sequences whereas the second test amino acid sequence can be a target sequence. The second nucleic acid encoding the second test amino acid sequence can include a sequence encoding a recognition tag and/or an affinity tag.
The method can further include detecting the presence of the second amino acid sequence at each of the plurality of addresses.
In one preferred embodiment, the second nucleic acid sequence also encodes a polypeptide tag. The polypeptide tag can be an epitope (e.g., recognized by a monoclonal antibody), or a binding agent (e.g., avidin or streptavidin, GST, or chitin binding protein). The detection of the second amino acid sequence can entail contacting each address of the plurality with a binding agent, e.g., a labeled biotin moiety, labeled glutathione, labeled chitin, a labeled antibody, etc. In another embodiment, each address of the plurality is contacted with an antibody specific to the second amino acid sequence. The antibody can be labeled, e.g., with a fluorophore.
In another preferred embodiment, the second nucleic acid sequence includes a recognition tag. The recognition tag can be an epitope tag, enzyme or fluorescent protein. Examples of enzymes include horseradish peroxidase, alkaline phosphatase, luciferase, or cephalosporinase. The method can further include contacting each address of the plurality with an appropriate cofactor and/or substrate for the enzyme. Examples of fluorescent proteins include green fluorescent protein (GFP), and variants thereof, e.g., enhanced GFP, blue fluorescent protein (BFP), cyan FP, etc.
The method can further include contacting each address of the plurality with a compound, e.g., a small organic molecule, a polypeptide, or a nucleic acid, to thereby determine whether the compound alters the interaction between the first and second amino acid sequences.
In one preferred embodiment, the first amino acid sequence is a drug candidate, e.g., a random peptide, a randomized or mutated scaffold protein, or a secreted protein (e.g., a cell surface protein, an ectodomain of a transmembrane protein, an antibody, or a polypeptide hormone); and the second amino acid sequence is a drug target. A first amino acid sequence at an address where an interaction between the first amino acid sequence and the second amino acid sequence is detected can be used as a candidate amino acid sequence for additional refinement or as a drug. The first amino acid sequence can be administered to a subject. A nucleic acid encoding the first amino acid sequence can be administered to a subject. In a related preferred embodiment, the first amino acid sequence is the drug target, and the second amino acid sequence is the drug candidate.
In a preferred embodiment, each first amino acid sequence in the plurality of addresses is unique. For example, a first amino acid sequence can differ from all other test amino acid sequences of the plurality by one or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the first amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other first amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the first nucleic acid at each address of the plurality is the same as, or substantially identical to, all other affinity tags in the plurality of addresses. In another preferred embodiment, the first nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the first nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
The first and/or second nucleic acid can be an RNA or a DNA (e.g., a single-stranded DNA or a double-stranded DNA). In a preferred embodiment, the first and/or second nucleic acid includes a plasmid DNA or a fragment thereof; an amplification product (e.g., a product generated by RCA, PCR, or NASBA); or a synthetic DNA.
The first and/or second nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed “open reading frames”), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like), or luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the first and/or second nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the first and/or second nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The first nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The first and/or second nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
The first and/or second amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The first and/or second nucleic acid sequences encoding the first and/or second amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The first and/or second nucleic acid sequences can be nucleic acids expressed in a tissue, e.g., a normal or diseased tissue. The first and/or second amino acid sequences can be mutants or variants of a scaffold protein (e.g., an antibody, a zinc finger, a polypeptide hormone, etc.). In yet another embodiment, they are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches).
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate). In yet another embodiment, an insoluble substrate (e.g., a bead or particle) is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
In another aspect, the invention features a method of identifying an enzyme substrate or cofactor. The method includes: (1) providing a substrate with a plurality of addresses, each address of the plurality comprising (i) a first nucleic acid encoding a hybrid amino acid sequence comprising a first amino acid sequence and an affinity tag, (ii) a binding agent that recognizes the affinity tag and is attached to the substrate, and (iii) a second nucleic acid encoding an enzyme; (2) contacting each address of the plurality with a translation effector to thereby translate the first and second nucleic acid sequences; (3) maintaining the substrate under conditions permissive for the hybrid amino acid sequence to bind the binding agent and for activity of the enzyme; and (4) detecting the activity of the enzyme at each address of the plurality.
In one embodiment, the first amino acid sequence varies among the addresses of the plurality. In another embodiment, the second nucleic acid varies among the addresses of the plurality. The method can further include contacting each address of the plurality with an enzyme substrate (e.g., a radioactively or otherwise labeled substrate such as ATP, GTP, S-adenosylmethionine, ubiquitin, and so forth) or a cofactor, e.g., NADH, NADPH, or FAD. A substrate or cofactor can be provided with the translation effector.
The detecting step can include monitoring a protein bound by a labeled binding agent (radioactive or otherwise), e.g., after a wash step. The label can be present in solution (e.g., as a cofactor or reaction substrate) and can be transferred to the first amino acid sequence by the enzyme, e.g., such that the label is covalently attached to the first amino acid sequence (e.g., as in phosphorylation). The label can be present in solution and can be bound to the first amino acid sequence (e.g., non-covalently) as a result of an enzyme-catalyzed or enzyme-assisted reaction (e.g., the enzyme can effect a conformational change in the first amino acid sequence, such as a GTP exchange factor protein acting on a GTP binding protein).
In one preferred embodiment, the first amino acid sequence is common to all addresses of the plurality, and a second test amino acid sequence is unique among all the addresses of the plurality. For example, the second test amino acid sequences can be query sequences whereas the first test amino acid sequence can be a target sequence. In another preferred embodiment, the first amino acid sequence is unique among all the addresses of the plurality, and the second amino acid sequence is common to all addresses of the plurality. For example, the first test amino acid sequences can be query sequences whereas the second test amino acid sequence can be a target sequence. The second nucleic acid encoding the second test amino acid sequence can include a sequence encoding a recognition tag and/or an affinity tag.
In a preferred embodiment, each first amino acid sequence in the plurality of addresses is unique. For example, a first amino acid sequence can differ from all other test amino acid sequences of the plurality by one or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the first amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other first amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the first nucleic acid at each address of the plurality is the same as, or substantially identical to, all other affinity tags in the plurality of addresses. In another preferred embodiment, the first nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the first nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
The first and/or second nucleic acid can be an RNA or a DNA (e.g., a single-stranded DNA or a double-stranded DNA). In a preferred embodiment, the first and/or second nucleic acid includes a plasmid DNA or a fragment thereof; an amplification product (e.g., a product generated by RCA, PCR, or NASBA); or a synthetic DNA.
The first and/or second nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed “open reading frames”), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like), or luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the first and/or second nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the first and/or second nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The first nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The first and/or second nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
The first and/or second amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The first and/or second nucleic acid sequences encoding the first and/or second amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The first and/or second nucleic acid sequences can be nucleic acids expressed in a tissue, e.g., a normal or diseased tissue. The first and/or second amino acid sequences can be mutants or variants of a scaffold protein (e.g., an antibody, a zinc finger, a polypeptide hormone, etc.). In yet another embodiment, they are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches).
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate). In yet another embodiment, an insoluble substrate (e.g., a bead or particle) is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
In another aspect, the invention features a method of producing a protein-interaction map for a plurality of amino acid sequences. The method includes: (1) providing (i) a first plurality of nucleic acid sequences, each encoding an amino acid sequence comprising an amino acid sequence of the plurality of amino acid sequences and an affinity tag; (ii) a second plurality of nucleic acid sequences, each encoding an amino acid sequence comprising an amino acid sequence of the plurality of amino acid sequences and a recognition tag; and (iii) a substrate with a plurality of addresses and a binding agent that binds the affinity tag and is attached to the substrate; (2) disposing on the substrate, at each address of the plurality of addresses, a nucleic acid of the first plurality and a nucleic acid of the second plurality; (3) contacting each address of the plurality of addresses with a translation effector to thereby translate the first and second nucleic acid sequences; (4) maintaining the substrate under conditions permissive for the affinity tag to bind the binding agent; (5) optionally washing the substrate to remove the translation effector and unbound polypeptides; and (6) detecting the recognition tag at each address of the plurality.
In a preferred embodiment, all possible pairs of amino acid sequences from the plurality of amino acid sequences are present on the array.
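By way of illustration only, the pairing of every affinity-tagged sequence with every recognition-tagged sequence can be sketched as follows. The function name `layout_pairs`, the row-by-row address scheme, and the choice to treat pairs as ordered (so that self-pairs and both tag orientations are included) are assumptions made for this sketch and are not specified above.

```python
from itertools import product


def layout_pairs(names, columns):
    """Map (row, column) addresses to (affinity_tagged, recognition_tagged) pairs.

    Assumption: pairs are ordered, so N sequences occupy N * N addresses,
    filled row by row across a grid that is `columns` wide.
    """
    layout = {}
    for index, pair in enumerate(product(names, repeat=2)):
        layout[divmod(index, columns)] = pair
    return layout


# Hypothetical example: three sequences fill a 3 x 3 grid of addresses.
layout = layout_pairs(["A", "B", "C"], columns=3)
```

Under these assumptions, address (0, 1) receives the nucleic acid encoding affinity-tagged A together with the nucleic acid encoding recognition-tagged B.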
Also featured is a database, e.g., in computer memory or a computer readable medium. Each record of the database can include a field for the amino acid sequence encoded by the first nucleic acid sequence, a field for the amino acid sequence encoded by the second nucleic acid sequence, and a field representing the result (e.g., a qualitative or quantitative result) of detecting the recognition tag in the aforementioned method. The database can include a record for each address of the plurality present on the array. Further the database can include a descriptor or reference for the physical location of the nucleic acid sequence on the array. The records can be clustered or have a reference to other records (e.g., including hierarchical groupings) based on the result.
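One possible shape for such a record is sketched below. The class name `InteractionRecord`, its field names, the use of a numeric signal as the result, and the thresholding helper are hypothetical choices for illustration; the description above requires only the two sequence fields, a result field, and optionally a location descriptor.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InteractionRecord:
    first_sequence: str    # amino acid sequence encoded by the first nucleic acid
    second_sequence: str   # amino acid sequence encoded by the second nucleic acid
    result: float          # quantitative result of detecting the recognition tag
    address: Optional[Tuple[int, int]] = None  # physical location on the array


def records_at_or_above(records: List[InteractionRecord], threshold: float):
    """Group records by result: return those whose signal meets the threshold."""
    return [record for record in records if record.result >= threshold]


# Hypothetical data: two addresses with different detected signals.
records = [
    InteractionRecord("MKTAYIAK", "MSDNGELE", 0.92, (0, 0)),
    InteractionRecord("MKTAYIAK", "MKTAYIAK", 0.05, (0, 1)),
]
hits = records_at_or_above(records, threshold=0.5)
```

A qualitative result could be stored instead by replacing the numeric field with a boolean or a label.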
Also featured is a method of providing tagged polypeptides. The method includes: (1) providing a substrate with a plurality of addresses, each address of the plurality comprising (i) a nucleic acid encoding an amino acid sequence comprising a test amino acid sequence and an affinity tag, and (ii) a particle attached to a binding agent that recognizes the affinity tag; (2) contacting each address of the plurality with a translation effector to thereby translate the amino acid sequence; and (3) maintaining the substrate under conditions permissive for the amino acid sequence to contact the binding agent.
In one preferred embodiment, the nucleic acid sequence is also attached to the particle.
In another preferred embodiment, the particle, e.g., a bead or nanoparticle, further contains information encoding its identity, e.g., a reference to the address on which it is disposed. The particle can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The particles can be disposed on the substrate such that they can be removed for later analysis. In one embodiment, multiple particles with the same identifier are disposed at each address of the plurality. The particles can be collected after translation and attachment of the amino acid sequence. The particles can then be subdivided into aliquots. A particle with a given property, e.g., the ability to bind a labeled compound can be identified. The identity of the particle can be determined to thereby identify the amino acid sequence attached to the particle.
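The decoding step described above, in which the identity of a selected particle leads back to the attached amino acid sequence, can be sketched as two lookups. The function name, the identifier strings, and the dictionary-based bookkeeping are assumptions for illustration; how an identifier is physically read from a chemical or electronic tag is outside the scope of this sketch.

```python
def identify_sequences(selected_particle_ids, particle_to_address, address_to_sequence):
    """Return {particle identifier: test amino acid sequence} for selected particles.

    Each identifier refers to the address where the particle was disposed, and
    each address holds the nucleic acid encoding one test amino acid sequence.
    """
    identified = {}
    for particle_id in selected_particle_ids:
        address = particle_to_address[particle_id]
        identified[particle_id] = address_to_sequence[address]
    return identified


# Hypothetical bookkeeping for two addresses.
particle_to_address = {"tag-001": (0, 0), "tag-002": (0, 1)}
address_to_sequence = {(0, 0): "MKTAYIAK", (0, 1): "MSDNGELE"}

# Suppose a particle carrying identifier "tag-002" bound the labeled compound.
found = identify_sequences(["tag-002"], particle_to_address, address_to_sequence)
```

Because multiple particles at one address can share an identifier, the same lookup applies to every aliquot drawn from the pooled particles.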
In a preferred embodiment, each test amino acid sequence in the plurality of addresses is unique. For example, a test amino acid sequence can differ from all other test amino acid sequences of the plurality by one or more amino acids (e.g., by about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, by about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the test amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other test amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the nucleic acid at each address of the plurality is the same as, or substantially identical to, all other affinity tags in the plurality of addresses. In another preferred embodiment, the nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
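The arrangement of tag, linker, and test amino acid sequence can be illustrated with a small helper. The function name `build_fusion`, the hexahistidine tag, the glycine/serine linker, and the short test sequence are hypothetical examples chosen for this sketch; none of them is prescribed above.

```python
def build_fusion(test_sequence, tag, linker="", tag_position="N"):
    """Join a tag, an optional linker, and a test amino acid sequence.

    tag_position "N" places the tag and linker amino-terminal to the test
    sequence; "C" places them carboxy-terminal. An empty linker gives a
    direct fusion.
    """
    if tag_position == "N":
        return tag + linker + test_sequence
    if tag_position == "C":
        return test_sequence + linker + tag
    raise ValueError("tag_position must be 'N' or 'C'")


# Hypothetical example: a six-residue flexible linker, within the
# "about 3 to 12 amino acids" range mentioned above.
n_terminal = build_fusion("MKTAYIAK", tag="HHHHHH", linker="GGSGGS", tag_position="N")
c_terminal = build_fusion("MKTAYIAK", tag="HHHHHH", linker="GGSGGS", tag_position="C")
direct = build_fusion("MKTAYIAK", tag="HHHHHH")
```

A cleavage site could be modeled in the same way by including its residues in the `linker` argument.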
The nucleic acid can be an RNA or a DNA (e.g., a single-stranded DNA or a double-stranded DNA). In a preferred embodiment, the nucleic acid includes a plasmid DNA or a fragment thereof; an amplification product (e.g., a product generated by RCA, PCR, NASBA); or a synthetic DNA.
The nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed “open reading frames”), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like), or luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted, and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
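The selection criterion for an identifier can be sketched as a screening function. This is a simplified illustration under stated assumptions: the function names are hypothetical, only exact matches are considered (partial complementarity and hybridization stability are ignored), and "complementary" is interpreted as the reverse complement on a DNA alphabet of A, C, G, and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return sequence.translate(_COMPLEMENT)[::-1]


def identifier_is_acceptable(candidate, other_identifiers, sequences):
    """Check a candidate identifier against the criterion described above.

    The candidate is rejected if it is identical or exactly complementary to
    another identifier, or if it (or its reverse complement) occurs anywhere
    within one of the nucleic acid sequences on the array.
    """
    complement = reverse_complement(candidate)
    for other in other_identifiers:
        if other == candidate or other == complement:
            return False
    for sequence in sequences:
        if candidate in sequence or complement in sequence:
            return False
    return True


# Hypothetical 10-nucleotide identifiers and array sequences.
ok = identifier_is_acceptable("AAACCCGGGT", ["ACACACACAC"], ["TTTTTTTTTTTT"])
clash = identifier_is_acceptable("ACGTTGCAAG", ["CTTGCAACGT"], [])
```

In practice the `sequences` argument would hold each nucleic acid sequence without the identifier under test, so that an identifier is not rejected for matching its own insertion.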
The test amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The nucleic acid sequences encoding the test amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The test amino acid sequences can be encoded by genes expressed in a tissue, e.g., a normal or diseased tissue. The test polypeptides can be mutants or variants of a scaffold protein (e.g., an antibody, a zinc finger, a polypeptide hormone, etc.). In yet another embodiment, the test polypeptides are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches). The plurality of test amino acid sequences can include a plurality from a first source and a plurality from a second source. For example, the test amino acid sequences on half the addresses of an array are from a diseased tissue or a first species, whereas the sequences on the remaining half are from a normal tissue or a second species.
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate).
In another aspect, the invention features a method of providing tagged polypeptides. The method includes: providing a substrate with a plurality of addresses, each address of the plurality having a nucleic acid (i) encoding an amino acid sequence comprising: (1) a test amino acid sequence, and (2) a tag; and (ii) a handle; contacting each address of the plurality with a translation effector to thereby translate the nucleic acid sequence; and maintaining the substrate under conditions permissive for the tag to contact the handle to thereby form a complex of the nucleic acid and the test polypeptide having the test amino acid sequence.
In one embodiment, the handle is biotin, and the tag is avidin. For example, the nucleic acid has a biotin covalently attached to a nucleotide. The nucleic acid can be formed by amplification of a template nucleic acid using a synthetic oligonucleotide having a biotin moiety covalently attached at its 5′ end. In another embodiment, the handle is glutathione, and the tag is glutathione-S-transferase. For example, the nucleic acid has a glutathione moiety covalently attached to a nucleotide. The nucleic acid can be formed by amplification of a template nucleic acid using a synthetic oligonucleotide having a glutathione moiety covalently attached at its 5′ end.
In one embodiment, the handle includes a keto group, and the tag is a hydrazine. A covalent bond is formed between the handle and tag.
The method can further include combining the complexes formed at all the addresses into a pool, selecting a polypeptide from the pool, and amplifying the complexed nucleic acid sequence to thereby identify the selected amino acid sequence.
In a preferred embodiment, each test amino acid sequence in the plurality of addresses is unique. For example, a test amino acid sequence can differ from all other test amino acid sequences of the plurality by one or more amino acids (e.g., by about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, by about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the test amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other test amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the nucleic acid at each address of the plurality is the same as, or substantially identical to, all other affinity tags in the plurality of addresses. In another preferred embodiment, the nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
In a preferred embodiment, the tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
The nucleic acid can be an RNA, or a DNA (e.g., a single-stranded DNA, or a double stranded DNA). In a preferred embodiment, the nucleic acid includes a plasmid DNA or a fragment thereof; an amplification product (e.g., a product generated by RCA, PCR, NASBA); or a synthetic DNA.
The nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed “open reading frames”), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like), or luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The nucleic acid can include a sequence encoding a second polypeptide tag in addition to the first tag. The second tag can be C-terminal to the test amino acid sequence and the first tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the first tag can be C-terminal to the test amino acid sequence; the second tag and the first tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first tag.
The nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted, and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
The test amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The nucleic acid sequences encoding the test amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The test amino acid sequences can be encoded by genes expressed in a tissue, e.g., a normal or diseased tissue. The test polypeptides can be mutants or variants of a scaffold protein (e.g., an antibody, a zinc finger, a polypeptide hormone, etc.). In yet another embodiment, the test polypeptides are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches). The plurality of test amino acid sequences can include a plurality from a first source and a plurality from a second source. For example, the test amino acid sequences on half the addresses of an array are from a diseased tissue or a first species, whereas the sequences on the remaining half are from a normal tissue or a second species.
The handle can be attached to the substrate. For example, the substrate can be derivatized and the handle covalently attached thereto. The handle can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the handle is linked to the second member of the binding pair, the second member being attached to the substrate).
In yet another embodiment, an insoluble substrate (e.g., a bead or particle), is disposed at each address of the plurality, and the handle is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
The invention also features a kit which includes: (1) an array comprising a plurality of addresses, wherein each address of the plurality comprises a handle and (2) a vector nucleic acid comprising (i) a promoter; (ii) an entry site; and (iii) a tag encoding sequence, wherein the tag can be attached to the handle.
The vector nucleic acid can include one or more sites for insertion of a test amino acid sequence (e.g., a recombination site or a restriction site), and a sequence encoding a tag. In a preferred embodiment, the vector nucleic acid has two sites for insertion, and a toxic gene inserted between the two sites. In another embodiment, the sites for insertion are homologous recombination or site-specific recombination sites, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, one or both recombination sites lack stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, one or both recombination sites include a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In a much preferred embodiment, the tag is in frame with the translation frame of a nucleic acid sequence (e.g., a sequence to be inserted) encoding a test amino acid sequence. In a preferred embodiment, the tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and tag can be amino-terminal or carboxy-terminal to the test amino acid sequence. The cleavage site can be a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
In one embodiment, the handle includes a keto group, and the tag is a hydrazine. A covalent bond is formed between the handle and tag. The kit can further include an unnatural amino acid having a keto group, e.g., a reactable keto group on a side chain. The kit can also further include a tRNA, and optionally a tRNA synthetase for amino-acylating the tRNA with the unnatural amino acid. The tRNA can be a stop codon suppressing tRNA.
In a preferred embodiment, the kit also includes at least a second vector nucleic acid. The second vector nucleic acid can include one or more sites for insertion of a test amino acid sequence (e.g., a recombination site or a restriction site).
In another embodiment, the kit also includes multiple nucleic acids encoding unique test amino acid sequences. These encoding nucleic acids can be flanked, e.g., on both ends by a site, e.g., a site compatible with the vector nucleic acid (e.g., having sequence for recombination with a sequence in the vector; or having a restriction site which leaves an overhang or blunt end such that the overhang or blunt end can be ligated into the vector nucleic acid (e.g., the restricted vector nucleic acid)).
In another preferred embodiment, the kit also includes a transcription effector and/or a translation effector.
In a preferred embodiment, the second vector nucleic acid has a recognition tag, e.g., an epitope tag, an enzyme, a fluorescent protein (e.g., GFP, BFP, variants thereof).
The first and/or second vector nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed “open reading frames”), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like), or luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter.
In a preferred embodiment, the kit also includes a recombinase, a ligase, and/or a restriction endonuclease. For example, the recombinase can mediate recombination, e.g., site-specific recombination or homologous recombination, between a recombination site on the test nucleic acid and a recombination sequence on the vector nucleic acid. For example, the recombinase can be lambda integrase, HIV integrase, Cre, or FLP recombinase.
In a preferred embodiment, each address of the plurality has a handle capable of recognizing the tag. The handle can be attached to the substrate. For example, the substrate can be derivatized and the handle covalently attached thereto. The handle can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the handle is linked to the second member of the binding pair, the second member being attached to the substrate).
In yet another embodiment, the array of the kit includes an insoluble substrate (e.g., a bead or particle), disposed at each address of the plurality, and the handle is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
The first or second vector nucleic acid can include a sequence encoding a second polypeptide tag in addition to the tag. The second tag can be C-terminal to the test amino acid sequence and the tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the tag can be C-terminal to the test amino acid sequence; the second tag and the tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first tag. Each polypeptide tag of the plurality can be the same as or different from the first tag.
The first or second vector nucleic acid sequence can further include a sequence encoding a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The nucleic acids encoding the test amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The encoding nucleic acids can be nucleic acids (e.g., an mRNA or cDNA) expressed in a tissue, e.g., a normal or diseased tissue. The test polypeptides (i.e., test amino acid sequences) can be mutants or variants of a scaffold protein (e.g., an antibody, a zinc finger, a polypeptide hormone, etc.). In yet another embodiment, the test polypeptides are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches). The plurality of test amino acid sequences can include a plurality from a first source and a plurality from a second source. For example, the test amino acid sequences on half the addresses of an array are from a diseased tissue or a first species, whereas the sequences on the remaining half are from a normal tissue or a second species.
The kit can further include software and/or a database, e.g., in computer memory or a computer readable medium (e.g., a CD-ROM, a magnetic disc, or flash memory). Each record of the database can include a field for the test amino acid sequence encoded by the nucleic acid sequence and a descriptor or reference for the physical location of the encoding nucleic acid sequence in the kit, e.g., location in a microtitre plate. Optionally, the record also includes a field representing a result (e.g., a qualitative or quantitative result) of detecting the polypeptide encoded by the nucleic acid sequence. The database can include a record for each address of the plurality present on the array. The records can be clustered or have a reference to other records (e.g., including hierarchical groupings) based on the result. The software can contain computer readable code to configure a computer-controlled robotic apparatus to manipulate nucleic acids encoding test amino acid sequences and vector nucleic acids in order to insert the encoding nucleic acids into the vector nucleic acids and further to manipulate the insertion products onto addresses of the array.
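By way of illustration only, a database record of the kind described above might be sketched as follows. The class name, field names, and the threshold-based grouping are hypothetical choices made for this example and are not prescribed by the description; grouping by a threshold stands in for any clustering of records based on the result.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KitRecord:
    """One database record; all field names here are hypothetical."""
    test_sequence: str               # test amino acid sequence encoded by the nucleic acid
    plate_id: str                    # microtitre plate holding the encoding nucleic acid
    well: str                        # well within that plate, e.g. "A1"
    result: Optional[float] = None   # optional quantitative detection result


def group_by_result(records, threshold):
    """Group records by comparing the result field with an assumed threshold."""
    groups = {"positive": [], "negative": [], "unscored": []}
    for rec in records:
        if rec.result is None:
            groups["unscored"].append(rec)
        elif rec.result >= threshold:
            groups["positive"].append(rec)
        else:
            groups["negative"].append(rec)
    return groups
```

A record with no result field simply remains unscored, which mirrors the optional nature of the result field.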
The kit can also include instructions for use of the array or a link or indication of a network resource (e.g., a web site) having instructions for use of the array or the above database of records describing the addresses of the array.
A method of providing an array includes providing the aforementioned kit, and a plurality of nucleic acid sequences, each encoding a unique test amino acid sequence and an excision site. The method further includes removing each of the plurality of nucleic acid sequences from the excision site and inserting it into the entry site of the vector nucleic acid to thereby generate a test nucleic acid sequence encoding a test polypeptide comprising the test amino acid sequence and the tag; and disposing each of the plurality of test nucleic acid sequences at an address of the array.
Another featured kit includes: an array comprising a substrate having a plurality of addresses, wherein each address of the plurality comprises a handle, and a nucleic acid sequence encoding an amino acid sequence comprising: (a) a test amino acid sequence, and (b) a tag. The kit can optionally further include at least one of: a translation effector and a transcription effector.
The nucleic acid can be an RNA, or a DNA (e.g., a single-stranded DNA, or a double-stranded DNA). In a preferred embodiment, the nucleic acid includes a plasmid DNA or a fragment thereof; an amplification product (e.g., a product generated by RCA, PCR, NASBA); or a synthetic DNA.
The nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed "open reading frames"), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like), and luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
In a preferred embodiment, each test amino acid sequence in the plurality of addresses is unique. For example, a test amino acid sequence can differ from all other test amino acid sequences of the plurality by one or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, has about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the test amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other test amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the nucleic acid at each address of the plurality is the same, or substantially identical to all other affinity tags in the plurality of addresses. In another preferred embodiment, the nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
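The uniqueness condition above can be checked computationally. The following is a minimal sketch under a stated assumption: differences are counted position by position, with any difference in length counted as additional differences. An alignment-based count would be an equally valid choice; the function names are hypothetical.

```python
from itertools import combinations


def count_differences(a, b):
    """Count amino acid differences between two sequences.

    Assumption for this sketch: positions are compared in order without
    alignment, and each extra residue in the longer sequence counts as one
    difference.
    """
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


def all_unique(sequences, min_differences=1):
    """Return True if every pair of sequences differs by at least min_differences."""
    return all(
        count_differences(a, b) >= min_differences
        for a, b in combinations(sequences, 2)
    )
```

Raising `min_differences` corresponds to requiring that each sequence differ from all others by, e.g., 2, 4, or 8 amino acids.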
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
The nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted, and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
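The selection criterion for identifiers can be sketched as a simple screen. This example assumes DNA identifiers written with the letters A, C, G, and T, treats "complementary" as an exact reverse-complement match, and treats "any region" as an exact substring match; partial complementarity, which may matter in practice, is not modeled. All function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def identifier_is_acceptable(candidate, existing_identifiers, array_sequences):
    """Reject a candidate that is identical or exactly complementary to another
    identifier, or to any region of a nucleic acid sequence on the array."""
    cand = candidate.upper()
    rc = reverse_complement(cand)
    for other in existing_identifiers:
        if other.upper() in (cand, rc):
            return False
    for seq in array_sequences:
        s = seq.upper()
        if cand in s or rc in s:
            return False
    return True
```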
The nucleic acid sequence can further include a sequence encoding a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The nucleic acids encoding the test amino acid sequences can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The encoding nucleic acids can be nucleic acids (e.g., an mRNA or cDNA) expressed in a tissue, e.g., a normal or diseased tissue. The test polypeptides (i.e., test amino acid sequences) can be mutants or variants of a scaffold protein (e.g., an antibody, zinc-finger, polypeptide hormone, etc.). In yet another embodiment, the test polypeptides are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches). The plurality of test amino acid sequences can include a plurality from a first source, and a plurality from a second source. For example, the test amino acid sequences on half the addresses of an array are from a diseased tissue or a first species, whereas the sequences on the remaining half are from a normal tissue or a second species.
In a preferred embodiment, each address of the plurality further includes one or more second nucleic acids, e.g., a plurality of unique nucleic acids. Hence, the plurality in toto can encode a plurality of test sequences. For example, each address of the plurality can encode a pool of test polypeptide sequences, e.g., a subset of a library or clone bank. A second array can be provided in which each address of the plurality of the second array includes a single or subset of members of the pool present at an address of the first array. The first and the second array can be used consecutively.
In other preferred embodiments, each address of the plurality further includes a second nucleic acid encoding a second amino acid sequence.
In one preferred embodiment, each address of the plurality includes a first test amino acid sequence that is common to all addresses of the plurality, and a second test amino acid sequence that is unique among all the addresses of the plurality. For example, the second test amino acid sequences can be query sequences whereas the first test amino acid sequence can be a target sequence. In another preferred embodiment, each address of the plurality includes a first test amino acid sequence that is unique among all the addresses of the plurality, and a second test amino acid sequence that is common to all addresses of the plurality. For example, the first test amino acid sequences can be query sequences whereas the second test amino acid sequence can be a target sequence. The second nucleic acid encoding the second test amino acid sequence can include a sequence encoding a recognition tag and/or an affinity tag.
At one or more addresses of the plurality, the first and second amino acid sequences can be such that they interact with one another. In one preferred embodiment, they are capable of binding to each other. The second test amino acid sequence is optionally fused to a detectable amino acid sequence, e.g., an epitope tag, an enzyme, a fluorescent protein (e.g., GFP, BFP, variants thereof). The second test amino acid sequence can be itself detectable (e.g., an antibody is available which specifically recognizes it). In another preferred embodiment, one is capable of modifying the other (e.g., making or breaking a bond, preferably a covalent bond, of the other). For example, the first amino acid sequence is a kinase capable of phosphorylating the second amino acid sequence; the first is a methylase capable of methylating the second; the first is a ubiquitin ligase capable of ubiquitinating the second; the first is a protease capable of cleaving the second; and so forth.
Kits of these embodiments can be used to identify an interaction or to identify a compound that modulates, e.g., inhibits or enhances, an interaction.
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate).
In yet another embodiment, an insoluble substrate (e.g., a bead or particle), is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
The kit can further include a database, e.g., in computer memory or a computer readable medium (e.g., a CD-ROM, a magnetic disc, or flash memory). Each record of the database can include a field for the amino acid sequence encoded by the nucleic acid sequence and a descriptor or reference for the physical location of the nucleic acid sequence on the array. Optionally, the record also includes a field representing a result (e.g., a qualitative or quantitative result) of detecting the polypeptide encoded by the nucleic acid sequence. The database can include a record for each address of the plurality present on the array. The records can be clustered or have a reference to other records (e.g., including hierarchical groupings) based on the result.
The kit can also include instructions for use of the array or a link or indication of a network resource (e.g., a web site) having instructions for use of the array or the above database of records describing the addresses of the array.
In another aspect, the invention features a method of providing an array across a network, e.g., a computer network, or a telecommunications network. The method includes: providing a substrate comprising a plurality of addresses, each address of the plurality having a binding agent; providing a plurality of nucleic acid sequences, each nucleic acid sequence comprising a sequence encoding a test amino acid sequence and an affinity tag that is recognized by the binding agent; providing on a server a list of either (i) nucleic acid sequences of the plurality or (ii) subsets of the plurality (e.g., categorized groups of sequences); transmitting the list across a network to a user; receiving at least one selection from the list from the user; disposing the one or more nucleic acid sequences corresponding to the selection on an address of the plurality; and providing the substrate to the user.
In one embodiment, each nucleic acid sequence is disposed at a unique address. For example, if a subset is selected, each nucleic acid sequence of the subset is disposed at a unique address. In another embodiment, a plurality of nucleic acid sequences are disposed at each address.
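The step of turning a user's selection into a layout with one nucleic acid sequence per unique address can be sketched as below. This is a hedged illustration only: the catalog structure, the row-major assignment order, and the function name are assumptions introduced for the example, and the network transmission itself is not modeled.

```python
def assign_addresses(catalog, selection, n_rows, n_cols):
    """Map each selected nucleic acid sequence to a unique (row, column) address.

    catalog   -- dict mapping a list entry (a single sequence or a named
                 subset) to the sequence identifiers it contains (hypothetical)
    selection -- list entries chosen by the user
    """
    chosen = []
    for name in selection:
        for seq_id in catalog[name]:
            if seq_id not in chosen:      # a sequence shared by two subsets is placed once
                chosen.append(seq_id)
    if len(chosen) > n_rows * n_cols:
        raise ValueError("selection exceeds the number of addresses on the substrate")
    # Row-major fill: address i lands at row i // n_cols, column i % n_cols.
    return {(i // n_cols, i % n_cols): seq_id for i, seq_id in enumerate(chosen)}
```

The embodiment in which a plurality of sequences is disposed at each address would instead map each address to a list of identifiers.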
The method can further include contacting each address of the plurality with one or more of (i) a transcription effector, and (ii) a translation effector. Optionally, the substrate is maintained under conditions permissive for the amino acid sequence to bind the binding agent. One or more addresses can then be washed, e.g., to remove at least one of (i) the nucleic acid, (ii) the transcription effector, (iii) the translation effector, and/or (iv) an unwanted polypeptide, e.g., an unbound polypeptide or unfolded polypeptide. The array can optionally be contacted with a compound, e.g., a chaperone; a protease; a protein-modifying enzyme; a small molecule, e.g., a small organic compound (e.g., of molecular weight less than 5000, 3000, 1000, 700, 500, or 300 Daltons); nucleic acids; or other complex macromolecules e.g., complex sugars, lipids, or matrix molecules.
The array can be further processed, e.g., prepared for storage. It can be enclosed in a package, e.g., an air- or water-resistant package. The array can be desiccated, frozen, or contacted with a storage agent (e.g., a cryoprotectant, an anti-bacterial, an anti-fungal). For example, an array can be rapidly frozen after being optionally contacted with a cryoprotectant. This step can be done at any point in the process (e.g., before or after contacting the array with an RNA polymerase; before or after contacting the array with a translation effector; or before or after washing the array). The packaged product can be supplied to a user with or without additional contents, e.g., a transcription effector, a translation effector, a vector nucleic acid, an antibody, and so forth.
In a preferred embodiment, each test amino acid sequence in the plurality of addresses is unique. For example, a test amino acid sequence can differ from all other test amino acid sequences of the plurality by one or more amino acid differences (e.g., about 2, 3, 4, 5, 8, 16, 32, 64 or more differences; and, by way of example, has about 800, 256, 128, 64, 32, 16, 8, 4, or fewer differences). In another preferred embodiment, the test amino acid sequence encoded by the nucleic acid at each address of the plurality is identical to all other test amino acid sequences in the plurality of addresses. In a preferred embodiment, the affinity tag encoded by the nucleic acid at each address of the plurality is the same, or substantially identical to all other affinity tags in the plurality of addresses. In another preferred embodiment, the nucleic acid at each address of the plurality encodes more than one affinity tag. In yet another preferred embodiment, the affinity tag encoded by the nucleic acid at an address of the plurality differs from at least one other affinity tag in the plurality of addresses.
In a preferred embodiment, the affinity tag is fused directly to the test amino acid sequence, e.g., directly amino-terminal, or directly carboxy-terminal. In another preferred embodiment, the affinity tag is separated from the test amino acid by one or more linker amino acids, e.g., 1, 2, 3, 4, 5, 6, 8, 10, 12, 20, 30 or more amino acids, preferably about 1 to 20, or about 3 to 12 amino acids. The linker amino acids can include a cleavage site, flexible amino acids (e.g., glycine, alanine, or serine, preferably glycine), and/or polar amino acids. The linker and affinity tag can be amino-terminal or carboxy-terminal to the test amino acid sequence.
The nucleic acid can be an RNA, or a DNA (e.g., a single-stranded DNA, or a double-stranded DNA). In a preferred embodiment, the nucleic acid includes a plasmid DNA or a fragment thereof; an amplification product (e.g., a product generated by RCA, PCR, NASBA); or a synthetic DNA.
The nucleic acid can further include one or more of: a transcription promoter; a transcription regulatory sequence; an untranslated leader sequence; a sequence encoding a cleavage site; a recombination site; a 3′ untranslated sequence; a transcriptional terminator; and an internal ribosome entry site. In one embodiment, the nucleic acid sequence includes a plurality of cistrons (also termed "open reading frames"), e.g., the sequence is dicistronic or polycistronic. In another embodiment, the nucleic acid also includes a sequence encoding a reporter protein, e.g., a protein whose abundance can be quantitated and can provide an indication of the quantity of test polypeptide fixed to the plate. The reporter protein can be attached to the test polypeptide, e.g., covalently attached, e.g., attached as a translational fusion. The reporter protein can be an enzyme, e.g., β-galactosidase, chloramphenicol acetyl transferase, β-glucuronidase, and so forth. The reporter protein can produce or modulate light, e.g., a fluorescent protein (e.g., green fluorescent protein, variants thereof, red fluorescent protein, variants thereof, and the like), and luciferase.
The transcription promoter can be a prokaryotic promoter, a eukaryotic promoter, or a viral promoter. In a preferred embodiment, the promoter is the T7 RNA polymerase promoter. The regulatory components, e.g., the transcription promoter, can vary among nucleic acids at different addresses of the plurality. For example, different promoters can be used to vary the amount of polypeptide produced at different addresses.
In one embodiment, the nucleic acid also includes at least one site for recombination, e.g., homologous recombination or site-specific recombination, e.g., a lambda att site or variant thereof; a lox site; or a FLP site. In a preferred embodiment, the recombination site lacks stop codons in the reading frame of a nucleic acid encoding a test amino acid sequence. In another preferred embodiment, the recombination site includes a stop codon in the reading frame of a nucleic acid encoding a test amino acid sequence.
In another embodiment, the nucleic acid includes a sequence encoding a cleavage site, e.g., a protease site, e.g., a site cleaved by a site-specific protease (e.g., a thrombin site, an enterokinase site, a PreScission site, a factor Xa site, or a TEV site), or a chemical cleavage site (e.g., a methionine, preferably a unique methionine (cleavage by cyanogen bromide) or a proline (cleavage by formic acid)).
The nucleic acid can include a sequence encoding a second polypeptide tag in addition to the affinity tag. The second tag can be C-terminal to the test amino acid sequence and the affinity tag can be N-terminal to the test amino acid sequence; the second tag can be N-terminal to the test amino acid sequence, and the affinity tag can be C-terminal to the test amino acid sequence; the second tag and the affinity tag can be adjacent to one another, or separated by a linker sequence, both being N-terminal or C-terminal to the test amino acid sequence. In one embodiment, the second tag is an additional affinity tag, e.g., the same or different from the first tag. In another embodiment, the second tag is a recognition tag. For example, the recognition tag can report the presence and/or amount of test polypeptide at an address. Preferably the recognition tag has a sequence other than the sequence of the affinity tag. In still another embodiment, a plurality of polypeptide tags (e.g., less than 3, 4, 5, about 10, or about 20 tags) are encoded in addition to the first affinity tag. Each polypeptide tag of the plurality can be the same as or different from the first affinity tag.
The nucleic acid sequence can further include an identifier sequence, e.g., a non-coding nucleic acid sequence, e.g., one that is synthetically inserted, and allows for uniquely identifying the nucleic acid sequence. The identifier sequence can be sufficient in length to uniquely identify each sequence in the plurality; e.g., it is about 5 to 500, 10 to 100, 10 to 50, or about 10 to 30 nucleotides in length. The identifier can be selected so that it is not complementary or identical to another identifier or any region of each nucleic acid sequence of the plurality on the array.
The test amino acid sequence can further include a protein splicing sequence or intein. The intein can be inserted in the middle of a test amino acid sequence. The intein can be a naturally-occurring intein or a mutated intein.
The nucleic acid sequences of the plurality can be obtained from a collection of full-length expressed genes (e.g., a repository of clones), a cDNA library, or a genomic library. The test amino acid sequences can be encoded by genes expressed in a tissue, e.g., a normal or diseased tissue. The test polypeptides can be mutants or variants of a scaffold protein (e.g., an antibody, zinc-finger, polypeptide hormone, etc.). In yet another embodiment, the test polypeptides are random amino acid sequences, patterned amino acid sequences, or designed amino acid sequences (e.g., sequences designed by manual, rational, or computer-aided approaches). The plurality of test amino acid sequences can include a plurality from a first source, and a plurality from a second source. For example, the server can be provided with lists of test amino acid sequences associated with a diseased tissue or a first species in addition to lists of test amino acid sequences associated with a normal tissue or a second species.
The binding agent can be attached to the substrate. For example, the substrate can be derivatized and the binding agent covalently attached thereto. The binding agent can be attached via a bridging moiety, e.g., a specific binding pair (e.g., the substrate contains a first member of a specific binding pair, and the binding agent is linked to the second member of the binding pair, the second member being attached to the substrate).
In yet another embodiment, an insoluble substrate (e.g., a bead or particle), is disposed at each address of the plurality, and the binding agent is attached to the insoluble substrate. The insoluble substrate can further contain information encoding its identity, e.g., a reference to the address on which it is disposed. The insoluble substrate can be tagged using a chemical tag, or an electronic tag (e.g., a transponder). The insoluble substrate can be disposed such that it can be removed for later analysis.
The invention also features a computer system including (i) a server storing a list of amino acid sequences and/or their descriptors, and (ii) software configured to: (1) send a list of amino acid sequences and/or their descriptors to a client; (2) receive from the client a plurality of selected amino acid sequences from the list; and (3) interface with an array provider (e.g., a robotic system, or a technician) so as to dispose on a substrate nucleic acids encoding the selected amino acid sequences, each at a plurality of addresses.
The invention also features a method of identifying a small molecule or drug binding protein. Such proteins can include drug targets and adventitious drug-binding proteins (e.g., non-target proteins responsible for toxicity of a drug). The method includes providing or obtaining an array described herein, contacting each address of the plurality with a drug, e.g., a labeled drug. The method can further include detecting the presence of the drug at each address of the plurality. The method can also include a wash step, e.g., prior to the detecting.
The term "array," as used herein, refers to an apparatus with a plurality of addresses.
A "nucleic acid programmable polypeptide array" or "NAPPA" refers to an array described herein. The term encompasses such an array at any stage of production, e.g., before any nucleic acid or polypeptide is present; when nucleic acid is disposed on the array, but no polypeptide is present; when a nucleic acid has been removed and a polypeptide is present; and so forth.
The term "address," as referred to herein, is a positionally distinct portion of a substrate. Thus, a reagent at a first address can be positionally distinguished from a reagent at a second address. The address is located in and/or on the substrate. The address can be distinguished by two coordinates (e.g., x-y) in embodiments using two-dimensional arrays, or by three coordinates (e.g., x-y-z) in embodiments using three-dimensional arrays.
The term "substrate," as used herein in the context of arrays (as opposed to a substrate of an enzyme), refers to a composition in or on which a nucleic acid or polypeptide is disposed. The substrate may be discontinuous. An illustrative case of a discontinuous substrate is a set of gel pads separated by a partition.
The terms "test amino acid sequence" and "test polypeptide," as used herein, refer to a polypeptide of at least three amino acids that is translated on the array. The test amino acid sequence may or may not vary among the addresses of the array.
The term "translation effector" refers to a macromolecule capable of decoding a messenger RNA and forming peptide bonds between amino acids. The term encompasses ribosomes, and catalytic RNAs with the aforementioned property. A translation effector can optionally further include tRNAs, tRNA synthetases, elongation factors, initiation factors, and termination factors. An example of a translation effector is a translation extract obtained from a cell.
As used herein, the term "transcription effector" refers to a composition capable of synthesizing RNA from an RNA or DNA template, e.g., an RNA polymerase.
The term "recognizes," as used herein, refers to the ability of a first agent to bind to a second agent. Preferably, the dissociation constant or apparent dissociation constant of binding is about 100 μM, 10 μM, 1 μM, 100 nM, 10 nM, 1 nM, 100 pM, 10 pM, or less.
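To give a sense of what these dissociation constants imply, the following sketch computes the fraction of a binding agent occupied at equilibrium. It assumes a simple one-to-one binding model with the free ligand concentration approximately equal to the total ligand concentration; this model is an illustrative assumption and is not stated in the description above.

```python
def fraction_bound(ligand_molar, kd_molar):
    """Equilibrium fraction of binding sites occupied for simple 1:1 binding.

    Assumes free ligand concentration is approximately the total ligand
    concentration, so that fraction = [L] / ([L] + Kd).
    """
    return ligand_molar / (ligand_molar + kd_molar)
```

Under this model, a ligand present at a concentration equal to the dissociation constant occupies half of the sites, and a lower dissociation constant gives higher occupancy at the same concentration.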
The term "affinity tag," as used herein, refers to an amino acid, a peptide sequence, or a polypeptide sequence that includes a moiety capable of recognizing or reacting with a binding agent.
The term "binding agent," as used herein, refers to a moiety, either a biological polymer (e.g., polypeptide, polysaccharide, or nucleic acid) or another chemical compound, which is capable of recognizing or binding an affinity tag or which is capable of specifically reacting with an affinity tag, e.g., to form a covalent bond. The term "handle" is used synonymously with binding agent.
The term "recognition tag," as used herein, refers to an amino acid, a peptide sequence, or a polypeptide sequence that can be detected, directly or indirectly, on the array.
As used herein, the terms "peptide," "polypeptide," and "protein" are used interchangeably. Generally, these terms refer to polymers of amino acids which are at least three amino acids in length.
A "unique reagent" refers to a reagent that differs from a reagent at each other address in a plurality of addresses. The reagent can differ from the reagents at other addresses in terms of one or both of: structure and function. A unique reagent can be a molecule, e.g., a biological macromolecule (e.g., a nucleic acid, a polypeptide, or a carbohydrate), a cell, or a small organic compound. In the case of biological polymers, a structural difference can be a difference in sequence at one or more positions. In addition, a structural difference, e.g., for polymers having the same sequence, can be a difference in conformation (e.g., due to allosteric modification; meta-stable folding; alternative native folded states; prion or prion-like properties) or a modification (e.g., covalent and non-covalent modifications (e.g., a bound ligand)).
Protein microarrays representing many different proteins, as described herein, provide a potent high-throughput tool which can greatly accelerate the study of protein function. The arrays described herein avoid the process of expressing proteins in living cells, purifying, stabilizing, and spotting them. NAPPA arrays, as described herein, also reduce the number of manipulations for each polypeptide, as the polypeptide can be synthesized in situ in or on the array substrate. The current invention obviates the need to purify polypeptides and to manipulate purified protein samples onto the array by the straightforward and much simpler process of disposing nucleic acids. The nucleic acids are then simultaneously transcribed/translated in a cell-free system and immobilized in situ, minimizing direct manipulation of the proteins and making this approach well suited to high-throughput applications. Further, the cotranslation of a first and second polypeptide can enhance complex formation in some cases.
In addition, the protein folding environment in cell-free systems differs from the natural environment, allowing a user to control a variety of parameters such as post-translational modifications.
The array can be easily reprogrammed to contain different sets of proteins and polypeptides.
Polypeptide arrays provide comprehensive genome-wide screens for biomolecular interactions. The arrays, as described herein, allow for the sampling of an entire library. Detecting each address of a plurality provides the certainty that each library member has been screened. Thus, complete coverage of known sequences is possible. For example, a single array containing 10,000 arrayed elements can be sufficient to yield 10,000 results (e.g., quantitative results), each result comparable with the results of other elements of the array, and potentially with a result from other arrays. High-density arrays further expand possible coverage.
Some embodiments described herein also provide arrays and methods for detecting subtle and sensitive results. As a polypeptide species, e.g., a homogenous species, can be provided at an address without competing species, a result for the individual species can be detected. In other embodiments, arrays and methods can also include competing species for the very purpose of removing subtle results and increasing the signal of strong positives.
In sum, the arrays and methods described herein provide a versatile new platform for proteomics.