The invention relates to nucleic acid protein arrays.
Compact arrays or libraries of surface-bound, double-stranded oligonucleotides are of use in rapid, high-throughput screening of proteins to identify those that bind, or otherwise interact with, short, double-stranded DNA sequence motifs. Of particular interest are trans-regulatory factors that control gene transcription. Ideally, such an oligonucleotide array is bound to the surface of a solid support matrix that is of a size that enables laboratory manipulations, e.g. an incubation of a candidate protein with the nucleic acid sequences thereon, and that is itself inert to chemical interactions with experimental proteins, buffers and/or other components. In addition, it is desirable that the absolute number of unique nucleic acid sequences in the array be maximized, since methods of high-throughput screening are used in the attempt to minimize repetition of steps that are labor-intensive or otherwise costly.
A high-density, double-stranded DNA array complexed to a solid matrix is described by Lockhart (U.S. Pat. No.: 5,556,752); however, the DNA molecules therein disclosed are produced as unimolecular products of chemical synthesis. As synthesized, each member of the array contains regions of self-complementarity separated by a spacer (i.e. a single-strand loop), such that these regions hybridize to each other in order to produce a double-helical region. Further, it is required that those regions of complementary nucleic acid sequences that must hybridize in order to form the double-helical structure are physically attached to each other by a linker subunit.
The invention provides a synthetic array of surface-bound, bimolecular, double-stranded nucleic acid molecules, the array comprising a solid support and a plurality of bimolecular double-stranded nucleic acid molecule members, a member comprising a first nucleic acid strand linked to the solid support and a second nucleic acid strand which is substantially complementary to the first strand and complexed to the first strand by Watson-Crick base pairing, wherein for at least a portion of the members, each member comprises a recognition site within a nucleic acid sequence for a protein, wherein a recognition site within a nucleic acid sequence for a protein of a first member is different from a recognition site within a nucleic acid sequence for a protein of a second member and wherein a protein is bound to a member thereof.
The term “synthetic”, as used herein, is defined as that which is produced by in vitro chemical or enzymatic synthesis. The synthetic arrays of the present invention may be contrasted with natural nucleic acid molecules such as viral or plasmid vectors, for instance, which may be propagated in bacterial, yeast, or other living hosts.
As used herein, the term “nucleic acid” is defined to encompass DNA and RNA of both synthetic and natural origin. The nucleic acid may exist as single- or double-stranded DNA or RNA, an RNA/DNA heteroduplex or an RNA/DNA copolymer, wherein the term “copolymer” refers to a single nucleic acid strand that comprises both ribonucleotides and deoxyribonucleotides.
As used herein, the term “bimolecular” refers to the fact that the 5′ end of the first strand and 3′ end of the second strand are not linked via a covalent bond, and thus do not form a continuous single strand. As used herein in this context, “covalent bond” is defined as meaning a bond that forms, directly or via a spacer comprising nucleic acid or another material, a continuous strand that comprises the 5′ end of the first strand and the 3′ end of the second strand, and thus includes a 3′/5′ phosphate bond as occurs naturally in a single-stranded nucleic acid. This definition does not encompass intermolecular crosslinking of the first and second strands.
When used herein in this context, the term “double-stranded” refers to a pair of nucleic acid molecules, as defined above, that exist in a hydrogen-bonded, helical array typically associated with DNA. Included under this umbrella term are those paired oligonucleotides that are essentially double-stranded, meaning those that contain short regions of mismatch, such as a mono-, di- or tri-nucleotide, resulting from design or error either in chemical synthesis of the oligonucleotide priming site on the first nucleic acid strand or in enzymatic synthesis of the second nucleic acid strand. It is contemplated that at least a portion of the members of the array have a second nucleic acid strand which is substantially complementary to, and base-paired with, the first strand along the entire length of the first strand.
As used herein, the terms “complementary” and “substantially complementary” refer to the hybridization or base pairing between nucleotides or nucleic acids, such as, for instance, between the two strands of a double-stranded DNA molecule or between an oligonucleotide primer and a primer binding site on a single-stranded nucleic acid to be sequenced or amplified. Complementary nucleotides are, generally, A and T (or A and U), or C and G. Typically, sequences which are complementary will hybridize to each other under stringent conditions. Stringent hybridization conditions will typically include salt concentrations of less than about 1 M, more usually less than about 500 mM, and preferably less than about 200 mM. Alternatively, stringent hybridization conditions typically include at least 10% formamide, preferably 20% and more preferably 40%. Hybridization temperatures can be as low as 5° C., but are typically greater than 22° C., more typically greater than about 30° C., and preferably in excess of about 37° C. Longer fragments may require higher hybridization temperatures for specific hybridization, while those that are rich in dA and dT may require lower temperatures. Two single-stranded RNA or DNA molecules are said to be substantially complementary when the nucleotides of one strand, optimally aligned and compared and with appropriate nucleotide insertions or deletions, pair with at least about 80% of the nucleotides of the other strand, usually at least about 90% to 95%, and more preferably from about 98 to 100%. Sequences that are substantially complementary may hybridize under stringent conditions; however, it is usually necessary to raise the concentration of salt, or lower the concentration of formamide or the hybridization temperature.
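The 80% pairing criterion for substantial complementarity can be illustrated with a minimal sketch. The function names are hypothetical, and the sketch assumes two strands of equal length aligned antiparallel without gaps; the insertions and deletions permitted by the definition above are not modeled.

```python
# Illustrative sketch only: names are hypothetical and the alignment is
# ungapped, which is a simplifying assumption not made by the definition.
PAIRS = {("A", "T"), ("T", "A"), ("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}


def percent_paired(first, second):
    """Return the percentage of nucleotides in `first` that pair with `second`.

    Both strands are written 5'->3'. They are aligned antiparallel, so the
    nucleotide first[i] faces the nucleotide second[-1 - i].
    """
    if not first or len(first) != len(second):
        raise ValueError("strands must be non-empty and of equal length")
    facing = zip(first.upper(), reversed(second.upper()))
    paired = sum((a, b) in PAIRS for a, b in facing)
    return 100.0 * paired / len(first)


def is_substantially_complementary(first, second, threshold=80.0):
    """Apply the 'at least about 80%' criterion to an ungapped alignment."""
    return percent_paired(first, second) >= threshold
```

Raising `threshold` to 90, 95 or 98 corresponds to the more stringent ranges recited in the definition.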
As used herein in reference to nucleic acid members of an array, the term “portion” refers to at least two members of an array. Preferably, a portion refers to a number of individual members of an array, such as at least 60%, 80%, 90% and 95-100% of such members.
As used herein, the terms “recognition site for a protein” and “recognition site within a nucleic acid sequence for a protein” refer to a nucleic acid sequence which is recognized and/or bound by a protein.
As used herein with regard to recognition sites within a nucleic acid sequence for a protein, the term “different” refers to two or more nucleic acid sequences which are recognized and/or bound by a protein or proteins, which recognition sites within a nucleic acid sequence for a protein differ in the identity of at least one nucleotide.
As used herein, the term “array” is defined to mean a heterogeneous pool of nucleic acid molecules that is affixed to a solid support in a spatially-ordered manner, such as a Cartesian distribution (in other words, arranged at defined points along the x- and y-axes of a grid, or at specific ‘clock positions’ within, or degrees or radii from the center of, a radial pattern) of nucleic acid molecules over the support, that permits identification of individual features during the course of experimental manipulation.
As used herein, the term “feature” refers to each nucleic acid sequence occupying a discrete physical location on the array; if a given sequence is represented at more than one such site, each site is classified as a feature. A feature comprises one or a plurality of individual, double-stranded, bimolecular nucleic acid molecule members; within a given feature, every such member represents the same sequence.
According to the invention, the array may have virtually any number of different features. In preferred embodiments, the array comprises from 2 up to 100 features, more preferably from 100 up to 10,000 features and highly preferably from 10,000 up to 1,000,000 features, preferably on a solid support. In preferred embodiments, the array will have a density of more than 100 features at known locations per cm², preferably more than 1,000 per cm², more preferably more than 10,000 per cm².
According to the methods disclosed herein, a “solid support” (or, simply, “support”) is defined as a material having a rigid or semi-rigid surface to which nucleic acid molecules may be attached or upon which they may be synthesized.
It is contemplated that attached to the solid support is a spacer. The spacer molecule is preferably of sufficient length to permit the double-stranded oligonucleotide in the completed member of the array to interact freely with molecules exposed to the array. The spacer molecule, which may comprise as little as a covalent bond length, is typically 6-50 atoms long to provide sufficient exposure for the attached double-stranded DNA molecule. The spacer is comprised of a surface attaching portion and a longer chain portion.
It is preferred that the 3′ end of the first strand is linked to the support.
It is additionally preferred that the 5′ end of the first strand and the 3′ end of the second strand are not linked via a covalent bond.
Preferably, the 5′ end of the second strand is not linked to the support.
It is preferred that the recognition site within a nucleic acid sequence for a protein is selected from the group that includes naturally-occurring recognition sites within a nucleic acid sequence for a protein or proteins, synthetic variants of naturally-occurring recognition sites within a nucleic acid sequence for a protein or proteins and randomized nucleic acid sequences.
As used herein in reference to recognition sites within a nucleic acid sequence for a protein or proteins, the term “naturally-occurring” refers to such sequences isolated from an organism, wherein those sequences are native to that species or strain of organism and are not the products of genetic engineering, e.g. synthetic sequences, whether transiently transfected or stably incorporated into the genome of a transgenic or transiently-transfected organism or one or more of its ancestor organisms.
As used herein, the term “allelic variant” refers to a naturally-occurring nucleic acid sequence which is present in a subset of individuals (2-98%) of a population. Such a sequence may function properly (e.g. be recognized by the correct protein) or may be poorly- or non-functional. The term “poorly-functional” refers to a recognition site within a nucleic acid sequence for a protein which, for example, has lowered affinity for its corresponding protein or is recognized and bound by the wrong protein. In this context, a “non-functional” recognition site within a nucleic acid sequence for a protein would be expected to bind background levels of (essentially no) protein. Unless found in a majority of individuals in a population, the sequence of an allelic variant differs in at least one position relative to that of a consensus sequence, as defined below.
As used herein, the term “mutant variant” refers to a naturally-occurring nucleic acid sequence which occurs at a low frequency (less than 2%) in a population. As is true of an allelic variant, a mutant variant may function properly, poorly or not at all.
As used herein, the term “synthetic variant” refers to a nucleic acid sequence in which the identity of at least one nucleotide has been altered in vitro, such that it represents no naturally-occurring variant of the sequence upon which it is based. A synthetic variant may function properly, poorly or not at all.
As used herein with regard to individual nucleic acid sequences, the term “randomized” refers to in vitro-synthesized sequences in which any nucleotide or ribonucleotide can be present at one, more than one or all positions; therefore, for such positions as are randomized, the sequence of the finished molecule is not predetermined, but is left to chance.
As used herein with regard to an array of the invention, the term “randomized” refers to an array which is constructed such that, for a sequence of a recognition site within a nucleic acid sequence for a protein of a selected length (e.g. a hexamer), each possible nucleotide combination is comprised by a corresponding feature thereof. In order to realize a complete set of such nucleotide sequence permutations, it is necessary to specify fully the sequence of each feature during synthesis of the array; therefore, while such an array may be referred to as an “array of randomized 6-mers”, the design of the array is entirely non-random.
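The point that a “randomized” array is in fact fully specified can be made concrete by enumerating every hexamer and assigning each to a grid position. The sketch below is illustrative only: the function name, the four-letter DNA alphabet and the 64-column grid are assumptions, not details of the invention.

```python
from itertools import product


def randomized_array_layout(site_length=6, columns=64, alphabet="ACGT"):
    """Map each (row, column) grid position to one fully specified sequence.

    Every possible sequence of `site_length` over `alphabet` appears exactly
    once, so the design contains no element of chance.
    """
    layout = {}
    for index, bases in enumerate(product(alphabet, repeat=site_length)):
        # divmod converts the running index into (row, column) coordinates.
        layout[divmod(index, columns)] = "".join(bases)
    return layout


layout = randomized_array_layout()
print(len(layout))  # 4 ** 6 = 4096 features for an array of all hexamers
```

With these assumed parameters the 4,096 hexamer features fill a 64 x 64 grid, which falls within the preferred range of 100 up to 10,000 features.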
One or more recognition sites within a nucleic acid sequence for a protein or proteins may be present in a given member nucleic acid of an array, wherein “one or more” refers to one, two, three, four, five and even up to 10-20 sites.
In a preferred embodiment, the recognition site within a nucleic acid sequence for a protein comprises two half-sites, wherein either is recognized by a different protein than is the other.
As used herein, the term “half-site” refers to a nucleic acid sequence which is recognized and bound by a targeting amino acid sequence present on one protein subunit of a dimeric protein complex. Neither subunit of the dimeric protein complex will bind its cognate half-site alone (i.e., unless dimerized to the other); therefore, either both half-sites are occupied by protein, or neither is. Both half sites of a recognition site within a nucleic acid sequence for a protein may be identical, whether arranged head-to-tail or as a palindrome (head-to-head or tail-to-tail); if in the latter configuration, the sequence of a recognition site within a nucleic acid sequence for a protein is said to have “dyad symmetry”. Typically, a recognition site within a nucleic acid sequence for a protein bound by a protein homodimer comprises two identical half-sites. Alternatively, the two half-sites comprised by a recognition site within a nucleic acid sequence for a protein may be unlike in sequence; it is usually true that dissimilar half-sites are bound by different targeting amino acid sequences, as would be found on the two subunits of a protein heterodimer. Depending on their orientation relative to one another, recognition sites within a nucleic acid sequence for a protein comprising non-identical, but similar, half-sites may also be said to have dyad symmetry.
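The half-site arrangements described above can be distinguished with a short sketch. It rests on several assumptions made for illustration: the site is DNA, has an even length, contains no spacer between the half-sites, and only exact matches are tested, so the case of non-identical but similar half-sites is not detected. All names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def classify_half_sites(site):
    """Classify a recognition site by the relationship of its two halves."""
    if len(site) % 2:
        raise ValueError("this sketch assumes an even-length site with no spacer")
    half = len(site) // 2
    left, right = site[:half].upper(), site[half:].upper()
    if right == reverse_complement(left):
        # The site reads the same on both strands: a palindrome.
        return "dyad symmetry"
    if right == left:
        # The same half-site repeated in the same orientation.
        return "head-to-tail repeat"
    return "dissimilar half-sites"
```

For example, `classify_half_sites("GAATTC")` reports dyad symmetry, because the half-site GAA is followed by its reverse complement TTC.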
As used herein, the term “targeting amino acid sequence” refers to an amino acid sequence present on a protein which sequence recognizes a recognition site within a nucleic acid sequence for a protein on a nucleic acid molecule. A protein may comprise one or a plurality (two or more) of targeting amino acid sequences and bind one or a plurality of different recognition sites within a nucleic acid sequence for a protein or proteins. A given targeting amino acid sequence may recognize and bind one recognition site within a nucleic acid sequence for a protein or different recognition sites within a nucleic acid sequence for a protein or proteins on a nucleic acid molecule. “Different targeting amino acid sequences”, herein defined as those which differ by at least one amino acid, may recognize and bind the same recognition site within a nucleic acid sequence for a protein or proteins, different recognition sites within a nucleic acid sequence or sequences for a protein or proteins, or two partially-overlapping sets of different recognition sites within a nucleic acid sequence for a protein or proteins on a nucleic acid molecule.
It is contemplated that different targeting amino acid sequences, as defined above, may exist on a single polypeptide molecule; typically, however, different targeting amino acid sequences are found on different polypeptide molecules that are of use in the invention. If a polypeptide should possess two or more targeting amino acid sequences, and these targeting amino acid sequences differ in the sequence of at least one amino acid (whether or not they differ in binding-site specificity), that single polypeptide molecule comprises more than one different protein, as defined herein.
The term “half-site” is not applicable to a recognition site within a nucleic acid sequence for a protein (whether in whole or in part) which is recognized by a protein that binds nucleic acids alone, rather than in a di- or multimeric complex, regardless of the presence of any internal symmetry or repetition of sequence in such a recognition site within a nucleic acid sequence for a protein.
As used herein, the term “different protein” refers to two or more proteins which differ in the identity of at least one amino acid within a targeting amino acid sequence.
It is contemplated that different recognition sites within a nucleic acid sequence for a protein on a nucleic acid molecule or molecules may be recognized and bound by the same targeting amino acid sequence, by different targeting amino acid sequences, or by two partially-overlapping sets of different targeting amino acid sequences of a protein or proteins.
It is preferred that the protein which is bound to a member thereof comprises a detectable label.
Preferably, the protein is a chimeric protein.
As used herein, the term “chimeric” refers to a protein which comprises fused sequences of two or more polypeptides that are, themselves, different in amino acid sequence and are typically encoded by different genes. The term “different genes” may refer to allelic or mutant variants of a gene present at a single genetic locus; preferably, it refers to two or more genes which are found at a corresponding number of genetic loci, and which may be selected from one or more individual organisms or species of organism. A chimeric protein may be advantageously produced by the in-frame fusion and subsequent expression of nucleic acid sequences encoding the component amino acid sequences. Such amino acid sequences may each comprise an entire protein; alternatively, one or more sequences comprised by a chimeric protein may be a fragment of a protein. Typically, each segment is sufficient in scope to retain its native biological activity (e.g. a targeting amino acid sequence which binds a recognition site within a nucleic acid sequence for a protein on a nucleic acid molecule in the context of its native protein will do so in the context of the chimera).
It is contemplated that a chimeric (or “fusion”) protein according to the invention comprises a protein which binds a recognition site within a nucleic acid sequence for a protein, fused to a second protein component comprising any one of a receptor, an enzyme, a candidate enzyme domain such as a kinase or a protease domain, a candidate protein:protein dimerization domain, a candidate ligand binding domain, or a substrate for a protein-directed enzymatic reaction. In this context, a “protein” is either a whole protein or a protein fragment which retains its ability to recognize, and bind specifically to, a recognition site within a nucleic acid sequence for a protein on a nucleic acid molecule to which site the native, whole protein binds.
As used herein, the term “domain” is a portion of a protein molecule which is sufficient for the performance of a given function, whether in the presence or absence of other sequences of the protein. It is contemplated that a domain is encoded by an uninterrupted amino acid sequence, such that it may be physically cleaved whole away from other amino acid sequence elements and such that it will fold properly without the influence of neighboring sequences.
It is preferred that the chimeric protein comprises a DNA-binding domain fused in-frame with a protein:protein dimerization domain.
As used herein with regard to protein domains, the term “DNA-binding” refers to a function of the domain, which is to bind to a recognition site within a nucleic acid sequence for a protein on a DNA molecule.
In another preferred embodiment, the chimeric protein comprises a DNA-binding domain fused in-frame to Green Fluorescent Protein.
Preferably, the solid support is a silica support.
It is preferred that the first strand is produced by chemical synthesis and the second strand is produced by enzymatic synthesis.
Preferably, the first strand is used as the template on which the second strand is enzymatically produced.
It is preferred that the first strand of each member contains at its 3′ end a binding site for an oligonucleotide primer which is used to prime enzymatic synthesis of the second strand, and at its 5′ end a variable sequence.
The term “oligonucleotide primer”, as used herein, refers to a single-stranded DNA or RNA molecule that is hybridized to a nucleic acid template to prime enzymatic synthesis of a second nucleic acid strand.
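The relationship between the first strand, the primer and the enzymatically synthesized second strand can be sketched as follows. This is an illustration under stated assumptions, not the method of the invention: both strands are taken to be DNA, extension is taken to run to the end of the template, and the function names and example sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def build_member(variable_sequence, primer_binding_site):
    """Model one array member; all sequences are written 5'->3'.

    The first strand carries the variable sequence at its 5' end and the
    primer binding site at its 3' end, the end linked to the support.
    """
    first_strand = variable_sequence.upper() + primer_binding_site.upper()
    # The primer anneals to the binding site, so it is its reverse complement.
    primer = reverse_complement(primer_binding_site)
    # Extension from the 3' end of the primer copies the variable sequence.
    second_strand = primer + reverse_complement(variable_sequence)
    return first_strand, primer, second_strand


# Hypothetical 16-nucleotide primer binding site and 6-nucleotide variable site.
first, primer, second = build_member("GAATTC", "GCATCGGATCCAGCTA")
```

Because one primer binding site can be shared by every member, a single primer suffices to convert a whole array of first strands to double-stranded form, while the variable sequence distinguishes one feature from another.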
Preferably, enzymatic synthesis is performed using an enzyme.
In a preferred embodiment, the oligonucleotide primer is between 10 and 30 nucleotides in length.
It is preferred that the first strand comprises DNA.
It is additionally preferred that the second strand comprises DNA.
Preferably, the first and second strands each comprise from 16 to 60 monomers selected from the group that includes ribonucleotides and deoxyribonucleotides.
Use of the term “monomer” is made to indicate any of the set of molecules which can be joined together to form an oligomer or polymer. The set of monomers useful in the present invention includes, but is not restricted to, for the example of oligonucleotide synthesis, the set of nucleotides consisting of adenine, thymine, cytosine, guanine, and uridine (A, T, C, G, and U, respectively) and synthetic analogs thereof. As used herein, “monomer” refers to any member of a basis set for synthesis of an oligomer. Different basis sets of monomers may be used at successive steps in the synthesis of a polymer.
Preferably, at least a portion of the plurality have a second nucleic acid strand that is substantially complementary to, and base-paired with, the first strand along the entire length of the first strand.
As used herein in reference to a plurality of nucleic acid members of an array, the term “portion” refers to at least two members of an array. Preferably, a portion refers to a number of individual members of an array, such as at least 60%, 80%, 90% and 95-100% of such members.
Another aspect of the present invention is a method for the construction of a synthetic array of surface-bound, bimolecular, double-stranded nucleic acid molecules, comprising the steps of providing an array of first nucleic acid strands linked to a solid support, hybridizing to the first strands an oligonucleotide primer that is substantially complementary to a sequence comprised by a first strand, performing enzymatic synthesis of a second nucleic acid strand that is complementary to a first strand so as to permit Watson-Crick base pairing and so as to form an array comprising a plurality of bimolecular, double-stranded nucleic acid molecule members, wherein for at least a portion of the members, each member comprises a recognition site within a nucleic acid sequence for a protein and wherein a recognition site within a nucleic acid sequence for a protein of a first member is different from a recognition site within a nucleic acid sequence for a protein of a second member, and incubating the array with a protein sample comprising a protein under conditions that permit specific binding of the protein to a member of the array, such that a protein becomes bound to a recognition site within a nucleic acid sequence for a protein on a member to form a nucleic acid protein array.
Preferably, the 3′ end of the first strand is linked to the support.
It is preferred that the 5′ end of the first strand and the 3′ end of the second strand are not linked via a covalent bond.
It is additionally preferred that the 5′ end of the second strand is not linked to the solid support.
Preferably, the recognition site within a nucleic acid sequence for a protein is selected from the group that includes naturally-occurring recognition sites within a nucleic acid sequence for a protein or proteins, synthetic variants of naturally-occurring recognition sites within a nucleic acid sequence for a protein or proteins and randomized nucleic acid sequences.
Preferably, the recognition site within a nucleic acid sequence for a protein comprises two half-sites, wherein either is recognized by a different protein than is the other.
It is preferred that the protein which is bound to a member of the array comprises a detectable label.
It is also preferred that the protein is a chimeric protein.
In a particularly preferred embodiment, the chimeric protein comprises a DNA-binding domain fused in-frame with a protein:protein dimerization domain.
It is also particularly preferred that the chimeric protein comprises a DNA-binding domain fused in-frame to Green Fluorescent Protein.
Preferably, the solid support is a silica support.
It is preferred that the first strand of each member contains at its 3′ end a binding site for an oligonucleotide primer which is used to prime enzymatic synthesis of the second strand, and at its 5′ end a variable sequence, wherein the binding site is present in each member of the array.
Preferably, enzymatic synthesis is performed using an enzyme.
In a preferred embodiment, the oligonucleotide primer is between 10 and 30 nucleotides in length.
It is preferred that the first strand comprises DNA.
It is additionally preferred that the second strand comprises DNA.
Preferably, the first and second strands each comprise from 16 to 60 monomers selected from the group that includes ribonucleotides and deoxyribonucleotides.
In a highly preferred embodiment, the solid support is a silica support and the first and second strands each comprise from 16 to 60 monomers selected from the group that includes ribonucleotides and deoxyribonucleotides.
Preferably, the protein sample comprises a candidate inhibitor of binding of the protein to a recognition site within a nucleic acid sequence for a protein on a member of the array.
It is preferred that the protein sample comprises a candidate inhibitor of binding of the protein to a second protein.
The invention also encompasses a method of determining a consensus nucleic acid sequence for a recognition site within a nucleic acid sequence in a nucleic acid molecule for a protein comprising the steps of providing a nucleic acid protein array comprising a solid support and a plurality of bimolecular double-stranded nucleic acid molecule members, a member comprising a first nucleic acid strand linked to the solid support and a second nucleic acid strand which is substantially complementary to the first strand and complexed to the first strand by Watson-Crick base pairing, wherein for at least a portion of the members, each member comprises a recognition site within a nucleic acid sequence for a protein, wherein a recognition site within a nucleic acid sequence for a protein of a first member is different from a recognition site within a nucleic acid sequence for a protein of a second member and wherein a protein comprising a detectable label is bound to a member thereof, and performing a detection step to detect the presence of the label on a feature of the array, wherein nucleotides that are shared among the recognition sites within a nucleic acid sequence for a protein present on features on which the label is detected form a consensus nucleic acid sequence for a recognition site within a nucleic acid sequence for a protein specific for the protein.
As defined herein in reference to recognition sites within a nucleic acid sequence for a protein or proteins, the term “consensus” refers to a common nucleic acid sequence wherein the nucleotide at each position thereof represents that which is most frequently found in recognition sites within a nucleic acid sequence for a selected protein or group of proteins. A consensus sequence may be identical to a naturally-occurring recognition site within a nucleic acid sequence for a protein; alternatively, it may have a sequence which does not occur naturally in the genome of an organism.
As used herein, the term “shared” refers to a nucleotide or ribonucleotide which is present in all, or substantially all sequences compared, wherein substantial sharing is defined as the presence in 75% or more of said sequences of a given nucleotide or ribonucleotide at a specified position.
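The derivation of a consensus from shared nucleotides can be illustrated with a minimal sketch that applies the 75% criterion position by position. The function name, the requirement that all sites have equal length, and the use of “N” to mark positions with no shared nucleotide are assumptions made for illustration.

```python
from collections import Counter


def shared_consensus(sites, threshold=0.75, placeholder="N"):
    """Build a consensus from recognition sites found on labeled features.

    At each position the most frequent nucleotide is reported if it occurs
    in at least `threshold` of the sites; otherwise `placeholder` is used.
    """
    if not sites or len({len(site) for site in sites}) != 1:
        raise ValueError("sites must be non-empty and of equal length")
    consensus = []
    for column in zip(*(site.upper() for site in sites)):
        base, count = Counter(column).most_common(1)[0]
        shared = count / len(sites) >= threshold
        consensus.append(base if shared else placeholder)
    return "".join(consensus)
```

For four bound sites GAATTC, GACTTC, GAGTTC and GATTTC, five positions are shared by all sites and the third is not shared, so the sketch returns GANTTC.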
The invention additionally provides a method of identifying, for a first protein which binds a nucleic acid as half of a protein:protein heterodimer complex, one or a plurality of candidate second proteins with which it might dimerize and bind a nucleic acid molecule in vivo, comprising the steps of providing a nucleic acid array comprising a solid support and a plurality of bimolecular double-stranded nucleic acid molecule members, a member comprising a first nucleic acid strand linked to the solid support and a second nucleic acid strand which is substantially complementary to the first strand and complexed to the first strand by Watson-Crick base pairing, wherein for at least a portion of the members, each member comprises a recognition site within a nucleic acid sequence for a protein, wherein a recognition site within a nucleic acid sequence for a protein of a first member is different from a recognition site within a nucleic acid sequence for a protein of a second member, wherein a binding site comprises two half-sites and wherein either of the half-sites of a recognition site within a nucleic acid sequence for a protein is recognized by a different protein than is the other, incubating the array with a protein sample comprising a first protein which recognizes a first half-site of a recognition site within a nucleic acid sequence for a protein and one or a plurality of candidate second proteins under conditions which permit heterodimerization of a first and candidate second protein and binding of a protein:protein heterodimer to a recognition site within a nucleic acid sequence for a protein, recovering a protein:protein heterodimer complex from a member of the array under conditions whereby the first protein and candidate second protein dissociate from one another, and identifying the candidate second protein, wherein each candidate second protein so identified represents a protein with which the first protein may dimerize in vivo.
Preferably, identifying of the candidate second protein comprises sequencing thereof.
In another preferred embodiment, identifying of the candidate second protein comprises binding of the candidate second protein to an antibody which is specific therefor.
It is preferred that the first protein comprises a detectable label.
It is additionally preferred that the method further comprises the step of performing a detection step to detect the presence of the label on a feature of the array, wherein the recognition site within a nucleic acid sequence for a protein present on a feature upon which the label is detected represents a candidate recognition site within a nucleic acid sequence for a protein which the heterodimer may bind in vivo.
The invention also provides a method of identifying candidate members of a set of co-regulated genes, comprising the steps of providing a nucleic acid protein array comprising a solid support and a plurality of bimolecular double-stranded nucleic acid molecule members, a member comprising a first nucleic acid strand linked to the solid support and a second nucleic acid strand which is substantially complementary to the first strand and complexed to the first strand by Watson-Crick base pairing, wherein for at least a portion of the members, each member comprises a recognition site within a nucleic acid sequence for a protein, wherein a recognition site within a nucleic acid sequence for a protein of a first member is different from a recognition site within a nucleic acid sequence for a protein of a second member and wherein a protein comprising a detectable label is bound to a member thereof, and performing a detection step to detect the presence of the label on a feature of the array, wherein a gene having among its regulatory sequences one or more of the recognition sites within a nucleic acid sequence for a protein present on a feature on which the label is detected is characterized as a candidate member of a set of co-regulated genes that are regulated by the protein.
A "set of co-regulated genes" refers to a number of genes, in the range of about 2 to about 30 genes, that exhibit a given response (in terms of gene expression) to an external stimulus or a given response to a mutation in a specific gene. An example of the latter is where a mutation in the coding region of gene X results in a change in expression levels of genes A-Z. The term "co-regulated set of genes" additionally encompasses genes which are normally under the control of a common trans-regulatory factor, such as a protein. The upper limit on the number in a set of co-regulated genes (i.e., "positives" or up-regulated genes; or "negatives" or down-regulated genes) may be on the order of several thousand.
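By way of illustration only, the final characterization step of the method above (matching recognition sites found on labeled features against the regulatory sequences of genes) may be sketched as follows. The function names, the gene names and the example sequences are hypothetical and are not part of the disclosure; the sketch assumes exact matching of a site on either strand and does not model degenerate or partially matching motifs.

```python
# Illustrative sketch only: names and example sequences are hypothetical.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def candidate_coregulated_genes(labeled_sites, regulatory_regions):
    """Map each gene to the labeled recognition sites found in its regulatory sequence.

    labeled_sites: recognition-site sequences of array features on which
        the protein's label was detected.
    regulatory_regions: dict of gene name -> regulatory sequence.
    A site counts as present if it occurs on either strand (exact match).
    Genes with no matching site are omitted from the result.
    """
    candidates = {}
    for gene, region in regulatory_regions.items():
        region = region.upper()
        matched = sorted(
            site
            for site in labeled_sites
            if site.upper() in region or reverse_complement(site) in region
        )
        if matched:
            candidates[gene] = matched
    return candidates


if __name__ == "__main__":
    sites = {"TGACTCA", "GGGACTTTCC"}
    regions = {
        "geneA": "ccatTGACTCAgg",    # site present on the given strand
        "geneB": "AAAAAAAAAAAA",     # no site present
        "geneC": "ttGGAAAGTCCCaa",   # site present on the opposite strand
    }
    print(candidate_coregulated_genes(sites, regions))
```

Genes returned by such a comparison would be candidates only; membership in a set of co-regulated genes would remain to be confirmed experimentally.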
Another aspect of the present invention is a method of assaying a candidate inhibitor of protein:nucleic acid interactions, comprising the steps of providing a nucleic acid array comprising a solid support and a plurality of bimolecular double-stranded nucleic acid molecule members, a member comprising a first nucleic acid strand linked to the solid support and a second nucleic acid strand which is substantially complementary to the first strand and complexed to the first strand by Watson-Crick base pairing, wherein for at least a portion of the members, each member comprises a recognition site within a nucleic acid sequence for a protein, wherein a recognition site within a nucleic acid sequence for a protein of a first member is different from a recognition site within a nucleic acid sequence for a protein of a second member, incubating the array with a protein sample comprising a protein comprising a detectable label and a candidate inhibitor of binding of the protein to a recognition site within a nucleic acid sequence for a protein on a member of the array, under conditions which normally permit binding of the protein to that member, and performing a detection step to detect the presence of the label on the member, wherein the presence of the label on the member corresponds with binding of the protein to the member and wherein the negation of, or reduction in, binding of the protein to the member is indicative of efficacy of the candidate inhibitor of protein:nucleic acid interactions in inhibiting binding of the protein to the recognition site within a nucleic acid sequence for a protein.
Such protein:nucleic acid interactions include, but are not limited to, recognition of cis-regulatory elements by transcription factors, which may include receptors or polymerase subunits, binding of nucleic acid molecules by structural proteins, such as histones or cytoskeletal components, and recognition of a nucleic acid molecule by restriction or other endonucleases, exonucleases and nucleic acid modification enzymes (such as methylases, ligases, phosphatases, isomerases, transposases or other recombinases, glycosylases and kinases).
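By way of illustration only, a reduction in binding of a labeled protein to a member of the array may be expressed numerically as sketched below. The sketch assumes that label intensity on a feature is proportional to the amount of bound protein and that a reference incubation performed without the candidate inhibitor is available; neither the reference incubation nor the background correction is specified in the method itself, and the function and parameter names are hypothetical.

```python
# Illustrative sketch only: the no-inhibitor reference signal and the
# background correction are assumptions, and all names are hypothetical.
def percent_inhibition(signal_without_inhibitor, signal_with_inhibitor, background=0.0):
    """Return the percent reduction in label signal attributable to the inhibitor.

    100 means binding was abolished (signal at or below background);
    0 means no reduction relative to the reference incubation.
    """
    reference = signal_without_inhibitor - background
    if reference <= 0:
        raise ValueError("reference signal must exceed background")
    # Clamp at zero so a reading below background is treated as no binding.
    treated = max(signal_with_inhibitor - background, 0.0)
    return 100.0 * (1.0 - treated / reference)


if __name__ == "__main__":
    # Hypothetical intensities for one feature, in arbitrary units.
    print(percent_inhibition(1100.0, 350.0, background=100.0))
```

A value near 100 would correspond to negation of binding and an intermediate value to a reduction in binding; a negative value would indicate more signal in the presence of the candidate than in its absence.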
The final aspect of the present invention is a method of assaying a candidate inhibitor of a protein:protein interaction, comprising the steps of providing a nucleic acid array comprising a solid support and a plurality of bimolecular double-stranded nucleic acid molecule members, a member comprising a first nucleic acid strand linked to the solid support and a second nucleic acid strand which is substantially complementary to the first strand and complexed to the first strand by Watson-Crick base pairing, wherein for at least a portion of the members, each member comprises a recognition site within a nucleic acid sequence for a protein, wherein a recognition site within a nucleic acid sequence for a protein of a first member is different from a recognition site within a nucleic acid sequence for a protein of a second member, incubating the array with a protein sample comprising a first protein comprising a detectable label, wherein binding of the first protein to a recognition site within a nucleic acid sequence for a protein on a member of the array is dependent upon an interaction between the first protein and a second protein and wherein the protein sample further comprises the second protein and a candidate inhibitor of the interaction, under conditions which normally permit the interaction, and performing a detection step to detect the presence of the label on a member of the array, wherein the presence of the label on a member corresponds with binding of the first protein to that member and wherein the negation of, or reduction in, binding of the first protein to the member is indicative of efficacy of the candidate inhibitor in inhibiting the interaction between the first protein and the second protein.
Such protein:protein interactions include, but are not limited to, ligand/receptor interactions, enzyme/substrate interactions, interactions between subunits of a nucleic acid polymerase, and interactions between molecules of homo- or heterodimeric or -multimeric complexes.
The use of bimolecular, double-stranded nucleic acid arrays comprising recognition sites within a nucleic acid sequence for a protein or proteins, or of nucleic acid/protein arrays according to the invention, provides an improvement over prior art methods in that, while the first strand of the DNA duplex is chemically synthesized on the support matrix, the second strand is enzymatically produced using the first strand as a template. While the error rate in production of the first strand remains the same, increased fidelity of second strand synthesis is expected to result in a higher percentage of points on the matrix surface that are filled by hybridized DNA duplex molecules that can serve as targets for protein binding or other assays. In addition, oligonucleotide priming of second nucleic acid strand synthesis obviates the need for covalent linkage of complementary regions, with the effect of reducing extraneous sequence or non-nucleic acid material from the array, as well as eliminating steps of designing and synthesizing such a linker.
Further features and advantages of the invention will become more fully apparent in the following description of the embodiments and drawings thereof, and from the claims.