The molecular blueprint for a living eukaryotic organism is stored in double-stranded deoxyribose-nucleic acid ("DNA") molecules within the nucleus of each cell of the organism. Each double-stranded DNA molecule comprises a large number of templates, called genes, each of which specifies the composition of a protein molecule, together with a large number of regulatory regions and additional regions for which no function has yet been identified. Protein molecules are synthesized from the gene templates in a two-step process. In the first step, called transcription, the gene is copied to produce a molecule of messenger ribose-nucleic acid ("RNA"). In the second step, called translation, a protein molecule is synthesized according to the information contained in the messenger RNA molecule. The regulatory regions of a double-stranded DNA molecule act as switches, brakes, and accelerators for controlling the transcription of genes into messenger RNA molecules, thereby controlling the rate of synthesis of the various proteins specified by the genes. Proteins serve as catalysts for the myriad chemical reactions that occur within living organisms, as well as structural and mechanical elements from which living organisms are formed. Thus, the regulation of protein formation via the regulatory regions of double-stranded DNA molecules controls the development, structure, and dynamic composition of living cells.
Both proteins and DNA molecules are long linear polymers synthesized from a relatively small number of component molecules, or subunits. FIG. 1 shows the twenty amino acid subunits from which protein molecules are commonly synthesized. Each amino acid subunit has an α-carboxyl group (e.g., the α-carboxyl group 101 of the amino acid lysine 103), an α-amino group (e.g., the α-amino group 105 of the amino acid lysine 103), and a side chain (e.g., the γ-amino propyl side chain 107 of the amino acid lysine 103), all attached to an α-carbon atom (e.g., the α-carbon 109 of the amino acid lysine 103). FIG. 2 shows a small polypeptide polymer built from four amino acids. The polypeptide polymer 200 has a free α-amino group 202 at the N-terminal end 204 of the polypeptide polymer 200 and a free α-carboxyl group 206 at the C-terminal end 208 of the polypeptide polymer 200. The polypeptide polymer 200 is composed from the following amino acids: (1) alanine 210; (2) tyrosine 212; (3) aspartic acid 214; and (4) glycine 216. A protein comprises one or more polypeptide polymers, similar to the polypeptide polymer 200 shown in FIG. 2, each generally comprising tens to hundreds of amino acid subunits.
The amino acid subunits within a protein are normally designated by either three-letter symbols or by one-letter symbols. Table 1, below, lists both the three-letter symbols and the one-letter symbols corresponding to each of the amino acids:
______________________________________________________
                              Three-Letter  One-Letter
Amino Acid                    Symbol        Symbol
______________________________________________________
Alanine                       Ala           A
Arginine                      Arg           R
Asparagine                    Asn           N
Aspartic acid                 Asp           D
Asparagine or aspartic acid   Asx           B
Cysteine                      Cys           C
Glutamic acid                 Glu           E
Glutamine                     Gln           Q
Glutamine or glutamic acid    Glx           Z
Glycine                       Gly           G
Histidine                     His           H
Isoleucine                    Ile           I
Leucine                       Leu           L
Lysine                        Lys           K
Methionine                    Met           M
Phenylalanine                 Phe           F
Proline                       Pro           P
Serine                        Ser           S
Threonine                     Thr           T
Tryptophan                    Trp           W
Tyrosine                      Tyr           Y
Valine                        Val           V
______________________________________________________
A protein can be chemically described by writing its amino acid subunit sequence using either the three-letter symbols or the one-letter symbols, listed in Table 1, for the amino acids of the protein, starting from the N-terminal amino acid on the left side and ending with the C-terminal amino acid on the right side. For example, the polypeptide polymer displayed in FIG. 2 can be described either as "Ala-Tyr-Asp-Gly" or "AYDG." Although a protein can be conceptualized as a linear sequence of amino acids, the protein molecule in solution normally folds into a complex and specific three-dimensional shape. FIG. 3 shows a representation of the three-dimensional shape of a relatively small, common protein.
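The two notations are interchangeable, which can be illustrated with a short Python sketch. The dictionary below is transcribed from Table 1; the function names `to_one_letter` and `to_three_letter` are illustrative only and do not appear in the described technique.

```python
# Three-letter to one-letter symbols, transcribed from Table 1.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Asx": "B",
    "Cys": "C", "Glu": "E", "Gln": "Q", "Glx": "Z", "Gly": "G",
    "His": "H", "Ile": "I", "Leu": "L", "Lys": "K", "Met": "M",
    "Phe": "F", "Pro": "P", "Ser": "S", "Thr": "T", "Trp": "W",
    "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(sequence):
    """Convert a hyphen-separated three-letter sequence, written from the
    N-terminal end to the C-terminal end, into one-letter notation.
    Symbols are matched case-insensitively ('ALA' and 'Ala' both work)."""
    return "".join(THREE_TO_ONE[s.capitalize()] for s in sequence.split("-"))


def to_three_letter(sequence):
    """Convert a one-letter sequence into hyphen-separated three-letter notation."""
    return "-".join(ONE_TO_THREE[c] for c in sequence.upper())


if __name__ == "__main__":
    # The four-residue polypeptide of FIG. 2.
    print(to_one_letter("Ala-Tyr-Asp-Gly"))
    print(to_three_letter("AYDG"))
```

Both directions preserve the N-terminal-to-C-terminal ordering, so no reversal is needed when converting between notations.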
DNA molecules, like proteins, are linear polymers. DNA molecules are synthesized from only four different types of subunit molecules: (1) deoxy-adenosine, abbreviated "A"; (2) deoxy-thymidine, abbreviated "T"; (3) deoxy-cytidine, abbreviated "C"; and (4) deoxy-guanosine, abbreviated "G." FIG. 4 illustrates a short DNA polymer 400, called an oligomer, composed of the following subunits: (1) deoxy-adenosine 402; (2) deoxy-thymidine 404; (3) deoxy-cytidine 406; and (4) deoxy-guanosine 408. When phosphorylated, these subunits of the DNA molecule are called nucleotides, and are linked together through phosphodiester bonds 410-415 to form the DNA polymer. The DNA molecule has a 5' end 418 and a 3' end 420. A DNA polymer can be chemically characterized by writing, in sequence from the 5' end to the 3' end, the single-letter abbreviations for the nucleotide subunits that together compose the DNA polymer. For example, the oligomer 400 shown in FIG. 4 can be chemically represented as "ATCG." A nucleotide comprises a purine or pyrimidine base (e.g., adenine 422 of the deoxy-adenylate nucleotide 402), a deoxy-ribose sugar (e.g., ribose 424 of the deoxy-adenylate nucleotide 402), and a phosphate group (e.g., phosphate 426) that links the nucleotide to the next nucleotide in the DNA polymer.
The DNA polymers that contain the organizational information for living organisms occur in the nuclei of cells in pairs, called double-stranded DNA helixes. One polymer of the pair is laid out in a 5' to 3' direction, and the other polymer of the pair is laid out in a 3' to 5' direction. The two DNA polymers in the double-stranded DNA helix are therefore described as being anti-parallel. The two DNA polymers, or strands, within a double-stranded DNA helix are bound to each other through hydrogen bonds. Because of a number of chemical and topographic constraints, a deoxy-adenylate subunit of one strand must hydrogen bond to a deoxy-thymidylate subunit of the other strand, and a deoxy-guanylate subunit of one strand must hydrogen bond to a deoxy-cytidylate subunit of the other strand.
FIG. 5 illustrates the hydrogen bonding that joins two anti-parallel DNA strands. The first strand 502 occurs in the 5' to 3' direction and contains a deoxy-adenylate subunit 504 and a deoxy-guanylate subunit 506. The second, anti-parallel strand 508 contains a deoxy-thymidylate subunit 510 and a deoxy-cytidylate subunit 512. The deoxy-adenylate subunit 504 is joined to the deoxy-thymidylate subunit 510 through hydrogen bonds 514 and 516. The deoxy-guanylate subunit 506 is joined to the deoxy-cytidylate subunit 512 through hydrogen bonds 518-522.
The two DNA strands linked together by hydrogen bonds form the familiar helix structure of the double-stranded DNA helix. FIG. 6A illustrates a short section of a DNA double helix 600 comprising a first strand 602 and a second, anti-parallel strand 604. A deoxy-guanylate subunit in one strand 606 is always paired with a deoxy-cytidylate subunit 608 in the other strand, and a deoxy-thymidylate subunit in one strand 610 is always paired with a deoxy-adenylate subunit in the other strand 612. FIG. 6B shows a representation of the two strands illustrated in FIG. 6A using the single-letter designations for the nucleotide subunits. The first strand 614 (602 in FIG. 6A) is written in the familiar 5' to 3' direction, and the second strand 616 (604 in FIG. 6A) is written in the 3' to 5' direction in order to clearly show the subunit pairings between the two strands. These pairings are called base pairs because the hydrogen bonding occurs between the purine and pyrimidine bases of the nucleotide subunits. Nucleotide subunits are often referred to as bases. There is a "C" (e.g., 618) in the second strand directly opposite from each "G" (e.g., 620) in the first strand, an "A" (e.g., 622) in the second strand directly opposite from each "T" (e.g., 624) in the first strand, a "T" (e.g., 626) in the second strand directly opposite from each "A" (e.g., 628) in the first strand, and a "G" (e.g., 630) in the second strand directly opposite from each "C" (e.g., 632) in the first strand. Thus, knowing the sequence for the first strand, one can immediately determine and write down the sequence for the second strand. DNA base-pair sequences are always written in the 5' to 3' direction. The second strand 634 is shown properly written in the 5' to 3' direction as the last sequence in FIG. 6B. When written in this fashion, the second strand is said to be the reverse complement of the first strand. 
Thus, the "G" 636 on the left or 5' end of the second strand 634 is paired in the DNA double helix 600 with the "C" 638 at the right or 3' end of the first strand 614.
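The pairing rules and the 5' to 3' writing convention described above can be sketched in a few lines of Python. This is a minimal illustration, assuming sequences are given as upper-case strings over the letters A, T, C, and G; the names `complement` and `reverse_complement` are chosen for this example.

```python
# Base-pairing rules: A pairs with T, and G pairs with C.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the paired strand read position by position, that is, written
    3' to 5' when the input is written 5' to 3'."""
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand):
    """Return the paired strand written in the conventional 5' to 3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    first = "ATCG"                    # the oligomer of FIG. 4, written 5' to 3'
    print(complement(first))          # second strand, written 3' to 5'
    print(reverse_complement(first))  # second strand, written 5' to 3'
```

Applying `reverse_complement` twice returns the original sequence, which reflects the fact that each strand of a duplex is the reverse complement of the other.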
As described above, the synthesis of proteins from gene templates is controlled through regulatory regions of DNA molecules. A large number of different types of DNA-binding proteins bind to these regulatory regions of DNA molecules and, by so doing, initiate, promote, inhibit, or prevent the transcription of one or more specific genes. FIG. 7A illustrates the binding of a dimeric, or two-polymer, DNA-binding protein 702 to a specific regulatory region 704 of a double-stranded DNA helix 706. In general, a number of amino acid subunits of a DNA-binding protein hydrogen bond to nucleotide subunits of the DNA molecule to effect the binding of the DNA-binding protein to the DNA double helix. FIG. 7B illustrates two hydrogen bonds 708 and 710 between an amino acid subunit 712 of a DNA-binding protein 714 and a nucleotide subunit 716 of a DNA double helix 718 viewed down the central axis of the DNA double helix.
FIG. 8 illustrates the spatial relationship between a gene and various regulatory regions of a DNA double helix that control transcription of the gene. The gene 802 is generally preceded by a promoter region 804 where various molecular components 806 are assembled in order to catalyze the synthesis of messenger RNA from the gene template. In addition, various regulatory DNA-binding proteins or assemblies of regulatory DNA-binding proteins 808-810 specifically bind to a number of regulatory regions of the DNA double helix 811-813 that are located at various distances along the DNA double helix from the gene 802. In general, the regulatory proteins may either increase the rate of gene transcription or decrease the rate of gene transcription, thus controlling the concentration of the protein specified by the gene within the cell. Each type of regulatory DNA-binding protein recognizes and binds to a specific sequence, or pattern, of base pairs within the regulatory region. These sequences, called binding sites, are generally less than twenty nucleotides in length.
The molecular state of a cell and of an entire living organism largely depends on the regulation of gene transcription by thousands of different regulatory DNA-binding proteins. Only one or several molecules of each different type of regulatory protein may occur in a cell at any given time. A cell thus contains a very complex mixture of regulatory DNA-binding proteins, and each regulatory DNA-binding protein may occur in the mixture at extremely small concentrations. Aberrations in the structures of certain regulatory DNA-binding proteins, or in the concentrations of certain regulatory DNA-binding proteins within cell nuclei, may underlie many different diseases and disorders, including developmental problems, inherited genetic disorders, and cancers. It is therefore a goal of biological sciences and of the biotechnology industry to identify and characterize the many different types of regulatory DNA-binding proteins.
There are a number of different approaches to identifying regulatory DNA-binding proteins. One such approach is called the multiplex selection technique, or "MuST™." The MuST technique is described in the following patent applications, which are hereby incorporated by reference in their entirety: U.S. patent application Ser. No. 08/590,571, filed Jan. 24, 1996, PCT application Serial No. PCT/US97101230, filed Jan. 24, 1997, and U.S. application Ser. No. 08/906,691 filed Aug. 6, 1997. In this method, a very large number of relatively short oligonucleotide DNA duplexes having random sequences are prepared and mixed together with a sample that contains various DNA-binding proteins. The random-sequence oligonucleotide duplexes generally have lengths of between eight and twelve base pairs. After the random-sequence oligonucleotide duplexes are mixed with the DNA-binding proteins, the DNA-binding proteins bind to specific oligonucleotide duplexes that contain base-pair sequences that the DNA-binding proteins recognize; or, in other words, a particular type of DNA-binding protein binds to those oligonucleotide duplexes that contain base-pair sequences identical or similar to the base-pair sequence of the binding site within the regulatory region of the DNA double helix controlled by that DNA-binding protein. Various biochemical separation techniques are employed to separate the DNA-binding proteins bound to the oligonucleotide duplexes from unbound proteins, unbound oligonucleotide duplexes, and other molecules within the mixture. The bound DNA-binding protein/oligonucleotide duplex pairs are then separated, the separated oligonucleotide duplexes are amplified by the polymerase chain reaction ("PCR") technique and, finally, the two strands of the oligonucleotide duplexes are separated and identified by sequence analysis.
The result of the analysis is a list of nucleotide sequences of single strands of the oligonucleotide duplexes that were bound by DNA-binding proteins in the mixture.
DNA-binding proteins have varying specificities for base-pair sequences. Each different type of DNA-binding protein generally recognizes and binds to a particular binding site within a particular regulatory region of a DNA double helix. The binding site comprises a specific sequence of base pairs within the DNA double helix. However, a particular DNA-binding protein may recognize and bind to any number of sequences similar to the sequence of the binding site which the DNA-binding protein normally recognizes and to which the DNA-binding protein binds. Base-pair sequence analysis is conducted on single strands of DNA rather than on DNA duplexes. A DNA-binding site for a particular DNA-binding protein will therefore be characterized, following an analysis of oligonucleotide sequences produced by the MuST technique, by a set of similar sequences corresponding to one strand of the duplex regions bound by the DNA-binding protein and by a set of similar sequences corresponding to the other strand of the duplex regions bound by the DNA-binding protein. Because the two sets of sequences are related by reverse complementation, the original two sets are merged into a single set of sequences by applying reverse complementation to the sequences in one of the original two sets. Because the oligonucleotide duplexes employed in the MuST technique are randomly generated, the first base pair of the sequence recognized by a DNA-binding protein may not correspond to the first base pair of the oligonucleotide duplex, but may occur at many different positions within the oligonucleotide duplex. Generally, a DNA-binding protein may bind to some minimum number of base pairs that compose a sub-sequence of the sequence of the binding-site.
Because the MuST oligonucleotide sequences are random, a particular binding site for a particular DNA-binding protein will be characterized within the set of sequences produced by the MuST technique by a set of oligonucleotide sequences that contain sub-sequences identical or similar to sub-sequences of the binding site sequence greater than or equal in length to some minimum number of nucleotides.
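The following Python sketch illustrates, under simplifying assumptions, how a single oligonucleotide sequence might be tested for a partial copy of a known binding-site sequence on either strand and at any offset. It considers exact sub-sequence matches only and ignores substitutions; the function names and the `min_len` parameter are hypothetical and are not taken from the MuST technique itself.

```python
def reverse_complement(strand):
    """Return the opposite strand written 5' to 3'."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(strand))


def longest_shared_run(oligo, site):
    """Return (length, oligo_start, site_start) for the longest run of
    consecutive nucleotides that occurs in both sequences (brute force)."""
    best = (0, 0, 0)
    for i in range(len(oligo)):
        for j in range(len(site)):
            k = 0
            while (i + k < len(oligo) and j + k < len(site)
                   and oligo[i + k] == site[j + k]):
                k += 1
            if k > best[0]:
                best = (k, i, j)
    return best


def match_site(oligo, site, min_len):
    """Search the oligo as given ('+') and its reverse complement ('-') for a
    sub-sequence of the site at least min_len nucleotides long.
    Returns (strand, length, oligo_start, site_start), or None if no
    sufficiently long shared run exists. Offsets refer to the strand searched."""
    candidates = []
    for strand, seq in (("+", oligo), ("-", reverse_complement(oligo))):
        length, i, j = longest_shared_run(seq, site)
        candidates.append((length, strand, i, j))
    length, strand, i, j = max(candidates, key=lambda c: c[0])
    if length < min_len:
        return None
    return (strand, length, i, j)


if __name__ == "__main__":
    site = "GTTTACC"  # example binding-site sequence
    print(match_site("AAGTTTACCTT", site, min_len=5))  # full site, given strand
    print(match_site("AAGGTAAACTT", site, min_len=5))  # site on opposite strand
    print(match_site("CCAAGTTTA", site, min_len=5))    # site truncated at the end
```

Reporting the strand together with the two offsets captures the information needed to merge sequences from both strands into a single, consistently oriented set.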
FIG. 9 illustrates the characterization of various clusters representing potential DNA-binding sites from a set of sequences produced by the MuST technique. A set of 21 sequences 902 represents the oligonucleotide sequences identified by the MuST technique. As commonly applied to cell extracts containing DNA-binding proteins, the MuST technique may produce a set of many thousands of sequences. FIG. 9 is intended to illustrate the general concept of MuST sequence analysis rather than provide an actual example.
Examination of the set of MuST sequences 902 does not immediately reveal a pattern of related sequences. However, as a result of an exhaustive comparison of each sequence in the set of sequences 902 to the other sequences in the set of sequences 902 by shifting the sequences relative to one another, and identifying common sub-sequences, five clusters of related sequences 904-908 can be identified. Each sequence of the first cluster of sequences 904 contains a common seven-base-pair sub-sequence "GTTTACC" or some very similar variation of that sub-sequence. These common sub-sequences within each of the sequences of the first cluster 904 are indicated by box 906. Note that the common sub-sequence occurs towards the end of sequence 13 (908 in FIG. 9) in which the final two nucleotides of the common sub-sequence are missing. It should also be noted that, in some sequences, one or more nucleotides of the common sub-sequence have been substituted with another. For example, sequence 18 (910 in FIG. 9) contains an initial "C" 912 rather than a "G." Sequence 18 (910 in FIG. 9) is shifted three positions to the right relative to sequence 17 (914 in FIG. 9) and is shifted four places to the right relative to sequence 19 (916 in FIG. 9) in order that the common sub-sequence of sequence 18 aligns with the common sub-sequences of sequences 17 and 19. Sequence 22 (918 in FIG. 9) in the original set of sequences 902 does not initially appear to have a portion in common with any of the other sequences. However, the reverse complement of sequence 22 (918 in FIG. 9) is identical with sequence 1 (920 in FIG. 9) and is therefore included, along with sequence 1, in the first cluster 904. The lines between the sequences in the set of MuST sequences 902 and the sequences within clusters 904-908 (e.g., line 924) show a mapping from the original MuST sequences to the five clusters. 
It is this mapping between oligonucleotide sequences and clusters, including the alignments and reverse complementation required to match the common sub-sequences within the sequences of a cluster, that is the goal of the computational technique of the described embodiment of the present invention.
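To make the idea of such a mapping concrete, the Python sketch below shifts pairs of sequences relative to one another, counts identical positions in the overlap, tries the reverse complement as well, and then groups sequences greedily. It is a simplified illustration and not the computational technique of the described embodiment: the greedy seed-based grouping, the function names, the thresholds `min_overlap` and `min_matches`, and the example sequences are all assumptions made for this example.

```python
def reverse_complement(seq):
    """Return the opposite strand written 5' to 3'."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(seq))


def best_shift(a, b, min_overlap=5):
    """Slide b along a without gaps. Return (matches, shift), where shift is
    the index in a at which b[0] is placed (negative if b starts before a)
    and matches is the number of identical positions in the overlap."""
    best = (0, 0)
    for shift in range(-(len(b) - min_overlap), len(a) - min_overlap + 1):
        lo = max(0, shift)
        hi = min(len(a), shift + len(b))
        matches = sum(1 for i in range(lo, hi) if a[i] == b[i - shift])
        if matches > best[0]:
            best = (matches, shift)
    return best


def align(a, b, min_overlap=5):
    """Align b to a in both orientations.
    Returns (matches, shift, reversed_flag) for the better orientation."""
    fwd = best_shift(a, b, min_overlap)
    rev = best_shift(a, reverse_complement(b), min_overlap)
    if rev[0] > fwd[0]:
        return rev[0], rev[1], True
    return fwd[0], fwd[1], False


def cluster(seqs, min_matches=6):
    """Greedy grouping: each sequence joins the first cluster whose seed (its
    first member) it matches in at least min_matches positions; otherwise it
    seeds a new cluster. Each member is recorded as (index, shift, reversed)."""
    clusters = []
    for idx, seq in enumerate(seqs):
        for members in clusters:
            seed = seqs[members[0][0]]
            matches, shift, rev = align(seed, seq)
            if matches >= min_matches:
                members.append((idx, shift, rev))
                break
        else:
            clusters.append([(idx, 0, False)])
    return clusters


if __name__ == "__main__":
    # Made-up sequences for illustration only; these are not taken from FIG. 9.
    seqs = ["AAGTTTACCTT", "CGTTTACCAGA", "AAGGTAAACTT", "GGCGCGGCGCC"]
    for members in cluster(seqs):
        print(members)
```

Each cluster member carries the shift and orientation that bring it into register with the seed, which corresponds to the alignment and reverse complementation information described above. A greedy pass is order-dependent; an exhaustive all-against-all comparison would avoid that at a higher computational cost.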
Each of the clusters 904-908 that are identified from the original set of MuST sequences 902 represents a potential DNA-protein binding site. The number of sequences within a cluster may be related to the concentration in the original cell extract mixture of the DNA-binding protein that recognizes the common sequence within that cluster. Clusters with one or a few sequences, such as cluster 2 (905 in FIG. 9) and cluster 4 (907 in FIG. 9), may represent a binding site to which an extremely rare or low-concentration regulatory DNA-binding protein binds, or may possibly represent an artifact arising from experimental methodologies.
Once a binding site has been identified by analysis of the MuST sequences, that binding site can be compared to databases of known binding sites to determine whether the binding site has been previously characterized. The DNA-binding proteins that bind to a particular binding site can be purified from complex mixtures by various biochemical techniques. The sequence of amino acids that together compose the one or more polymers of the DNA-binding protein can be determined from the purified protein by biochemical protein sequence analysis techniques. Once the sequence for a DNA-binding protein has been determined, that sequence can be compared to databases of known protein sequences or can serve as the basis for the identification of the gene or genes within an organism's DNA molecules that serve as a template for the synthesis of that DNA-binding protein. These various characterizations of the DNA-binding protein may eventually lead to the identification of diseases associated with aberrations in the structure of the protein or in the control of the expression of the gene that is the template for the DNA-binding protein. These various characterizations may also lead to various ameliorative therapies that can be employed to treat such diseases.