If the completion of the Human Genome Project (HGP) is perceived as the scientific landmark of 2003, the creation by Affymetrix Inc. of DNA microarrays containing a complete set of 50,000 cDNA probes for the entire human genome could be regarded as another milestone of the year (Pennisi, Science 302: 211, 2003). However, the current genomic sequence data from humans, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Escherichia coli still cannot include all the mutations and genetic divergence such as Single Nucleotide Polymorphisms (SNPs).
Additionally, this genome-wide probing system does not have the capacity to detect exogenous genes and their products. Some pathological processes such as infections often involve exogenous genetic factors. For drug targets and lead discovery and validation, for the clinical diagnosis and prognosis, for clinical treatment and therapy development, a standardized, universal DNA Array technology platform that has a full-range screening spectrum for all possible endogenous and exogenous genes seems more desirable.
The maintenance and replication of a genome-wide cDNA library demands quality controls, which can be time-consuming and add to the cost of production (Knight et al., Nature 414: 135-136, 2001). A cDNA library is a specialized library that may even be cell-type specific. Such characteristics limit its applications. Another drawback is the probability of contamination during production. Zacharewski's laboratory sequenced 1,189 cDNAs from a set of DNA microarray probes; only 62% of them definitely represented the correct sequences (Halgren et al., Nucleic Acids Res. 29: 582-588, 2001). Error rates of up to 30% in cDNA probes were also identified by three major DNA microarray centers (Knight, Nature 410: 860-861, 2001). Therefore, there is still a need to create new probe libraries that have a genuine genome-wide screening spectrum with more accuracy but low cost. Chemically synthesized oligonucleotides provide an alternative option. The process of chemical synthesis avoids problems from possible bacterial contamination and preserves the accuracy of designed sequences. Short oligonucleotides (9-30 mers) could be used as primers in Polymerase Chain Reaction (hereinafter PCR) whereas cDNA molecules generally could not, as most cDNA molecules are longer than 30 mers. For example, although expressed sequence tags (ESTs) have been widely used for gene discovery (Adams et al., Science 252: 1651-1656, 1991), they may not be usable as PCR primers directly. The length and GC content of ESTs are irregular. Thus, ESTs are unlikely to serve as standardized universal probes.
To address the above issues, the present invention proposes to construct a series of standardized universal probe libraries from all possible combinations of the 61 or 64 genetic codes (codons) according to a series of corresponding genetic algorithms. The inventive codon-based oligonucleotide probe libraries provide genuine genome-wide screening capability. They include all possible point mutations and SNPs within the designed probe sequences. They have the capacity of targeting all possible endogenous and exogenous genes simultaneously for a given nucleic acid sample related to a biological or medical process or pathway. They are characterized by their unique all-purpose generic usage, regardless of genetic variations among cell types, tissues, organs, individuals and species. Moreover, codon-based oligonucleotide design has sequence orientation. A 5′-ATG oriented codon-based oligonucleotide library could be used as a library of upstream primers for PCR. With oligo-d(T)s as the downstream primer, a corresponding cDNA library could subsequently be obtained from a given mRNA sample by RT-PCR. The cDNA library could then be used as a probe library for cDNA Arrays. The protocols for making and using cDNA Arrays are known in the art (World Wide Website: stanford.edu/pbrown). The current invention presents oligonucleotide probes designed according to the template strand of cDNA under the rules of DNA complementarity. Hence, a brief review of gene structures for the probe design would be helpful.
While nucleic acids consist of four nucleotides with four distinct bases: Adenine (A), Thymine (T)/Uracil (U), Guanine (G) and Cytosine (C) respectively, the coding sequences of genes are organized in codons which in turn code for specific amino acids. Codons are arranged in an oriented, consecutive and linear manner with a unique starting and end point.
The codons (genetic code) consist of 64 nucleotide triplets: 61 codons encode the 20 essential L-amino acids (EAA) and three codons are stop codons. 5′-ATG/5′-AUG is the dominant start codon. There are some exceptions: 5′-GTG, 5′-ATA, 5′-TTG, 5′-ACG and 5′-CTG may function as start codons; for example, 5′-ATA serves as a start codon in mammalian mitochondria. The situation is similar for stop codons. There are three dominant stop codons: 5′-TAA/5′-UAA, 5′-TGA/5′-UGA and 5′-TAG/5′-UAG. Exceptions exist. For example, in mammalian mitochondria, 5′-AGA and 5′-AGG are stop codons instead of coding for Arginine.
Although a specific coding region consists of a specific combination of a set of specific codons at a specific length, a given sequence of a given length of the Open Reading Frame (ORF) of a given gene could be identified among the group of linear consecutive DNA sequences consisting of all possible combinations of the 61 codons that encode the 20 EAA. For example, each 5′-terminal sequence of a given ORF has a start codon at its 5′-end. Each 3′-terminal sequence of a given ORF has a stop codon at its 3′-end. Thus, any and all terminal sequences of an ORF of a given length could be deduced from either its 5′-end or 3′-end according to the genetic algorithm of 61.sup.(n−m) under the conditions: n−m=1 or n−m>1, n>m, n−m<infinity, neither n nor m is equal to zero, both n and m are integers, n is the unit of measurement of the length of ORF sequence, n represents the entire length of a given ORF sequence measured by codon or expressed codon (essential amino acid), m represents the length of the pre-determined sequence of terminal orientation for the entire sequence measured by codon or expressed codon (essential amino acid). For example, if 5′-ATG in 5′-ATGGCACTC is the pre-determined sequence of terminal orientation for the entire sequence, then n=3 and m=1. If n=3 and one 5′-ATG is at the 5′-end, 3,721 distinct 5′-ATG oriented oligonucleotide sequences of three codons in length could be deduced according to the algorithm of 61.sup.(n−m). A length of three codons equals nine nucleotides (9 mers). The complete collection of the above 3,721 distinctive 9-mer oligonucleotide sequences forms a 9-mer codon-based oligonucleotide probe library accordingly.
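The 61.sup.(n−m) enumeration above can be sketched in a few lines of Python. This is a minimal illustration only, assuming the standard genetic code with 5′-TAA, 5′-TAG and 5′-TGA as stop codons; the names `SENSE_CODONS` and `atg_oriented_library` are hypothetical and do not come from the specification.

```python
from itertools import product

BASES = "TCAG"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

# All 64 triplets; removing the three stop codons leaves the 61 sense codons.
ALL_CODONS = ["".join(t) for t in product(BASES, repeat=3)]
SENSE_CODONS = [c for c in ALL_CODONS if c not in STOP_CODONS]


def atg_oriented_library(n, anchor="ATG"):
    """Yield every n-codon sequence that begins with `anchor` (m = 1)."""
    for tail in product(SENSE_CODONS, repeat=n - 1):
        yield anchor + "".join(tail)


library = list(atg_oriented_library(3))
print(len(library))            # size equals 61 ** (3 - 1)
print("ATGGCACTC" in library)  # the example sequence from the text
```

Each member is nine nucleotides long, and the library size follows directly from choosing one of 61 codons for each of the n−m free positions.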
The 5′-end terminal sequence of the ORF of a given gene of a given length can be translated into a peptide sequence, which can be identified among the group of peptides of linear consecutive amino acid sequences consisting of all possible combinations of the 20 EAA with an L-amino acid encoded by a start codon at the N-terminal, having the same number of units of length as the corresponding 5′-terminal sequence of the ORF. Methionine is encoded by 5′-ATG. Thus, any and all N-terminal peptide sequences of a given length could be deduced from the N-terminal according to the genetic algorithm of 20.sup.(n−m) as well, under the conditions: n−m=1 or n−m>1, n>m, n−m<infinity, neither n nor m is equal to zero, both n and m are integers, n is the unit of measurement of the length of peptide, n represents the entire length of a given peptide sequence measured by EAA (expressed codon), m represents the length of the pre-determined sequence of terminal orientation for the entire sequence measured by EAA (expressed codon). For example, if Methionine (M) in N-MKS is the pre-determined sequence of terminal orientation for the entire sequence, then n=3 and m=1. If n=6 and one Methionine is at the N-terminal (m=1), 3.2 million distinct N-Methionine oriented peptide sequences of six EAA in length could be deduced according to the algorithm of 20.sup.(n−m). The complete collection of the above 3.2 million distinctive 6-EAA-long peptide sequences forms a hexa-peptide library accordingly.
The 3′-end terminal sequence of the ORF of a given gene of a given length can be translated into a peptide sequence, which can be identified among the group of peptides of linear consecutive amino acid sequences consisting of all possible combinations of the 20 EAA having the same number of units of length as the corresponding 3′-end terminal sequence of the ORF. Thus, any and all C-terminal peptide sequences of a given length could be deduced from the C-terminal according to the genetic algorithm of 20.sup.(n−m)/20.sup.n under the conditions: n−m=1 or n−m>1, m=zero, n<infinity, n is not equal to zero, n is an integer, n is the unit of measurement of the length of peptide, one of the 20 EAA is at the C-terminal of each peptide of n EAA in length. For example, if n=5, 3.2 million distinct peptide sequences of five EAA in length with C-terminal orientation could be deduced according to the algorithm of 20.sup.n. The complete collection of the above 3.2 million distinctive 5-EAA-long peptide sequences forms a penta-peptide library accordingly.
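The peptide counts quoted for the N-terminal and C-terminal libraries can be checked with a short sketch. It assumes the one-letter codes of the 20 amino acids; the function names are illustrative and not taken from the specification.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes, 20 residues


def n_terminal_library_size(n, m=1):
    """Number of n-residue peptides with m fixed N-terminal residues: 20**(n-m)."""
    return 20 ** (n - m)


def c_terminal_library_size(n):
    """Number of n-residue peptides with no fixed residue (m = 0): 20**n."""
    return 20 ** n


def n_terminal_peptides(n, anchor="M"):
    """Yield every n-residue peptide that starts with `anchor`."""
    for tail in product(AMINO_ACIDS, repeat=n - len(anchor)):
        yield anchor + "".join(tail)


print(n_terminal_library_size(6))   # hexa-peptide library, Methionine fixed
print(c_terminal_library_size(5))   # penta-peptide library
print("MKS" in set(n_terminal_peptides(3)))
```

Both libraries have the same size because six residues with one fixed position leave five free positions, the same number as an unconstrained penta-peptide.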
The present invention defines the 5′-start codon sequence as the common border of the ORF and the 5′-Untranslated Region (5′-UTR). Therefore, any 3′-end terminal sequence of the 5′-UTR of a given gene, of a given length and oriented by a start codon at its 3′-end, could be identified among the group of linear consecutive DNA sequences of the same length consisting of all possible combinations of 64 codons with a start codon at the 3′-end. Thus, any and all 3′-end terminal sequences of the 5′-UTR of a given length with a start codon at the 3′-end could be deduced from that 3′-end according to the genetic algorithm of 64.sup.(n−m) under the conditions: n>m, n−m<infinity, neither n nor m is equal to zero, n and m are integers, n is the unit of measurement of the length of 5′-UTR sequence, n represents the entire length of a given 5′-UTR sequence measured by codon, m represents the length of the pre-determined sequence of terminal orientation for the entire 5′-UTR sequence measured by codon. When n=1 and m=1, the position of the codon is (m−n)+1. When n−m>1 and n−m<infinity, the position of the codon is (m−n). The negative sign in front of n indicates that the codon position is in the 5′-UTR. For example, if n=3 and m=1 (one 5′-ATG of 5′ towards 3′ orientation is at the 3′-end), 4,096 distinct 3′-GTA oriented oligonucleotide sequences of three codons in length could be deduced according to the algorithm of 64.sup.(n−m). A length of three codons equals nine nucleotides. The complete collection of the above 4,096 distinctive 9-mer oligonucleotide sequences forms a 9-mer codon-based oligonucleotide probe library accordingly.
The present invention defines a 5′-stop codon sequence as the common border of the ORF and the 3′-Untranslated Region (3′-UTR). Therefore, a 5′-end terminal sequence of the 3′-UTR of a given gene, of a given length and with a stop codon at its 5′-end, can be identified among the group of linear consecutive DNA sequences of the same length consisting of all possible combinations of 64 codons with a stop codon at the 5′-end. Thus, any and all 5′-end terminal sequences of the 3′-UTR of a given length with a stop codon at the 5′-end could be deduced from that 5′-end, including the stop codon, according to the genetic algorithm of 64.sup.(n−m) under the conditions: n−m>1, n−m<infinity, neither n nor m is equal to zero, both n and m are integers, n is the unit of measurement of the length of 3′-UTR sequence, n represents the entire length of a given 3′-UTR sequence measured by codon, m represents the length of the pre-determined sequence of terminal orientation for the entire 3′-UTR sequence measured by codon. For example, if n=3 and m=1 (one 5′-TGA of 5′ towards 3′ orientation is at the 5′-end), 4,096 distinct 5′-TGA oriented oligonucleotide sequences of three codons in length could be deduced according to the algorithm of 64.sup.(n−m). A length of three codons equals nine nucleotides. The complete collection of the above 4,096 distinctive 9-mer oligonucleotide sequences forms a 9-mer codon-based oligonucleotide probe library accordingly.
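The two untranslated-region libraries described above differ only in where the anchoring codon sits. The following sketch, with hypothetical function names, enumerates both using all 64 triplets for the free positions.

```python
from itertools import product

# All 64 triplets are allowed at the free positions of a UTR sequence.
ALL_CODONS = ["".join(t) for t in product("TCAG", repeat=3)]


def utr5_library(n, start="ATG"):
    """n-codon 5'-UTR terminal sequences whose last codon (3'-end) is `start`."""
    for head in product(ALL_CODONS, repeat=n - 1):
        yield "".join(head) + start


def utr3_library(n, stop="TGA"):
    """n-codon 3'-UTR terminal sequences whose first codon (5'-end) is `stop`."""
    for tail in product(ALL_CODONS, repeat=n - 1):
        yield stop + "".join(tail)


five_prime = list(utr5_library(3))
three_prime = list(utr3_library(3))
print(len(five_prime), len(three_prime))  # both equal 64 ** (3 - 1)
```

Reading a 5′-UTR member from its 3′-end gives the 3′-GTA orientation mentioned in the text, since the final codon is 5′-ATG.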
Exceptions exist. For example, 5′-TGA, which usually codes for the termination of the synthesis of a peptide chain, sometimes codes for selenocysteine, an amino acid which is not among the 20 essential amino acids. As other exceptions, 5′-AGA and 5′-ATA are not usable in Micrococcus luteus while 5′-CGG is not usable in Mycoplasmas and Spiroplasmas (Kanoi et al., J. Mol. Bio. 230: 51-56, 1993), (Oba et al., Proc. Natl. Acad. Sci. U.S.A. 88: 921-925, 1991). Both 5′-TAA and 5′-TAG encode Glutamine in Tetrahymena, Paramecium and Acetabularia of Ciliates and Algae while 5′-CTG encodes Serine in Candida cylindrica of Fungi (Tourancheau et al., EMBO J. 14: 3262-3267, 1995). However, all the above genetic algorithms are applicable to those exceptions as long as the corresponding codon(s) are substituted accordingly. Therefore, the corresponding codon-based oligonucleotide probe library could be established as well.
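One way to picture the codon substitution for such exceptions is to make the set of stop codons a parameter. The sketch below is illustrative only: the helper names are hypothetical, and the ciliate-style variant simply follows the statement above that 5′-TAA and 5′-TAG encode Glutamine, leaving 5′-TGA as the sole terminator.

```python
from itertools import product

ALL_CODONS = ["".join(t) for t in product("TCAG", repeat=3)]


def sense_codons(stop_codons):
    """Return the triplets that encode an amino acid under the given stop set."""
    return [c for c in ALL_CODONS if c not in stop_codons]


def library_size(n, stop_codons, m=1):
    """Size of an ORF terminal library with m anchored codons out of n."""
    return len(sense_codons(stop_codons)) ** (n - m)


STANDARD_STOPS = {"TAA", "TAG", "TGA"}
CILIATE_STOPS = {"TGA"}  # assumption: TAA and TAG read as Glutamine

print(library_size(3, STANDARD_STOPS))  # 61 sense codons
print(library_size(3, CILIATE_STOPS))   # 63 sense codons
```

The same enumeration procedure applies in both cases; only the base of the exponent changes with the number of coding triplets.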
Point mutations, deletions, insertions and single nucleotide polymorphisms (SNPs) may occur in the coding region, the 5′-UTR or the 3′-UTR. In terms of functionality, such genetic variation(s) in coding regions are actually change(s) of codon(s) and/or ORF(s). For example, 5′-GCA encodes Alanine. If G, the single nucleotide of the first position of 5′-GCA, is swapped for an alternate (C, A and T), 5′-CCA encodes Proline; 5′-ACA encodes Threonine; 5′-TCA encodes Serine. If C, the single nucleotide of the second position of 5′-GCA, is swapped for an alternate (G, A and T), 5′-GGA encodes Glycine; 5′-GAA encodes Glutamic acid; 5′-GTA encodes Valine. If A, the single nucleotide of the third position of 5′-GCA, is swapped for an alternate (G, C and T), 5′-GCG encodes Alanine; 5′-GCC encodes Alanine; 5′-GCT encodes Alanine. 5′-GGA encodes Glycine. If G, the single nucleotide of the first position of 5′-GGA, is swapped for T, 5′-GGA will become 5′-TGA, a terminator of the peptide chain. 5′-TAA, 5′-TGA and 5′-TAG each encode peptide termination. The substitution of any nucleotide at any position of the triplet codons of the three terminators will turn the terminator into a codon for a specific amino acid or another terminator. For example, if T, the single nucleotide of the first position of 5′-TGA, is swapped for an alternate (G, C and A), 5′-TGA, a terminator of the peptide chain, will become 5′-GGA, 5′-CGA and 5′-AGA, which encode Glycine, Arginine and Arginine respectively. If G, the single nucleotide of the second position of 5′-TGA, is swapped for an alternate (T, C and A), 5′-TGA will become 5′-TTA, 5′-TCA and 5′-TAA, which encode Leucine, Serine and termination respectively. If A, the single nucleotide of the third position of 5′-TGA, is swapped for an alternate (G, C and T), 5′-TGA will become 5′-TGG, 5′-TGC and 5′-TGT, which encode Tryptophan, Cysteine and Cysteine respectively.
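The substitution walk-through above can be reproduced mechanically. The following sketch assumes the standard genetic code written in the conventional T, C, A, G ordering, with `*` marking termination; `CODON_TABLE` and `point_substitutions` are illustrative names.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
RESIDUES = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
            "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(t): aa
               for t, aa in zip(product(BASES, repeat=3), RESIDUES)}


def point_substitutions(codon):
    """Map each single-base variant of `codon` to the residue it encodes."""
    variants = {}
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                variant = codon[:pos] + base + codon[pos + 1:]
                variants[variant] = CODON_TABLE[variant]
    return variants


for codon in ("GCA", "TGA"):
    print(codon, CODON_TABLE[codon], point_substitutions(codon))
```

Every codon has exactly nine single-base variants, so the effect of any point substitution in a coding region can be looked up as a codon change.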
The substitution, replacement, deletion and insertion of single or multiple nucleotide(s) in the coding region could cause the shift of ORF(s) and the change(s) of codon(s), the termination of peptide chain and/or the merger of two or more peptide chains together. In appearance, the point mutation, deletion, insertion and SNPs in the coding region is a change(s) of nucleotide(s). In nature, it is actually a change(s) of codon(s) and/or ORF(s). Therefore, codon-based methods could address the nature of those phenomena more directly in comparison with the nucleotide-based methods.
Because amino acid conservation is reduced near both termini of a peptide chain, a terminal sequence tag (TST) from either the 5′-end or the 3′-end of an ORF, or a combination of both, may have the potential for signature sequence selection. Practically, oligonucleotides ranging from 6 to 24 mers are sufficient to function as probes in hybridization. Therefore, construction of Terminal Sequence Tag (TST) Libraries could become meaningful (Chen et al., Molecular & Cellular Proteomics 2(9): 826, 2003). There is often more than one 5′-ATG codon per single gene; for example, the full-length sequence of Glyceraldehyde-3-phosphate Dehydrogenase (GenBank Accession: NM_002046.2) has ten 5′-ATG codons in its ORF at the first reading. Nevertheless, the first suitable 5′-ATG is usually the start codon. The identification of every 5′-ATG/5′-AUG of a given single gene could facilitate the identification of the start codon and the start site of the ORF of a given gene. Technically, mRNA sequences between 5′-AUG sites and poly(A) could routinely be amplified by RT-PCR and visualized on Agarose gel by electrophoresis. The size of the cDNA fragments on the Agarose gel reflects the length of the targeted sequences. As a rule of thumb, when mRNA samples from human cells are used, start codon sites are more likely to be included in cDNA fragments larger than 0.6K base pairs (b.p.). It is estimated that there are 30,000 to 40,000 expressed genes for the entire human genome (Baltimore, Nature 409: 816-818, 2001). Assuming the average length of an ORF is 1,320 b.p. with 30 5′-ATG sites (World Wide Website: kazusa.or.jp), 900,000 to 1,200,000 possible 5′-ATG sites of ORFs are estimated. The design of codon-based oligonucleotides would allow the production of genome-wide probe libraries containing 226,981 distinctive 12-mer or 13,845,841 distinctive 15-mer oligonucleotide probes respectively (TABLE 16).
The number of the designed probes is sufficient to target those 900,000 to 1,200,000 possible 5′-ATG sites and their immediate downstream sequences by hybridization. Agarose gel electrophoresis could help to filter out many fragments that lack start codon sites, typically those under 0.6K b.p. in size. Technically, the density of 400,000 probes per individual DNA microarray could rise to 40 million probes on one single DNA microarray soon (Gwynne et al., Science 294:641-677, 2001). In particular, using the photolithographic process, a large number of oligonucleotide probes could be synthesized on the surface of a wafer without increasing the cost of microarrays (Fodor et al., U.S. Pat. No. 5,510,270, 1996). In practice, a complete set of 13,845,841 distinctive 5′-ATG oriented oligonucleotide probes (13,845,841.times.40) could be immobilized on 14 individual DNA microarrays in future. Such numbers of probe sequences are no longer astronomical figures in reality.
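The site estimate and the library sizes quoted above follow from simple arithmetic, sketched here with hypothetical helper names. The figures of 30,000 to 40,000 genes and 30 5′-ATG sites per ORF are the assumptions stated in the text.

```python
def atg_site_estimate(expressed_genes, atg_per_orf=30):
    """Estimated number of 5'-ATG sites across all ORFs."""
    return expressed_genes * atg_per_orf


def atg_library_size(n_codons, m=1):
    """Number of 5'-ATG oriented probes of n_codons codons: 61**(n-m)."""
    return 61 ** (n_codons - m)


low, high = atg_site_estimate(30_000), atg_site_estimate(40_000)
lib_12mer = atg_library_size(4)   # 12 nucleotides = 4 codons
lib_15mer = atg_library_size(5)   # 15 nucleotides = 5 codons

print(low, high)
print(lib_12mer, lib_15mer)
print(lib_15mer >= high)  # the 15-mer library outnumbers the upper estimate
```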
Practically, the selection of the initiation site as the targeting site has certain advantages over 5′-cap regions. The method of targeting the 5′-cap region of mRNA is Rapid Amplification of cDNA Ends (RACE). It has been described by Frohman et al., Proc. Natl. Acad. Sci. U.S.A. 85: 8998-9002, 1988; Maruyama et al., Gene 138: 171-174, 1994. RACE uses Calf Intestinal Phosphatase (CIP) to remove 5′-end phosphates of uncapped mRNA molecules while leaving the 5′-capped mRNA intact. Subsequently, Tobacco Acid Pyrophosphatase (TAP) is added to the reaction to remove the 5′-cap of 5′-capped mRNA molecules. After removal, the 5′-phosphate of the uncapped mRNA is exposed to the environment. Then, the oligonucleotide designed as the PCR primer and T4 RNA ligase are added to the reaction. The 3′-hydroxyl group of the oligonucleotide will ligate to the 5′-phosphate group of the mRNA in a reaction catalyzed by T4 RNA ligase. Thus, RACE eliminates the uncapped mRNAs and selects ones with a 5′-cap for further PCR-aided cloning. The disadvantage is that mRNA molecules with full-length sequence without a 5′-cap may be eliminated from samples. Furthermore, it is not unusual for there to be a distance of several hundred base pairs between the 5′-cap and the start codon of mRNA in vertebrates. The relatively rich GC content of the 5′-UTR in many cases suggests that a high degree of secondary structure may exist (Kozak, J. Cell Biol. 115: 887-903, 1991). That may lead to problems which will have a negative impact on PCR priming from the 5′-UTR adjacent to the 5′-cap. It is also noted that non-template nucleotides could be added to the 3′-ends of cDNAs during the RACE process (Chen et al., Biotechniques 30: 574-582, 2001). Chen et al. have recommended that RACE should be used cautiously in determining the terminal sequences of nucleic acids.
The present invention allows the 5′-ATG targeting site to be substituted by any one of the 61 amino acid-coding codons for ORFs or the 64 codons for 5′-UTRs and 3′-UTRs. Based on the present inventive genomic algorithms, a given site of an ORF, 5′-UTR or 3′-UTR and its corresponding downstream or upstream sequences could be targeted specifically by the inventive probes.
A study of the probabilities of priming sites in DNA of 45,000 base pairs indicated that P(0), the probability of no priming site for a 12-mer oligonucleotide, is 0.995. P(1), the probability of exactly one priming site for a 12-mer oligonucleotide, is 0.005. P(>1), the probability of more than one priming site for a 12-mer oligonucleotide, is <10.sup.−4 (Studier, Proc. Natl. Acad. Sci. U.S.A. 86: 6917-6921, 1989). Theoretically, an oligonucleotide with a length of 15 to 18 mers or above should be able to detect a single copy gene from human genomic DNA. In practice, a 12-mer oligonucleotide is capable of detecting an mRNA molecule. Long oligonucleotides (>10 mers) may decrease the specificity if their binding affinity is high (Herschlag et al., Proc. Natl. Acad. Sci. U.S.A. 88: 6921-6925, 1991).
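Figures of this magnitude can be reproduced with a simple Poisson model. This is only an illustrative assumption, not necessarily the derivation used in the cited study: it treats the template as random sequence with equal base frequencies, counts both strands, and ignores edge effects. The function name is hypothetical.

```python
import math


def priming_site_probabilities(template_bp, oligo_len, strands=2):
    """Return (P(0), P(1), P(>1)) for exact matches under a Poisson model."""
    # Expected number of exact matches in random sequence (assumption).
    expected = strands * template_bp / 4 ** oligo_len
    p0 = math.exp(-expected)
    p1 = expected * p0
    return p0, p1, 1.0 - p0 - p1


p0, p1, p_more = priming_site_probabilities(45_000, 12)
print(f"P(0)  = {p0:.3f}")
print(f"P(1)  = {p1:.3f}")
print(f"P(>1) = {p_more:.1e}")
```

Under these assumptions the expected number of sites is well below one, which is why a 12-mer is unlikely to prime at more than one place on a template of this size.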
It is known in the art that an oligonucleotide as short as a 6-mer can perform reliable hybridization (Drmanac et al., DNA and Cell Biology 9: 527-534, 1990) and prime efficiently (Feinberg et al., Anal. Biochem. 132: 6-13, 1983). The results of 6-mer oligonucleotide arrays have been reported (Timofeev et al., Nucleic Acids Res. 29(12): 2626-2634, 2001). The advantage of using short oligonucleotides is their higher capacity for discriminating mismatches in hybridization compared with longer probes (Drmanac et al., DNA and Cell Biology 9: 527-534, 1990). Beattie et al. have demonstrated experimentally that 9-mer oligonucleotides tethered to glass were capable of capturing their complementary DNA strands as long as 1,300 bases in length with good discrimination against mismatches in hybridization (Beattie et al., Mol. Biotechnol. 4: 213-225, 1995). Recent research has demonstrated the usage of 9-mer oligonucleotide arrays in DNA fingerprinting (Reyes-Lopez et al., Nucleic Acids Res. 31(2): 779-789, 2003). A 9-mer oligonucleotide has been proven to be sufficient to perform as a PCR primer in aqueous phase (Williams et al., Nucleic Acids Res. 18: 6531-6535, 1990). Additionally, if Locked Nucleic Acid (hereinafter LNA) is incorporated, short oligonucleotides exhibit increased thermal stability towards complementary DNA and RNA in PCR and hybridization (Babu et al., Nucleic Acids Res. 22: 1317-1319, 2003). Milner et al. speculated that longer oligonucleotides might have internal base pairing which prevents duplex formation, or that duplex formation was inhibited by dangling ends of oligonucleotides that could not fit into the folded structure of mRNA (Milner et al., Nat. Biotechnol. 15: 537-541, 1997). Considering that the probability of forming secondary structure(s) increases with the length of an oligonucleotide, a short oligonucleotide has distinct advantages over a longer one, though longer ones are more specific.
Short oligonucleotides are also relatively inexpensive and suitable for large-scale production.
In the art, some oligonucleotide probes and PCR primers were designed specifically against their corresponding template sequences directly. Others were designed based on nucleotides using the algorithm of 4.sup.n (n is the unit of measurement of the length of oligonucleotide. n represents nucleotide), which has been widely used and has prevailed to date. The algorithm of 4.sup.n has had a fundamental impact on oligonucleotide design and production, though some oligonucleotides were designed systematically and others arbitrarily. Examples are oligonucleotide probes for general usage (Studier, Proc. Natl. Acad. Sci. U.S.A. 86: 6917-6921, 1989) (Szybalski et al., Gene 90: 177-178, 1990), oligonucleotide probes for generic oligonucleotide microarrays (Lipshutz et al., Nature Genetics 21, 20-24, 1999) (Barinaga, Science 253:1489, 1991) as well as PCR primers for RT-PCR differential display (Liang et al., Science 257: 967-971, 1992).
An oligonucleotide library constructed from all possible combinations of A.T.G.C. according to the algorithm of 4.sup.n has been proposed (Studier, Proc. Natl. Acad. Sci. U.S.A. 86: 6917-6921, 1989) (Szybalski, Gene 90: 177-178, 1990). Huse introduced the concept of random tuplets in the method of oligonucleotide synthesis. A tuplet can be a dinucleotide, a trinucleotide or can also be four or more nucleotides (Huse, U.S. Pat. No. 5,523,388, 1996 and U.S. Pat. No. 5,808,022, 1998). The synthesis of a diverse population of expressible oligonucleotides having a desirable bias of random codon sequences, which encode a desirable bias of amino acids, was suggested by Huse (Huse, U.S. Pat. App. No.2001/0024782, 2001) (Huse, U.S. Pat. No. 6,258,530, 2001). Huse proposed an algorithm of 20.sup.n (n is the unit of measurement of the length of oligonucleotide. n represents nucleotide.) for the calculation of all possible combinations of four-nucleotide/bases of oligonucleotide sequences n nucleotides in length. However, neither the algorithm of 4.sup.n nor that of 20.sup.n has orientation capacity. None of the oligonucleotides of an oligonucleotide library constructed in accordance with the algorithm of 4.sup.n or the algorithm of 20.sup.n is able to discriminate the template strand (anti-sense) from the non-template strand (sense) of a DNA double helix, or vice versa, in hybridization. Another disadvantage is that both algorithms inevitably include huge amounts of non-sense codons in the sequences of oligonucleotides that virtually do not exist in ORFs. For example, for 6-mer oligonucleotides (six nucleotides in length), 4,096 (4.sup.6) oligonucleotide sequences were deduced by Studier and Szybalski's method; 64,000,000 (20.sup.6) oligonucleotide sequences were deduced by Huse's method. Only 61 (61.sup.1) 5′-ATG oriented 6-mer oligonucleotide sequences were deduced by the inventive methods.
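The 6-mer comparison, and one way of looking at the orientation argument, can be sketched as follows. The check on reverse complements is an illustrative interpretation rather than a test described in the specification: it asks whether the reverse complement of each library member is itself a member. The standard stop codons are assumed and all names are hypothetical.

```python
from itertools import product

STOP = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
SENSE = ["".join(t) for t in product("TCAG", repeat=3)
         if "".join(t) not in STOP]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


nucleotide_6mers = {"".join(t) for t in product("ACGT", repeat=6)}  # 4**6
atg_6mers = {"ATG" + codon for codon in SENSE}                      # 61**1

# The nucleotide-based set always contains both strands of every member.
closed_generic = all(reverse_complement(s) in nucleotide_6mers
                     for s in nucleotide_6mers)
# Members of the ATG-oriented set whose opposite strand is also a member.
shared_atg = {s for s in atg_6mers if reverse_complement(s) in atg_6mers}

print(len(nucleotide_6mers), 20 ** 6, len(atg_6mers))
print(closed_generic, shared_atg)
```

In this sketch the nucleotide-based set cannot tell the two strands apart because it contains both, whereas only the self-complementary member of the 5′-ATG oriented set shares that property.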
Obviously, the algorithm of 61.sup.(n−1) is the most effective one for designing oligonucleotide libraries. The redundant non-sense codons in probe sequences created by the algorithm of 4.sup.n become a problem when such probes are employed en masse to target ORF sequences on HTS technology platforms such as DNA Microarrays. The negative impacts on noise control, fidelity, reliability and cost-effectiveness can hardly be ignored.
DNA Microarrays, a format of DNA Array technology platforms, are a systematic approach to detecting gene expression patterns in a quantitative, parallel, simultaneous and massive manner (Fodor et al., Science 251: 767-773, 1991); (Schena et al., Science 270: 467-470, 1995); (Fodor et al., U.S. Pat. No. 5,510,270, 1996 and U.S. Pat. No. 5,800,992, 1998); (Southern et al., U.S. Pat. No. 5,700,637, 1997); (Chu et al., Trends in Biotechnol. 17: 217-218, 1999). A microarray usually consists of hundreds to thousands of known DNA sequences immobilized on a miniaturized solid surface as the probes. Each distinctive DNA sequence immobilized has its own well-defined position on the substrate. Through hybridization, DNA Microarrays could identify and demonstrate the responsive sequences, expression dynamics and patterns of genes of a given sample. They can visualize the results of the hybridization of thousands of cDNA molecules in one single experiment. Nucleic acids of given test and control samples are usually labelled beforehand with fluorescent molecules, such as Cy3 and Cy5 respectively. There are cases wherein the nucleic acids of a given sample are radioactively labelled, for example with 33P, 32P and 35S. The oligonucleotides of Oligonucleotide Arrays usually range in length from 4 mers to 80 mers. Though longer oligonucleotides are more specific, they are usually more costly to make and more difficult to synthesize accurately. The oligonucleotides are either pre-synthesized or synthesized in situ. For example, Affymetrix's GeneChip arrays are synthesized by light-directed combinatorial chemical approaches which allow the manufacture of high density oligonucleotide arrays consisting of more than 0.5 million distinctive oligonucleotides on 1.2.times.1.2 cm.sup.2 glass surfaces (Fodor et al., Science, 251: 767-773, 1991). Concerning the probe design, Affymetrix Inc. has developed generic oligonucleotide arrays. The design was based on all possible combinations of four nucleotides or bases (A.T.G.C.) according to the algorithm of 4.sup.n (Lipshutz et al., Nat. Genet. 21: 20-24, 1999). In fact, it is the same model and system proposed by Studier. As a systematic approach, its disadvantages can hardly be ignored. First, the oligonucleotide set or library constructed from all possible combinations of four nucleotides cannot discriminate target sequences among non-coding, coding and regulatory regions. Second, even within a targeted coding region, the template strand (anti-sense) and the non-template strand (sense) would be targeted indiscriminately by those generic oligonucleotides in hybridization. Third, one of the analytical areas of gene functionality is in coding regions, but the algorithm of 4.sup.n is not a codon-based approach. The redundancy is phenomenal and hinders the accuracy of hybridization. It increases the cost of production and complicates the operation. For example, for 24-mer oligonucleotides, the number of all possible combinations of oligonucleotides based on the algorithm of 61.sup.(n−1) is 3,142,742,836,021 [61.sup.(8−1)] while the number of all possible combinations of oligonucleotides based on the algorithm of 4.sup.n is 281,474,976,710,656 (4.sup.24). The relationship between codon and nucleotide regarding the length of an oligonucleotide is as follows: an oligonucleotide n codons in length equals an oligonucleotide 3.times.n nucleotides in length, where n represents codons and 3 multiplied by n represents nucleotides. The nucleotide-based library is thus 89.6 times larger than the set of virtual ORF sequences (TABLE 17). Furthermore, producing a 24-mer oligonucleotide library for oligonucleotide arrays by the present invention is 89.6 times more cost-effective in production than the design based on the algorithm of 4.sup.n. That efficiency will increase further with the length of the oligonucleotide, following the algorithm of 4.sup.(3.times.n) divided by 61.sup.(n−1) (TABLE 17).
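The comparison between the two library sizes can be worked through numerically. The sketch below uses hypothetical function names and simply evaluates 4.sup.(3.times.n) and 61.sup.(n−1) for an oligonucleotide n codons in length.

```python
def nucleotide_library_size(n_codons):
    """All sequences of 3*n nucleotides: 4**(3*n)."""
    return 4 ** (3 * n_codons)


def codon_library_size(n_codons, m=1):
    """All n-codon sequences with m anchored codons: 61**(n-m)."""
    return 61 ** (n_codons - m)


def redundancy_ratio(n_codons):
    """How many times larger the nucleotide-based library is."""
    return nucleotide_library_size(n_codons) / codon_library_size(n_codons)


for n in (2, 4, 8, 10):
    print(f"{3 * n:>2}-mer: 4^{3 * n} = {nucleotide_library_size(n):,}  "
          f"61^{n - 1} = {codon_library_size(n):,}  "
          f"ratio = {redundancy_ratio(n):.1f}")
```

For n = 8 (24 mers) the ratio comes out near 89.6, and each additional codon multiplies it by 64/61, so the gap widens as the oligonucleotide lengthens.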
The redundancy of oligonucleotide sequences of a specified length could be calculated in accordance with the algorithm of 4.sup.(3.times.n)−61.sup.(n−1) (TABLE 17). Since the generic oligonucleotide arrays were constructed according to the algorithm of 4.sup.n, the GC contents among the oligonucleotide probes vary from 0% to 100%. Once thousands of oligonucleotides with variable GC content are immobilized on one piece of solid support, all of them will be exposed to a single hybridization environment. Thus, a considerable number of the oligonucleotide probes may have to hybridize under un-optimized conditions. Consequently, false positive or false negative hybridization results might be produced. Applying 2.4 to 3.0 M tetramethyl ammonium or tetraethyl ammonium chloride (Wood et al., Proc. Natl. Acad. Sci. U.S.A. 82: 1585-1588, 1985) as buffer (Fodor et al., U.S. Pat. No. 6,197,506, 2001) may reduce some effects of the GC bias in hybridization to a certain degree. However, the effect of such reagents has its limitations.
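The spread of GC content across a 4.sup.n library can be tabulated directly. The sketch below does so for 6-mers as a small illustrative case; the function name is hypothetical, and no hybridization behaviour is modelled.

```python
from collections import Counter
from itertools import product


def gc_count_distribution(length):
    """Count how many sequences of `length` bases contain k G or C bases."""
    counts = Counter()
    for bases in product("ACGT", repeat=length):
        counts[sum(b in "GC" for b in bases)] += 1
    return counts


dist = gc_count_distribution(6)
for gc in sorted(dist):
    print(f"{gc / 6:.0%} GC: {dist[gc]} probes")
```

The table runs from 0% to 100% GC, which illustrates why probes drawn from such a library cannot all sit at their optimum under one set of hybridization conditions.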
The standardization and optimization of oligonucleotide probe design are among the major challenges to DNA Array developers. A novel, standardized and universal probing system (libraries) having the capacity of genuine genome-wide screening without any redundancy is desirable. Ideally, standardized oligonucleotide arrays are all-purpose probe platforms. They can be applied regardless of genetic variations among cells, tissues, organs, individuals, species and diverse life processes such as various pathways of both normal and pathological states. They are capable of detecting and targeting all low-abundance transcripts as well as medium and high-abundance transcripts of a given nucleic acid sample at the same time. The targeting range could include all possible endogenous and exogenous genes, known and unknown, simultaneously and systematically for a given nucleic acid sample. Tactically, the one-for-all approach is one of the most effective and economical designs for both users and manufacturers of DNA Arrays.
The citation of a reference herein and hereafter shall not be construed as an admission that such reference is prior art to the present invention.