The present invention relates to methods for generating differential expression profiles by combining expression data obtained in separate microarray measurements. The invention also relates to methods for determination and removal or reduction of systematic measurement biases between different microarrays.
DNA array technologies have made it possible to monitor the expression level of a large number of genetic transcripts at any one time (see, e.g., Schena et al., 1995, Science 270:467-470; Lockhart et al., 1996, Nature Biotechnology 14:1675-1680; Blanchard et al., 1996, Nature Biotechnology 14:1649; Ashby et al., U.S. Pat. No. 5,569,588, issued Oct. 29, 1996). Of the two main formats of DNA arrays, spotted cDNA arrays are prepared by depositing PCR products of cDNA fragments with sizes ranging from about 0.6 to 2.4 kb, from full length cDNAs, ESTs, etc., onto a suitable surface (see, e.g., DeRisi et al., 1996, Nature Genetics 14:457-460; Shalon et al., 1996, Genome Res. 6:689-645; Schena et al., 1995, Proc. Natl. Acad. Sci. U.S.A. 93:10539-11286; and Duggan et al., Nature Genetics Supplement 21:10-14). Alternatively, high-density oligonucleotide arrays containing thousands of oligonucleotides complementary to defined sequences, at defined locations on a surface are synthesized in situ on the surface by, for example, photolithographic techniques (see, e.g., Fodor et al., 1991, Science 251:767-773; Pease et al., 1994, Proc. Natl. Acad. Sci. U.S.A. 91:5022-5026; Lockhart et al., 1996, Nature Biotechnology 14:1675; McGall et al., 1996, Proc. Natl. Acad. Sci. U.S.A. 93:13555-13560; U.S. Pat. Nos. 5,578,832; 5,556,752; 5,510,270; and 6,040,138). Methods for generating arrays using inkjet technology for in situ oligonucleotide synthesis are also known in the art (see, e.g., Blanchard, International Patent Publication WO 98/41531, published Sep. 24, 1998; Blanchard et al., 1996, Biosensors and Bioelectronics 11:687-690; Blanchard, 1998, in Synthetic DNA Arrays in Genetic Engineering, Vol. 20, J. K. Setlow, Ed., Plenum Press, New York at pages 111-123). 
Efforts to further increase the information capacity of DNA arrays range from further reducing feature size on DNA arrays so as to further increase the number of probes in a given surface area to sensitivity- and specificity-based probe design and selection aimed at reducing the number of redundant probes needed for the detection of each target nucleic acid thereby increasing the number of target nucleic acids monitored without increasing probe density (see, e.g., Friend et al., U.S. patent application Ser. No. 09/364,751, filed on Jul. 30, 1999; and Friend et al., U.S. patent application Ser. No. 09/561,487, filed on Apr. 28, 2000).
By simultaneously monitoring tens of thousands of genes, DNA array technologies have allowed, inter alia, genome-wide analysis of mRNA expression in a cell or a cell type or any biological sample. Aided by sophisticated data management and analysis methodologies, the transcriptional state of a cell or cell type as well as changes of the transcriptional state in response to external perturbations, including but not limited to drug perturbations, can be characterized on the mRNA level (see, e.g., Stoughton et al., International Publication No. WO 00/39336, published Jul. 6, 2000; Friend et al., International Publication No. WO 00/24936, published May 4, 2000). Applications of such technologies include, for example, identification of genes which are up regulated or down regulated in various physiological states, particularly diseased states. Additional exemplary uses for DNA arrays include the analyses of members of signaling pathways, and the identification of targets for various drugs. See, e.g., Friend and Hartwell, International Publication No. WO 98/38329 (published Sep. 3, 1998); Stoughton, International Publication No. WO 99/66067 (published Dec. 23, 1999); Stoughton and Friend, International Publication No. WO 99/58708 (published Nov. 18, 1999); Friend and Stoughton, International Publication No. WO 99/59037 (published Nov. 18, 1999); Friend et al., U.S. patent application Ser. No. 09/334,328 (filed on Jun. 16, 1999).
The various characteristics of this analytic method make it particularly useful for directly comparing the abundance of mRNAs present in two cell types. For example, an array of cDNAs was hybridized with a green fluor-tagged representation of mRNAs extracted from a tumorigenic melanoma cell line (UACC-903) and a red fluor-tagged representation of mRNAs extracted from a nontumorigenic derivative of the original cell line (UACC-903 +6). Monochrome images of the fluorescent intensity observed for each of the fluors were then combined by placing each image in the appropriate color channel of a red-green-blue (RGB) image. In this composite image, one can see the differential expression of genes in the two cell lines. Intense red fluorescence at a spot indicates a high level of expression of that gene in the nontumorigenic cell line, with little expression of the same gene in the tumorigenic parent. Conversely, intense green fluorescence at a spot indicates high expression of that gene in the tumorigenic line, with little expression in the nontumorigenic daughter line. When both cell lines express a gene at similar levels, the observed array spot is yellow.
In some cases, visual inspection of such results is sufficient to identify genes which show large differential expression in the two samples. A more thorough study of the changes in expression requires the ability to discern quantitatively changes in expression levels and to determine whether observed differences are the result of random variation or whether they are likely to reflect changes in the expression levels of the genes in the samples. Assuming that DNA products from two samples have an equal probability of hybridizing to the probes, the intensity measurement is a function of the quantity of the specific DNA products available within each sample. Locally (or pixelwise), the intensity measurement is also a function of the concentration of the probe molecules. On the scanning side, the fluorescent light intensity also depends on the power and wavelength of the laser, the quantum efficiency of the photomultiplier tube, and the efficiency of other electronic devices. The resolution of a scanned image is largely determined by processing requirements and acquisition speed. The scanning stage imposes a calibration requirement, though it may be relaxed later. The image analysis task is to extract the average fluorescence intensity from each probe site (e.g., a cDNA region).
The measured fluorescence intensity for each probe site comes from various sources, e.g., background, cross-hybridization, hybridization with sample 1 or sample 2. The average intensity within a probe site can be measured by the median image value on the site. This intensity serves as a measure of the total fluors emitted from the sample mRNA targets hybridized on the probe site. The median is used as the average to mitigate the effect of outlying pixel values created by noise.
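By way of illustration only, the use of the median as a noise-tolerant average for a probe site can be sketched as follows. The function name `spot_intensity` and the pixel values are hypothetical and are not taken from any particular image analysis package.

```python
from statistics import median


def spot_intensity(pixels):
    """Return the median pixel value of one probe site."""
    if not pixels:
        raise ValueError("probe site has no pixels")
    return median(pixels)


# Hypothetical pixel values for one spot; 4000 stands in for a noise spike.
pixels = [210, 205, 198, 4000, 202, 207, 201]

site_median = spot_intensity(pixels)
site_mean = sum(pixels) / len(pixels)

print("median:", site_median)  # barely affected by the single outlier
print("mean:  ", site_mean)    # pulled far above the typical pixel value
```

In this made-up example the single outlying pixel shifts the mean well away from the typical pixel value while leaving the median essentially unchanged.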
Typically, in a two-color microarray gene expression experiment, the experiment sample is labeled in one dye color (Cy5, red) and the control sample is labeled in a different color (Cy3, green). The two samples are mixed and hybridized to a microarray slide. After hybridization, the expression intensity in each color is measured with a two-color laser scanner. The experiment is conducted in a biology laboratory (wet lab). To obtain the expression profile, the logarithmic ratio of the two measured intensities (red and green) is computed for each probe site.
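For illustration, the logarithmic ratio of the red and green intensities can be computed per probe site as in the following minimal sketch. The base-10 logarithm, the function name `log_ratio`, the gene names, and the intensity values are assumptions made for the example.

```python
import math


def log_ratio(red, green):
    """Return log10(red / green) for one probe site."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log10(red / green)


# Hypothetical intensities (arbitrary units) keyed by made-up gene names.
red = {"geneX": 2000.0, "geneY": 500.0, "geneZ": 800.0}
green = {"geneX": 500.0, "geneY": 500.0, "geneZ": 1600.0}

# The expression profile is the set of log ratios over all probe sites.
profile = {gene: log_ratio(red[gene], green[gene]) for gene in red}

for gene, value in profile.items():
    print(f"{gene}: {value:+.3f}")
```

A positive value indicates a higher signal in the red (experiment) channel, a negative value a higher signal in the green (control) channel, and zero indicates equal intensities.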
There are two types of biases (errors) that may affect the accuracy of the ratio estimation: inter-slide bias and color bias. Inter-slide bias is the systematic difference between measurements made on two separate slides. The two-color technique avoids the inter-slide error by running the experiment on a single slide. However, different dyes can cause a difference between the two intensity measurements, so that the ratio is biased. To overcome this color bias problem, the experiment can be run twice, with the fluorescent dye labeling of the two samples reversed in the second run. The two expression ratios are then combined to cancel out the color bias. A method for calculating individual errors associated with each measurement made in repeated microarray experiments was also developed. The method offers an approach for minimizing the number of times a cellular constituent quantification experiment must be repeated in order to produce data that has acceptable error levels and for combining data generated in repeats of a cellular constituent quantification experiment based on rank order of up-regulation or down-regulation. See, e.g., Stoughton et al., U.S. patent application Ser. No. 09/222,596 (filed on Dec. 28, 1998).
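One simple way in which a fluor-reversed pair might be combined is sketched below. The sketch assumes, for illustration only, that the color bias of each gene is additive in log-ratio space and identical in both runs; the text above does not prescribe a particular combination rule, and the function name `combine_fluor_reversed` is hypothetical.

```python
import math


def combine_fluor_reversed(lr_forward, lr_reversed):
    """Combine log10(red/green) values from a fluor-reversed pair.

    Assumed model (not taken from the source text):
        lr_forward  = +x + bias   (experiment in red, control in green)
        lr_reversed = -x + bias   (control in red, experiment in green)
    Returns (x, bias): the bias-free log ratio and the estimated color bias.
    """
    x = (lr_forward - lr_reversed) / 2.0
    bias = (lr_forward + lr_reversed) / 2.0
    return x, bias


# Hypothetical gene with true log ratio 0.30 and color bias 0.10.
true_x, true_bias = 0.30, 0.10
forward = true_x + true_bias    # 0.40 as measured in the first run
reverse = -true_x + true_bias   # -0.20 as measured in the dye-swapped run

estimate, bias = combine_fluor_reversed(forward, reverse)
print(f"bias-corrected log ratio: {estimate:.2f}, color bias: {bias:.2f}")
```

Under the assumed model the bias term has the same sign in both runs while the true log ratio changes sign, so half the difference recovers the log ratio and half the sum recovers the bias.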
However, it is often desirable to know, without actually running the experiment in the lab, the difference in expression levels of genes between samples under two different conditions, such as condition A vs. condition B (A vs. B), when only separately measured experimental data A vs. C and B vs. D are available. There is therefore a need for methods for generating differential profiles, such as A vs. B, from separately measured data, such as A vs. C and B vs. D. In particular, because of the systematic measurement errors resulting from variations between two separate experiments, and thus between the two separately measured data sets, there is a need for methods that make use of experimental data for estimating and reducing such systematic errors.
Discussion or citation of a reference herein shall not be construed as an admission that such reference is prior art to the present invention.
The invention provides methods for generating a differential profile A vs. B from measured data obtained under condition A vs. condition C (A vs. CA) and condition B vs. condition C (B vs. CB) measured in two separate experimental reactions. In the methods of the invention, the systematic measurement error or bias between the two different experiments, i.e., cross-experiment errors or biases, is estimated and removed using the data measured with the samples having been subject to the common condition, e.g., condition C. Specifically, a same-type (ST) differential profile CA vs. CB is formed using the two sets of separately measured data of sample having been subject to condition C. The inter-slide bias or error is then corrected by making use of this ST profile. In a preferred embodiment, the invention provides a method for generating an error-corrected differential profile A vs. B from sets of data A, B, CA, and CB, comprising (a) calculating a first differential profile A vs. B; (b) determining a systematic cross-experiment error by a method comprising calculating a reference differential profile CA vs. CB; and (c) generating a second differential profile A vs. B by a method comprising correcting said first differential profile A vs. B using said determined systematic cross-experiment error; wherein said data set A, B, CA or CB comprises respectively data set {A(i)}, {B(i)}, {CA(i)} or {CB(i)} representing measurements of a plurality of different cellular constituents measured in a sample, said sample having been subject to a respective condition A, B, C or C, wherein i=1, 2, . . . , N is the index of measurements of cellular constituents, N being the total number of measurements; wherein data sets A and CA are measured in the same experimental reaction, and data sets B and CB are measured in the same experimental reaction; and wherein said second differential profile is taken as said error-corrected differential profile A vs. B. 
In the methods of the invention, inter-slide error is estimated statistically by C vs. C from a plurality of data points, e.g., array spots. Therefore, in embodiments of the invention, the total number of data points N for each data set used in the methods of the invention is preferably at least 100, more preferably at least 1000, even more preferably at least 10,000. In some embodiments, N is smaller than the total number of spots in the array. In some other embodiments, a data set can contain more than one measurement of the same cellular constituent. For example, a data set of measured levels of gene expression can contain the expression level of a gene measured by two or more different probes for the gene in a microarray. Preferably, the methods are used to generate differential profile A vs. B when both CA and CB are labeled with the same fluorophore. However, the methods can also be used to generate differential profile A vs. B when CA and CB are labeled with different fluorophores. In such embodiments, it is preferable that the fluorophore bias between CA and CB is removed before they are used in generating the ST profile CA vs. CB. More preferably, the methods are used to generate differential profile A vs. B when A and B are labeled with a first fluorophore and CA and CB are labeled with a second fluorophore which is different from the first fluorophore.
In one embodiment, the inter-slide bias is removed by subtracting the ST log ratio CA vs. CB from the log ratio A vs. B. The subtraction is carried out by minimizing an objective function, i.e., a log-ratio-error normalized log-ratio difference weighted by a factor w, for the inter-slide error minimization process. In another embodiment, the inter-slide bias is removed by subtracting the ST arithmetic difference CA vs. CB, i.e., CB−CA, from the arithmetic difference A vs. B. The subtraction, including scaling of the ST profile, is carried out by a method similar to the method for subtraction of log(ratio). In still another embodiment, the inter-slide bias is removed by subtracting the ST ratio CA vs. CB, i.e., CB/CA, from the ratio A vs. B. The subtraction, including scaling of the ST profile, is carried out by a method similar to the method for subtraction of log(ratio).
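The log-ratio embodiment can be illustrated by the following simplified sketch. It performs a plain, unweighted subtraction of the ST log ratio and omits the error-normalized objective function and the weighting factor w, which are not specified in this section. The sketch reads "X vs. Y" as Y/X, following the "CB/CA" convention used above, and assumes for illustration that each slide contributes a multiplicative gain shared by all spots on that slide. The function name and all data values are hypothetical.

```python
import math


def corrected_log_ratios(A, B, CA, CB):
    """Subtract the ST log ratio log10(CB/CA) from log10(B/A), spot by spot."""
    if not (len(A) == len(B) == len(CA) == len(CB)):
        raise ValueError("all data sets must have the same length N")
    corrected = []
    for a, b, ca, cb in zip(A, B, CA, CB):
        raw = math.log10(b / a)    # first differential profile A vs. B
        st = math.log10(cb / ca)   # same-type reference profile CA vs. CB
        corrected.append(raw - st)
    return corrected


# Hypothetical true abundances under conditions A, B and common condition C.
a_true = [100.0, 200.0, 400.0]
b_true = [200.0, 200.0, 100.0]
c_true = [150.0, 150.0, 150.0]

# Assumed slide-wide gains: the second experiment reads twice as bright.
gain_1, gain_2 = 1.0, 2.0
A = [gain_1 * v for v in a_true]
CA = [gain_1 * v for v in c_true]
B = [gain_2 * v for v in b_true]
CB = [gain_2 * v for v in c_true]

print([round(v, 3) for v in corrected_log_ratios(A, B, CA, CB)])
```

Because A and CA share one gain and B and CB share the other, the gain ratio appears identically in log10(B/A) and log10(CB/CA) and cancels in the subtraction, leaving the log ratio of the true abundances.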
In preferred embodiments, the generated expression profile A vs. B is further corrected for fluorophore bias. As described supra, the two-color fluorescent hybridization process introduces bias into the profile analysis because each species of mRNA that is labeled with a fluorophore has a bias in its measured color ratio due to interaction of the fluorescent labeling molecule (fluorophore) with either the reverse transcription of the mRNA or with the hybridization efficiency or both. Such a bias is also present in the generated expression profile A vs. B if samples under conditions A and B are labeled with different fluorophores. Thus, in one embodiment, if the fluor-reversed profile B vs. A is also generated, the fluorophore bias is removed by combining the pair of fluor-reversed profiles using any method known in the art.
The invention also provides methods for generating differential expression profile A(T1) vs. A(T2) from data measured at different hybridization times T1 and T2, i.e., different lengths of hybridization durations, in two separate measurements, thereby comparing expression data measured at the two hybridization times. In one embodiment, a differential expression profile A(T1) vs. A(T2) is generated from data sets A(T1) and A(T2) measured in single-channel experiments of A at hybridization times T1 and T2. In another embodiment, a differential expression profile A(T1) vs. A(T2) is generated from A(T1) vs. C(T1) and A(T2) vs. C(T2) measured in two separate two-channel experiments of A vs. C at hybridization times T1 and T2. Such methods are useful when changes in hybridization levels in time are to be determined, e.g., in methods in which hybridization kinetics is used for distinguishing hybridization specificity at different hybridization times. In preferred embodiments, the first hybridization level can be measured at a hybridization time of between 1 and 10 hours, whereas the second hybridization level can be measured at a hybridization time about 2, 4, 6, 10, 12, 16, 18, 48 or 72 times as long as the first hybridization time. The invention thus provides a method for correcting any systematic errors that may arise between measurements carried out at different hybridization times.
In another embodiment, the invention provides a method for controlling the quality of a microarray slide production process. The method is based on comparing two-channel measured data of samples under the same condition, e.g., C vs. C. In the method, one good quality slide is selected to serve as a standard. A second microarray slide is then randomly selected from a batch of production slides. Two identical same-type virtual experiments C vs. C for both slides are then generated. A quantitative production quality control process is established by first computing a correlation coefficient using the intensity ratio of the first virtual experiment (C vs. C with color label 1) and the intensity ratio of the second virtual experiment (C vs. C with color label 2) by an inter-slide correlation method, and then judging the quality of microarray slides by using a predetermined range of correlation coefficient. For example, the range of acceptable correlation coefficient can be set to be between −0.5 and 0.5.
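One possible reading of this quality control test is sketched below for illustration. The sketch assumes that each virtual experiment compares the standard slide with the test slide in one color channel, that the spot-wise intensity ratios of the two channels are compared by a Pearson correlation coefficient, and that a slide is accepted when the coefficient falls within the exemplary range of −0.5 to 0.5. The inter-slide correlation method itself is not specified in this section, and the function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


def slide_passes(std_c1, test_c1, std_c2, test_c2, low=-0.5, high=0.5):
    """Return (accepted, r) for a test slide compared with a standard slide."""
    ratio_1 = [t / s for s, t in zip(std_c1, test_c1)]  # color label 1
    ratio_2 = [t / s for s, t in zip(std_c2, test_c2)]  # color label 2
    r = pearson(ratio_1, ratio_2)
    return low <= r <= high, r


# Hypothetical standard-slide intensities in the two color channels.
std_c1 = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
std_c2 = [110.0, 190.0, 310.0, 390.0, 510.0, 590.0]

# Slide with small, channel-independent fluctuations around the standard.
noise_1 = [1.02, 0.98, 1.01, 0.99, 1.03, 0.97]
noise_2 = [1.01, 1.01, 0.98, 0.98, 1.01, 1.01]
good = slide_passes(std_c1, [s * f for s, f in zip(std_c1, noise_1)],
                    std_c2, [s * f for s, f in zip(std_c2, noise_2)])

# Slide with spot defects that depress both channels at the same spots.
defect = [1.0, 0.5, 1.0, 0.4, 1.0, 0.6]
bad = slide_passes(std_c1, [s * f for s, f in zip(std_c1, defect)],
                   std_c2, [s * f for s, f in zip(std_c2, defect)])

print("good slide accepted:", good[0])
print("defective slide accepted:", bad[0])
```

Under these assumptions, channel-independent noise yields a coefficient near zero, whereas a defect in the slide itself affects both channels at the same spots and drives the coefficient toward 1, outside the acceptance range.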
The invention also provides methods for generating differential profiles using data from two separate single channel measurements, e.g., measurements from two microarray slides. In one embodiment, an expression profile A vs. B is generated by combining data from two measured single-channel data A and B. In another embodiment, an expression profile A vs. B is generated by combining single-channel data A and B picked up from the separately measured two-channel data A vs. C and B vs. D. In still another embodiment, an expression profile A vs. B is generated by combining single-channel data A and B picked up from two separately measured N-channel data, one containing A and one containing B. In preferred embodiments, data A and B are from channels of the same color in two different slides. Measurement errors in the two channels are removed by removing the additive noise in both channels. When A and B are measured in channels of different colors, color bias is also removed. In a preferred embodiment, the invention provides a method for generating a differential profile A vs. B from data sets A and B, comprising (a) determining mean background noise levels Abkg and Bbkg, and background noise residue ABres, from measured background noise levels in data sets A and B, respectively; (b) calculating noise-removed data sets A and B, respectively, by a method comprising (b1) removing said mean background noise level from said data sets A and B, and (b2) removing said background noise residue from said data sets A and B, respectively; and (c) generating said differential profile A vs. B from said noise-removed data sets A and B; wherein said data set A or B comprises respectively data set {A(i), Abkg(i)} or {B(i), Bbkg(i)} representing measurements of a plurality of different cellular constituents in a sample, said sample having been subject to condition A or B, respectively; wherein Abkg(i) or Bbkg(i) is said measured background noise level of measurement of cellular constituent i in said data set A or B, respectively; and wherein i=1, 2, . . . , N is the index of measurements of cellular constituents, N being the total number of measurements. In some embodiments, the procedure for removing the background noise residue from data sets A and B is carried out once. In preferred embodiments, the procedure is repeated several times, such as 5, 10, or 20 times, to further reduce any remaining residuals. In one embodiment, the sample having been subject to condition A and the sample having been subject to condition B are labeled with the same fluorophore. In another embodiment, the sample having been subject to condition A is labeled with a first fluorophore and the sample having been subject to condition B is labeled with a second fluorophore, and the second fluorophore is different from the first fluorophore.
The invention also provides a computer system for carrying out the method of the invention of generating a differential profile, said computer system comprising a processor, and a memory coupled to said processor and encoding one or more programs, wherein said one or more programs cause the processor to carry out any of the methods of the present invention.
The invention also provides a computer program product for use in conjunction with the computer system of the invention having a processor and a memory connected to the processor, said computer program product comprising a computer readable storage medium having a computer program mechanism encoded thereon, wherein said computer program mechanism may be loaded into the memory of said computer and cause said computer to carry out any of the methods of the present invention.