The present invention relates to methods and apparatus for analyzing data extracted from biomolecular arrays.
Devices and methods for generating and analyzing arrays of biomolecules such as nucleic acids are known. Using such methods, nucleic acid probes are fabricated or "spotted" at an array of known locations on a substrate. A target sample containing one or more unknown fluorescence-labeled nucleic acids is then introduced onto the substrate and a scanner observes fluorescence resulting from binding or hybridization of the target with probes on the substrate. If the target sample includes RNA sequences taken from an organism or complementary DNA sequences produced from such RNA sequences, such experiments can be used to identify genes expressed in the organism.
Improved methods are needed to reliably evaluate the information resulting from use of these techniques.
The invention features computer-implemented methods and apparatus, including computer program products, operable for extracting, analyzing and comparing molecular affinity data from experiments.
In general, in one aspect, the invention features a method of analyzing an experiment involving the reaction of a target sample with a biomolecular array including one or more probe sets, where each probe set is a collection of one or more probes. The method includes receiving a set of values and using the set of values to generate for each probe set a first probability value indicating the probability that a molecular species complementary to the probe set is not present in the target sample. Each of the set of values corresponds to the interaction of the target sample with one of the probes.
Implementations of the invention can include one or more of the following advantageous features. The first probability value for a probe set is generated by comparing the values for the probes in the probe set with a value for a calibration set. The calibration set includes a null set. Comparing the values for the probes in the probe set with the value for the calibration set includes using a likelihood ratio algorithm. Comparing the values for the probes in the probe set with the value for the calibration set includes using an order statistics algorithm. The method includes using the set of values to generate for at least one probe set one or more second probability values indicating the probability that the molecular species complementary to the probe set is present at a level greater than or less than a reference concentration. The one or more second probability values for a probe set are generated by comparing the values for the probes in the probe set with values for the calibration set. The method includes using the first and second probability values to generate a normalized value for at least one probe set. Using the first and second probability values to generate a normalized value includes generating an interpolated value on a curve defined by the values for the calibration set. The calibration set includes a set of spiked probes. The calibration set includes a set of endogenous housekeeping genes. The collection of probes includes a collection of nucleic acid probes and using the set of values includes generating for at least one probe set a first probability value indicating the probability that a nucleic acid transcript complementary to the probe set is not present in the target sample. The method includes outputting an output file including the normalized value for at least one probe set. The normalized value for the at least one probe set includes a gamma-normalized value for the at least one probe set. 
The normalized value for the at least one probe set includes a principal component score for the at least one probe set. The normalized value for the at least one probe set includes a cumulative distribution function score for the at least one probe set. The method includes outputting an output file comprising the first and second probability values for at least one probe set.
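By way of illustration only, the comparison of a probe set with a null set can be sketched as follows. The sketch is not the likelihood ratio or order statistics algorithm referred to above, whose details are not given in this summary; it substitutes a simple rank-based tail probability. The function name `absence_probability`, the use of the probe-set median as a summary value, and the add-one correction are all assumptions made for the example.

```python
from bisect import bisect_left
from statistics import median


def absence_probability(probe_values, null_values):
    """Rank-based stand-in for a first probability value (assumed method).

    Returns the fraction of null-set values at or above the probe-set
    median, with an add-one correction so the result is never zero.
    """
    if not probe_values or not null_values:
        raise ValueError("both value lists must be non-empty")
    ordered_null = sorted(null_values)
    summary = median(probe_values)
    # Number of null values greater than or equal to the probe-set summary.
    n_at_or_above = len(ordered_null) - bisect_left(ordered_null, summary)
    return (n_at_or_above + 1) / (len(ordered_null) + 1)


# Hypothetical intensities: nine null probes and two probe sets.
null_set = [10, 12, 9, 11, 13, 8, 10, 12, 11]
bright = absence_probability([100, 120, 90], null_set)
dim = absence_probability([9, 10, 11], null_set)
```

Under these assumptions a small value suggests that the probe set is brighter than the null set, and a value near one suggests that it is indistinguishable from it.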
In general, in another aspect, the invention features a method of analyzing a plurality of experiments involving the reaction of a target sample with a biomolecular array including one or more probe sets, where each probe set is a collection of one or more probes. The method includes receiving a plurality of data files, using the values to generate a normalized score for at least one probe set in two or more of the plurality of experiments, and comparing one or more normalized scores for two or more of the plurality of experiments. Each data file includes a set of values for an experiment. Each value corresponds to the interaction of the target sample with one of the probes for one of the plurality of experiments.
Implementations of the invention can include one or more of the following advantageous features. The normalized score includes a gamma-normalized value. The normalized score includes a principal component score. The normalized score includes a common factor score. The normalized score includes an empirical cumulative distribution score. At least one of the plurality of data files is received from a database.
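As a minimal sketch of one of the scores listed above, an empirical cumulative distribution score can be computed within each experiment and then compared across experiments. The function names `ecdf_scores` and `compare_experiments`, the dictionary layout of the data, and the probe set labels are assumptions for illustration and do not come from the description.

```python
from bisect import bisect_right


def ecdf_scores(values_by_probe_set):
    """Map each probe set to the fraction of values in the same experiment
    that are less than or equal to its own value."""
    ordered = sorted(values_by_probe_set.values())
    n = len(ordered)
    return {
        name: bisect_right(ordered, value) / n
        for name, value in values_by_probe_set.items()
    }


def compare_experiments(exp_a, exp_b):
    """Difference in score (b minus a) for probe sets present in both."""
    scores_a, scores_b = ecdf_scores(exp_a), ecdf_scores(exp_b)
    shared = scores_a.keys() & scores_b.keys()
    return {name: scores_b[name] - scores_a[name] for name in shared}


# Hypothetical raw values; experiment b is on a scale ten times larger.
exp_a = {"ps1": 100, "ps2": 200, "ps3": 300, "ps4": 400}
exp_b = {"ps1": 1000, "ps2": 4000, "ps3": 3000, "ps4": 2000}
shift = compare_experiments(exp_a, exp_b)
```

Because the score depends only on rank within an experiment, the overall difference in scale between the two hypothetical experiments does not affect the comparison.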
In general, in another aspect, the invention features a method of generating a calibration set for use in analyzing a plurality of experiments involving the reaction of a target sample with a biomolecular array, where each array includes one or more probe sets and each probe set is a collection of one or more probes. The method includes receiving a plurality of data files, identifying from the data files one or more values corresponding to a set of null probes, a plurality of values corresponding to a predetermined set of spiked probes, and a plurality of values corresponding to a predetermined set of housekeeping probes expected to be present in the target sample for each of the plurality of experiments at a plurality of approximately known concentrations. Each data file includes a set of values for an experiment. Each value corresponds to the interaction of the target sample with one of the probes for one of the plurality of experiments.
Implementations of the invention can include one or more of the following advantageous features. The method includes using the values for the set of null probes, the set of spiked probes and the set of housekeeping probes to calculate a normalized score for each of the probe sets, and using the normalized scores to identify a new set of housekeeping probes.
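One way the identification of a new set of housekeeping probes might proceed is sketched below. The selection criterion is an assumption: the sketch keeps probe sets whose normalized scores vary little across experiments. The function name `select_housekeeping`, the threshold `max_spread`, and the probe set labels are hypothetical.

```python
from statistics import pstdev


def select_housekeeping(scores_by_probe_set, max_spread=0.05):
    """Return probe sets whose normalized scores are stable across experiments.

    scores_by_probe_set maps a probe set label to its normalized scores,
    one per experiment. Stability is measured here by the population
    standard deviation (an assumed criterion).
    """
    stable = [
        name
        for name, scores in scores_by_probe_set.items()
        if len(scores) >= 2 and pstdev(scores) <= max_spread
    ]
    return sorted(stable)


# Hypothetical normalized scores from three experiments.
scores = {
    "ps_a": [0.80, 0.82, 0.81],
    "ps_b": [0.10, 0.90, 0.50],
    "ps_c": [0.60, 0.61, 0.60],
}
candidates = select_housekeeping(scores)
```

Probe sets with only one score are excluded, since stability cannot be judged from a single experiment.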
In general, in another aspect, the invention features a computer program on a computer-readable medium for analyzing an experiment involving the reaction of a target sample with a biomolecular array, where the array includes one or more probe sets and each probe set is a collection of one or more probes. The program includes instructions operable to cause a programmable processor to receive a set of values and use the set of values to generate for at least one probe set a first probability value indicating the probability that a molecular species complementary to the probe set is not present in the target sample. Each value corresponds to the interaction of the target sample with one of the probes.
Implementations of the invention can include one or more of the following advantageous features. The instructions operable to generate the first probability value for a probe set include instructions operable to compare the values for the probes in the probe set with a value for a calibration set. The calibration set includes a null set. The instructions operable to compare the values for the probes in the probe set with the value for the calibration set include instructions to use a likelihood ratio algorithm. The instructions operable to compare the values for the probes in the probe set with the value for the calibration set include instructions to use an order statistics algorithm. The computer program includes instructions operable to cause a programmable processor to use the set of values to generate for at least one probe set one or more second probability values indicating the probability that the molecular species complementary to the probe set is present at a level greater than or less than a reference concentration. The instructions operable to use the set of values include instructions operable to compare the values for the probes in the probe set with values for the calibration set. The computer program includes instructions operable to cause a programmable processor to use the first and second probability values to generate a normalized value for at least one probe set. The instructions operable to use the first and second probability values to generate a normalized value include instructions operable to generate an interpolated value on a curve defined by the values for the calibration set. The calibration set includes a set of spiked probes. The calibration set includes a set of endogenous housekeeping genes. 
The collection of probes includes a collection of nucleic acid probes and the instructions operable to use the set of values include instructions operable to generate for at least one probe set a first probability value indicating the probability that a nucleic acid transcript complementary to the probe set is not present in the target sample. The computer program includes instructions operable to cause a programmable processor to output an output file including the normalized value for at least one probe set. The normalized value for the at least one probe set includes a gamma-normalized value for the at least one probe set. The normalized value for the at least one probe set includes a principal component score for the at least one probe set. The normalized value for the at least one probe set includes a cumulative distribution function score for the at least one probe set. The computer program includes instructions operable to cause a programmable processor to output an output file comprising the first and second probability values for at least one probe set.
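The generation of an interpolated value on a curve defined by the values for the calibration set can be illustrated with a short sketch. It assumes that the calibration set consists of spiked probes with known concentrations, that the curve is piecewise linear, and that values outside the calibrated range are clamped to the nearest end point. None of these choices is specified in this summary, and the name `interpolate_concentration` is hypothetical.

```python
from bisect import bisect_right


def interpolate_concentration(signal, calibration):
    """Piecewise-linear interpolation on a calibration curve (assumed form).

    calibration is a list of (signal, concentration) pairs, for example
    measured values for spiked probes at known concentrations.
    """
    points = sorted(calibration)
    if len(points) < 2:
        raise ValueError("need at least two calibration points")
    signals = [s for s, _ in points]
    # Clamp to the calibrated range rather than extrapolate.
    if signal <= signals[0]:
        return points[0][1]
    if signal >= signals[-1]:
        return points[-1][1]
    # Index of the first calibration signal strictly above the input.
    upper = bisect_right(signals, signal)
    (s0, c0), (s1, c1) = points[upper - 1], points[upper]
    fraction = (signal - s0) / (s1 - s0)
    return c0 + fraction * (c1 - c0)


# Hypothetical calibration points: (measured signal, known concentration).
curve = [(100, 1.0), (200, 2.0), (400, 4.0)]
mid_low = interpolate_concentration(150, curve)
mid_high = interpolate_concentration(300, curve)
```

Clamping is a conservative choice for the example; an implementation could instead report out-of-range values separately.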
In general, in another embodiment, the invention features a computer program on a computer-readable medium for analyzing a plurality of experiments involving the reaction of a target sample with a biomolecular array, where each array includes one or more probe sets, and where each probe set is a collection of one or more probes. The computer program includes instructions operable to cause a programmable processor to receive a plurality of data files, use the values to generate a normalized score for at least one probe set in two or more of the plurality of experiments, and compare one or more normalized scores for two or more of the plurality of experiments. Each data file includes a set of values for an experiment. Each value corresponds to the interaction of the target sample with one of the probes for one of the plurality of experiments.
Implementations of the invention can include one or more of the following advantageous features. The normalized score includes a gamma-normalized value. The normalized score includes a principal component score. The normalized score includes a common factor score. The normalized score includes an empirical cumulative distribution score. At least one of the plurality of data files is received from a database.
In general, in another aspect, the invention features a data analysis system for analyzing an experiment involving the reaction of a target sample with a biomolecular array, where the array includes one or more probe sets, and where each probe set is a collection of one or more probes. The system includes means for receiving a set of values and means for using the set of values to generate for at least one probe set a first probability value indicating the probability that a molecular species complementary to the probe set is not present in the target sample. Each value corresponds to the interaction of the target sample with one of the probes.
Implementations of the invention can include one or more of the following advantageous features. The system includes means for using the set of values to generate for at least one probe set one or more second probability values indicating the probability that the molecular species complementary to the probe set is present at a level greater than or less than a reference concentration.
In general, in another embodiment, the invention features a database on a computer-readable medium. The database includes a set of experimental values derived from an experiment and a normalized value for at least one probe set. The experiment includes the reaction of a target sample with a biomolecular array. The array includes one or more probe sets. Each probe set is a collection of one or more probes. Each value in the set of experimental values corresponds to the interaction of the target sample with one of the probes. The normalized value is derived from the experimental values for the probes of the probe set.
Implementations of the invention may include one or more of the following advantageous features. The normalized values are gamma-normalized values. The normalized values are principal component scores. The normalized values are cumulative distribution function scores. The database includes a set of probability values for at least one probe set for comparing the concentration of a molecular species in the target sample complementary to the probe set with a plurality of concentrations for probes in a calibration set.
Advantages that can be seen in implementations of the invention include one or more of the following. Derivation of probability values from measured properties enables the user to assess the quality of data and the likely error in the experiment. Foreign material, such as exogenous nucleic acid sequences, introduced at known concentrations provides a means for quantifying concentration of complementary species in a target sample. Endogenous nucleic acid sequences, expected or determined to be present at known and stable concentrations across samples, are used to quantify concentration levels for data derived from experiments involving one or more arrays. Statistical methods are used to normalize experimental data, compensating for systematic variation between experiments and allowing the comparison of data from multiple chips, including data resulting from experiments conducted at different times or under different conditions. Storage and use of normalized data enables the user to create and use databases of experimental information, making possible the comparison of large quantities of data from diverse sources. Sets of normalized database information from multiple samples can be used to design virtual or ad hoc experiments involving the comparison of samples selected for specific characteristics. Data used in such experiments can be dynamically normalized across the selected data sets to further improve the accuracy of data comparison and to indicate the quality of the data sets.
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features and advantages of the invention will become apparent from the description, the drawings, and the claims.