The present teachings relate to methods and systems for high-confidence utilization of large-scale datasets.
The recent sequencing of a large number of genomes, including the human genome, and the development of arraying and other high-throughput technologies have led to increasing use of these advances to study organismal-scale data (cells, tissues, organisms, etc.). With these advances and the increasing output of large-scale, high-throughput data, there is an increased need for methods and systems that utilize the data with high confidence (i.e., with reduced false discovery), so that resources can be optimally allocated to the further development of concepts, hypotheses, technologies and products. Many of these technologies were developed in the last decade and their quality is constantly improving, as are the tools used to utilize the datasets and to further refine the technologies. A few concepts and tools are presented here that satisfy some of these needs.
Many systems used in large-scale measurements of organismal or cellular state involve multiple independent measurements of each parameter (e.g., genes, transcripts, proteins, etc.). Two widely used forms of this type of technology are: (i) GeneChip® (Affymetrix, Calif.), in which each transcript of a genome is measured using multiple independent probes, each probe having a corresponding mismatch probe to estimate cross-hybridization, the former being called a perfect match (PM) probe and the latter a mismatch (MM) probe (well described in patents and literature; e.g., U.S. Pat. Nos. 6,551,784 and 6,303,301); and (ii) typical measurements of mixtures of proteins as peptide fragments using several variations of mass spectrometry (e.g., Washburn et al., 2001, and many variations for direct and comparative applications). A variety of applications of this type of multiple independent measurement of each parameter are currently in use, and others can be envisaged. Because prior knowledge is well documented (in literature and in patents) and applications continue to evolve, the use of these technologies and the generation of the data are not described here.
Most biological experiments utilizing such high-throughput data generation systems are conducted with a small number of replicates, owing to limitations of biological and other resources. When possible, the resultant data are analyzed using statistical or mathematical principles (for example, to detect differentials between datasets exploring different conditions) to increase confidence in the downstream steps. However, the small number of replicates significantly reduces the statistical power of the analyses. In principle, utilizing the independent measures of each parameter should alleviate a significant part of this problem, at least in terms of improving power with respect to the technical aspects of all steps of the process (e.g., manufacturing, handling, hybridization, etc.). Utilizing multiple independent measures requires an understanding of the system-specific properties and of the behavior of the different parameters used in such analyses with respect to each other. Conversely, understanding the properties of such datasets would help in designing better measurement technologies.
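As a purely illustrative sketch of how multiple independent measures of one parameter can stand in for missing replicates, the following Python fragment computes a one-sample t statistic over per-probe log2 ratios from a single array per condition. The function name, the probe intensities, and the choice of a t statistic are all assumptions made for illustration; this is not presented as the method of the present teachings, and it ignores any dependence between probes.

```python
import math
from statistics import mean, stdev


def probe_level_t(cond_a, cond_b):
    """Return (mean log2 ratio, t statistic, degrees of freedom).

    cond_a and cond_b hold intensities of the same probes, in the same
    order, measured under two conditions (one array per condition).
    Each probe is treated as an independent measure of the transcript,
    which is an assumption of this sketch.
    """
    log_ratios = [math.log2(b / a) for a, b in zip(cond_a, cond_b)]
    n = len(log_ratios)
    # Standard error of the mean log ratio across probes.
    se = stdev(log_ratios) / math.sqrt(n)
    m = mean(log_ratios)
    return m, m / se, n - 1


# Hypothetical background-corrected intensities for four probes.
cond_a = [100.0, 120.0, 90.0, 110.0]
cond_b = [200.0, 250.0, 170.0, 230.0]

mean_lr, t_stat, df = probe_level_t(cond_a, cond_b)
print(f"mean log2 ratio={mean_lr:.3f}, t={t_stat:.2f}, df={df}")
```

With only one array per condition, a replicate-level test has no degrees of freedom, whereas the probe-level view in this sketch has one fewer than the number of probes. Whether that gain is legitimate depends on the system-specific behavior of the probes noted above.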
Whether the analysis is applied to datasets with design principles similar to the above example (multiple measures of each parameter under each condition) or otherwise, the datasets across different conditions and replicates must first be made comparable. This step in data analysis is usually termed normalization (used in this document to denote the step that follows pre-processing of the data for effects specific to the technological design and data collection, e.g., background correction). A good normalization is a prerequisite to all further analysis and interpretation of the data.
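To make the notion of normalization concrete, a minimal Python sketch of one common and simple scheme, median scaling, is given here. It is an assumed example only: the function name and the intensity values are hypothetical, and no claim is made that this is the normalization used in the present teachings.

```python
from statistics import median


def median_scale(arrays):
    """Scale each array so that its median equals the mean of all medians.

    arrays is a list of lists of background-corrected intensities,
    one inner list per array (condition or replicate).
    """
    medians = [median(a) for a in arrays]
    target = sum(medians) / len(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


# Hypothetical intensities: the second array is uniformly twice as bright.
arrays = [
    [100.0, 200.0, 300.0],
    [200.0, 400.0, 600.0],
]
normalized = median_scale(arrays)
print(normalized)
```

After scaling, the two hypothetical arrays share the same median, so a global brightness difference between them no longer appears as a differential. A single scaling factor per array corrects only such global effects.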
The above brief background outlines the need: the technology is constantly evolving, newer algorithms continue to be proposed, no uniform or consensus approach has been accepted, and even fewer methods are accepted and predictably useful in dealing with multiple independent measures of each parameter (without intermediate processing into a unified model-based summary). This highlights the need for improvements that would satisfy the many emerging needs in efficient and productive utilization of the deluge of data being generated in the life sciences and other fields, and sets the stage for one kind of dataset being part of the invention.