Genomes in a population are polymorphic, giving rise to diversity and variation. In cancer, even somatic cell genomes can rearrange themselves, often resulting in parts of the genome being deleted (hemizygously or homozygously), which decreases the copy number of a gene, or amplified, which increases it. The ability to study these chromosomal aberrations quickly, inexpensively and accurately has important scientific, clinical and therapeutic implications, particularly in the genomics of cancer and inherited diseases. See Lucito et al., 2000, Genome Research 10:1726-1736; Mishra, 2002, Computing in Science and Engineering 4:42-49. Cancer genome-based methods, in contrast to gene expression-based methods, can take advantage of the fact that DNA is a stable component of a cancer cell that does not vary as a function of the cell's physiological state. Karyotyping, determination of ploidy, and comparative genomic hybridization, crude though they may be, provide an arsenal of useful biotechnological tools for this purpose, and they produce data that must be treated with sophisticated statistical algorithms before they can serve as reliable clinical guides for diagnosis and treatment.
Microarray methods appear to be the dominant new technology for studying variations between normal and cancer genomes. One can sample the genome uniformly (independently and identically distributed) and reproducibly to create a large number of oligonucleotide probes, on the order of 100,000, located every 30 Kb or so. These oligonucleotides, which are short subsequences of hundreds of base pairs or fewer, come from regions of the genome that do not share homologous sequences elsewhere in the genome, so each probe is likely to occupy a unique position in the normal genome and is likely to be present in exactly two copies. One such oligonucleotide may belong to a region of the cancer genome that has an altered copy number c, where c ≥ 0 and c ≠ 2. When the cancer genome is sampled, this oligonucleotide would likely occur with a probability that is c/2 times that in the normal genome. Thus, the copy number can be computed from a ratiometric measurement of the abundance of an oligonucleotide in the cancer sample against its abundance in the normal genome. The ideas described for a single oligonucleotide can be generalized to measure the copy number variations for all the probes simultaneously with high-throughput microarray experiments. While the ratiometric measurements and normalizations may minimize the multiplicative noise in the system, a large amount of uncharacterized noise, mostly additive, remains and may render the data worthless unless a proper data-analysis procedure is used. Furthermore, since the data may come from multiple sources, be collected with varying protocols, and be subject to the vagaries of the technologies employed for their collection, the procedure should be general and based on a minimal set of prior assumptions regarding the methods.
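
The ratiometric relationship described above can be illustrated with a minimal sketch in Python. It assumes that per-probe abundance measurements for the cancer and normal samples are already available as two equal-length lists; the function name `copy_number_estimates`, the `normalize` option, and the median-based normalization are hypothetical choices made for illustration and are not taken from any particular microarray pipeline.

```python
from statistics import median

# Each probe is assumed to be present in two copies in the normal genome.
NORMAL_COPIES = 2


def copy_number_estimates(cancer, normal, normalize=False):
    """Estimate per-probe copy number as 2 * (cancer / normal).

    cancer, normal: equal-length sequences of abundance measurements
    (hypothetical units, e.g. hybridization intensities).
    normalize: if True, divide every ratio by the median ratio. This
    assumes that most probes are unaltered, which is an assumption of
    this sketch rather than a statement from the text.
    """
    if len(cancer) != len(normal):
        raise ValueError("cancer and normal must have the same length")
    if any(n <= 0 for n in normal):
        raise ValueError("normal abundances must be positive")

    # Abundance in cancer relative to normal is expected to be c / 2.
    ratios = [c / n for c, n in zip(cancer, normal)]

    if normalize and ratios:
        m = median(ratios)
        if m <= 0:
            raise ValueError("median ratio must be positive to normalize")
        # Removes a global multiplicative scale difference between samples.
        ratios = [r / m for r in ratios]

    return [NORMAL_COPIES * r for r in ratios]


if __name__ == "__main__":
    cancer = [100, 50, 0, 200]
    normal = [100, 100, 100, 100]
    print(copy_number_estimates(cancer, normal))
```

With noise-free inputs such as those in the usage lines, a probe measured at half the normal abundance yields an estimate of one copy, and a probe measured at twice the normal abundance yields four. On real data these raw estimates would still carry the additive noise discussed above, so they are a starting point for statistical analysis rather than a final call.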