Recent advances in the field of miniaturization have led to an increase in the speed and efficiency of high-throughput molecular assays. However, when such high-throughput technology is used, manual data analysis is rarely feasible in view of the large number of samples that can be processed during a single experiment. As such, computers play a central role in both the processing and analysis of data generated from high-throughput experiments.
One area where miniaturization has had a profound effect is in the field of nucleic acid research. In particular, improvements in microarray technology and other comparable high-throughput systems have facilitated vast increases in the number of nucleic acid samples that can be simultaneously processed. The field of genotyping has particularly benefited from miniaturization technology.
Genotyping is a branch of nucleic acid research in which a set of genetic markers (loci) in an individual is analyzed to determine the individual's genetic composition. In humans and other organisms, the nucleotide sequence of each genetic locus is largely identical between individuals. However, in some loci there exist one or more portions of nucleotides that exhibit some variation between individuals. Two variants of the same genetic locus are referred to as alleles. The most common type of genetic variation among humans and other organisms is the single nucleotide polymorphism (SNP). A SNP is a single nucleotide variation among individuals in a population that occurs at a specific nucleotide position within a locus. In humans, about 1.42 million SNPs are estimated to be distributed throughout the genome and at least 60,000 of these SNPs are thought to be in the coding portions of genes (The International SNP Map Working Group (2001) Nature 409:928-933). Determining whether an individual possesses one or more of these SNPs can be used to, among other things, determine that individual's risk of having certain diseases as well as determine that individual's relationship to other individuals. Microarray technology permits the analysis of thousands of specific genetic markers from multiple individuals all on a single device.
Due to the large number of DNA samples that are processed using high-throughput technology, automated systems have been heavily utilized to perform many facets of genotyping analyses, including genotype clustering and identification. In such systems where genotyping is automated, it is of paramount importance to have reproducible clusters reflecting whether individuals are homozygous or heterozygous for a particular allele. Depending on the analytical methods used, factors such as intensity changes, cross-talk between channels, and intensity offsets can, if left untreated, alter the location of genotype clusters and thereby skew the results of the genotyping analysis. Accordingly, practitioners have used several methods to compensate for factors that affect the proper clustering of genotypes from a genotyped data set.
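By way of illustration only, the following minimal Python sketch shows how such factors can move genotype clusters. It assumes a simple linear distortion model and an angle-based representation of two-channel signal intensity; the function names, the model, and the numeric values are hypothetical and are not taken from any particular genotyping system.

```python
import math

# Idealized two-channel intensities (allele A channel, allele B channel).
# These values are illustrative assumptions, not measured data.
IDEAL = {"AA": (1.0, 0.0), "AB": (1.0, 1.0), "BB": (0.0, 1.0)}


def distort(x, y, gain=1.0, crosstalk=0.0, offset=0.0):
    """Apply a hypothetical linear distortion to a two-channel signal."""
    x_obs = gain * (x + crosstalk * y) + offset
    y_obs = gain * (y + crosstalk * x) + offset
    return x_obs, y_obs


def allele_angle(x, y):
    """Angle scaled to [0, 1]: 0 is pure allele A signal, 1 is pure allele B."""
    return math.atan2(y, x) / (math.pi / 2)


def cluster_angles(**distortion):
    """Return the angle of each idealized genotype cluster after distortion."""
    return {
        genotype: allele_angle(*distort(x, y, **distortion))
        for genotype, (x, y) in IDEAL.items()
    }


clean = cluster_angles()
shifted = cluster_angles(crosstalk=0.2, offset=0.1)

for genotype in IDEAL:
    print(genotype, round(clean[genotype], 3), round(shifted[genotype], 3))
```

Under these assumptions, untreated cross-talk and offset pull the two homozygous clusters toward the heterozygous cluster, which narrows the separation that a clustering step depends on. A uniform gain change leaves the angle unchanged in this particular representation, which is one reason the effect of each factor depends on the analytical method used.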
One way in which variation from experimental factors is treated is by normalizing raw genotype data based on a set of external control samples. However, such methods rely on the assumption that the nature of the controls does not change from sample to sample. Because this assumption is not usually true, normalization with external controls provides only a marginally effective means to limit data variation. Furthermore, normalizing using external controls can on occasion degrade the quality of the data. As such, there is a need to provide an improved method for normalizing genotyping data.
Another issue associated with automated genotyping systems is that highly accurate genotype calling is not always achieved. For example, one way to evaluate genotype data is by comparing the signal intensity of one allele against another. After normalization, the data points are typically subjected to some form of cluster analysis whereby the data set is divided into specific regions (clusters), each of which is assigned to a specific genotype. However, very few robust methods of accurate genotype clustering currently exist. Problems often arise because certain alleles occur at low frequencies and because biological samples are not necessarily representative of a natural population. As such, it is often difficult to identify whether a particular genotype is represented in the data set (i.e., determine whether one or more clusters are missing), and if not present, where a missing genotype would lie in relation to other genotypes (i.e., predict the location of missing clusters). In cluster-based genotype analysis, improving cluster identification greatly improves the accuracy of genotype calls. Accordingly, there is a need for improved methods of analyzing genotyping data to accurately define and, if necessary, predict genotype clusters.
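The comparison of one allele's signal intensity against another, and the difficulty posed by a missing cluster, can be sketched as follows. This is a deliberately naive illustration that assumes fixed angle thresholds and a biallelic locus; the thresholds, function names, and sample intensities are hypothetical, and the sketch is not one of the improved methods called for above.

```python
import math

GENOTYPES = ("AA", "AB", "BB")


def allele_angle(x, y):
    """Angle scaled to [0, 1]: 0 is pure allele A signal, 1 is pure allele B."""
    return math.atan2(y, x) / (math.pi / 2)


def call_genotype(x, y, low=1 / 3, high=2 / 3):
    """Assign a genotype from two-channel intensity using fixed thresholds.

    The thresholds are illustrative assumptions; real data generally
    require cluster boundaries estimated per locus.
    """
    theta = allele_angle(x, y)
    if theta < low:
        return "AA"
    if theta > high:
        return "BB"
    return "AB"


def missing_clusters(points):
    """List the genotypes that have no data points assigned to them."""
    observed = {call_genotype(x, y) for x, y in points}
    return [g for g in GENOTYPES if g not in observed]


# Hypothetical normalized intensities for a locus with a rare B allele.
points = [(1.0, 0.05), (0.9, 0.1), (1.0, 0.9), (0.8, 1.0)]
calls = [call_genotype(x, y) for x, y in points]

print(calls)
print(missing_clusters(points))
```

In this example no sample falls in the BB region, so that cluster is reported as missing. With fixed thresholds the location of the missing cluster is simply assumed; if the observed clusters were shifted, the same thresholds could misassign samples, which is the difficulty described in the preceding paragraph.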