Many mathematical problems involve analyzing data to determine relationships between variables. For example, in regression analysis an expression can be determined to describe data (which is sometimes referred to as ‘fitting’ the expression to the data). This is shown in FIG. 1A, which presents a drawing 100 illustrating the fitting of a line to data. The equation for a line can be expressed as y=mx+b, where y is the dependent variable, x (the data) is the independent variable, and m and b are unknown coefficients (the slope and y-intercept, respectively) that are to be determined during the fitting. In this example, each datum in the data corresponds to a point in the x-y plane (such as x0, y0).
Typically, the minimum number of data points needed to uniquely determine the fitting equation equals the number of unknowns in the fitting equation (as shown in FIG. 1A, for a line, the minimum number of data points is two). If there are more data points than this minimum number, statistical techniques such as least-squares regression may be used to determine the unknown coefficients. However, if there are fewer data points available than the minimum number, it is typically not possible to uniquely determine the unknowns. This is shown in FIG. 1B, which presents a drawing 150 illustrating the fitting of multiple lines to a datum. In principle, there are an infinite number of equivalent fitting solutions that can be determined. This type of problem is sometimes referred to as ‘sparse’ or ‘underdetermined.’
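As a minimal sketch of the distinction between these cases, the following Python example fits y=mx+b by least squares using NumPy. The helper name fit_line and the sample points are hypothetical and chosen only for illustration. With four points the rank of the system equals the number of unknowns (two), so m and b are uniquely determined. With a single datum the rank is one, and the solver returns only one of the infinitely many lines through that point (the minimum-norm solution).

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = m*x + b. Returns (m, b, rank).

    fit_line is a hypothetical helper used for illustration only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # One row per datum, one column per unknown coefficient (m, b).
    A = np.column_stack([x, np.ones_like(x)])
    coeffs, _residuals, rank, _singular_values = np.linalg.lstsq(A, y, rcond=None)
    return coeffs[0], coeffs[1], rank

# More data points than unknowns: rank 2, so m and b are unique.
m, b, rank = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# A single datum: rank 1 < 2 unknowns, so the problem is underdetermined.
# lstsq returns the minimum-norm choice among infinitely many exact fits.
m1, b1, rank1 = fit_line([1.0], [3.0])
```

Comparing the returned rank with the number of unknowns is one simple way to detect an underdetermined fit before interpreting the coefficients.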
Unfortunately, many interesting problems are underdetermined. For example, in biology, important differences between different individuals' genomes can be described by single nucleotide polymorphisms (SNPs). As shown in FIG. 2, which presents a drawing 200 illustrating a SNP 210, a SNP is a deoxyribonucleic-acid (DNA) sequence variation that occurs when a single nucleotide, such as adenine (A), thymine (T), cytosine (C), or guanine (G), in a chromatid in the genome (or another shared sequence) differs between members of a species (or between paired chromosomes in an individual). For example, two sequenced DNA fragments from different individuals, AA . . . CT . . . CA . . . A and AA . . . TT . . . CA . . . A, differ in a single nucleotide (in this case, there are two alleles, C and T). Variations in the DNA sequences of humans can affect how humans develop diseases and respond to pathogens, chemicals, drugs, vaccines, and other agents. Consequently, there is great interest in identifying associations between SNPs and the expression of such traits or phenotype information in a population of individuals, such as matched cohorts with and without a disease.
However, even after eliminating correlated SNPs using a haplotype map (which includes information about closely related alleles that are inherited as a unit), there may still be several hundred thousand or more SNPs for each individual in a population. In order to identify the associations, these SNPs may be compared to the expression of a trait in the population, such as the occurrence of a disease. Typically, the population may include several thousand individuals. Consequently, identifying the associations involves ‘fitting’ several hundred thousand SNPs (the fitting space) to several thousand data points. This is an extremely underdetermined problem, which increases the complexity, time and expense of identifying the associations.
Furthermore, it is unusual for a disease (or, more generally, an expressed trait) to be associated with a single gene. More typically, the disease is associated with multiple genes (i.e., it is polygenic), as well as one or more environmental factors. In the case of SNPs, including these additional variables and/or combinations of variables causes a power-law increase in the size of the fitting space. If the population size (several thousand people) remains unchanged, the problem becomes vastly underdetermined. Unfortunately, increasing the size of the population is often difficult because of the associated expense and time needed to obtain biological samples.
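The growth of the fitting space can be illustrated with a short counting sketch in Python. The function name fitting_space_size and the figure of 500,000 SNPs are assumptions made for illustration (the text above says only ‘several hundred thousand’). The count also assumes that each candidate term is an unordered combination of distinct variables, which is one of several possible ways to define the fitting space.

```python
from math import comb

def fitting_space_size(n_variables, max_order):
    """Count candidate terms built from 1 up to max_order distinct variables.

    Hypothetical helper: treats each term as an unordered combination,
    which is an illustrative assumption about the fitting space.
    """
    return sum(comb(n_variables, k) for k in range(1, max_order + 1))

n_snps = 500_000  # assumed value for illustration only

singles = fitting_space_size(n_snps, 1)          # individual SNPs only
singles_and_pairs = fitting_space_size(n_snps, 2)  # SNPs plus pairwise combinations
```

Under these assumptions, allowing pairwise combinations grows the fitting space from 500,000 terms to roughly 1.25 x 10^11 terms, while the number of data points (several thousand individuals) is unchanged.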
Therefore, there is a need for an analysis technique to identify associations in underdetermined problems without the problems listed above.