The relationship between structure and function of macromolecules is of fundamental importance in the understanding of biological systems. These relationships are important to understanding, for example, the functions of enzymes, structural proteins and signaling proteins, ways in which cells communicate with each other, as well as mechanisms of cellular control and metabolic feedback.
Genetic information is critical to the continuation of life processes. Life is substantially information-based, and its genetic content controls the growth and reproduction of the organism and its complements. The amino acid sequences of polypeptides, which are critical features of all living systems, are encoded by the genetic material of the cell. Further, the properties of these polypeptides, e.g., as enzymes, functional proteins, and structural proteins, are determined by the sequence of amino acids that make them up. As structure and function are integrally related, many biological functions may be explained by elucidating the underlying structural features that provide those functions, and these structures are determined by the underlying genetic information in the form of polynucleotide sequences. Further, in addition to encoding polypeptides, polynucleotide sequences can also be involved in the control and regulation of gene expression. It therefore follows that the determination of the make-up of this genetic information has achieved significant scientific importance.
As a specific example, diagnosis and treatment of a variety of disorders may often be accomplished through identification and/or manipulation of the genetic material that encodes specific disease-associated traits. In order to accomplish this, however, one must first identify a correlation between a particular gene and a particular trait. This is generally accomplished by providing a genetic linkage map through which one identifies a set of genetic markers that follow a particular trait. These markers can identify the location of the gene encoding that trait within the genome, eventually leading to the identification of the gene. Once the gene is identified, methods of treating the disorder that results from that gene, i.e., as a result of overexpression, constitutive expression, mutation, underexpression, etc., can be more easily developed.
One class of genetic markers includes variants in the genetic code termed “polymorphisms.” In the course of evolution, the genome of a species can collect a number of variations in individual bases. These single base changes are termed single-base polymorphisms. Polymorphisms may also exist as stretches of repeating sequences that vary as to the length of the repeat from individual to individual. Where these variations are recurring, e.g., exist in a significant percentage of a population, they can be readily used as markers linked to genes involved in mono- and polygenic traits. In the human genome, single-base polymorphisms occur roughly once per 300 bp. Though many of these variant bases appear too infrequently among the allele population for use as genetic markers (i.e., ≤1%), useful polymorphisms (e.g., those occurring in 20 to 50% of the allele population) can be found approximately once per kilobase. Accordingly, in a human genome of approximately 3 Gb, one would expect to find approximately 3,000,000 of these “useful” polymorphisms.
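The estimate above is a simple division of genome size by marker spacing. The following minimal sketch reproduces that arithmetic using only the approximate figures stated in the text; the constant and function names are illustrative and not part of the original disclosure.

```python
# Approximate figures taken from the text above; names are illustrative only.
GENOME_SIZE_BP = 3_000_000_000   # human genome, approximately 3 Gb
ALL_SNP_SPACING_BP = 300         # roughly one single-base polymorphism per 300 bp
USEFUL_SNP_SPACING_BP = 1_000    # roughly one "useful" polymorphism per kilobase


def expected_marker_count(genome_bp: int, spacing_bp: int) -> int:
    """Expected number of markers if one occurs every `spacing_bp` bases."""
    if spacing_bp <= 0:
        raise ValueError("spacing_bp must be positive")
    return genome_bp // spacing_bp


all_polymorphisms = expected_marker_count(GENOME_SIZE_BP, ALL_SNP_SPACING_BP)
useful_polymorphisms = expected_marker_count(GENOME_SIZE_BP, USEFUL_SNP_SPACING_BP)
```

Under these rounded inputs, the "useful" subset is about 30% of all single-base polymorphisms, which is why screening a large number of candidate sites rapidly is the practical bottleneck.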
The use of polymorphisms as genetic linkage markers is thus of critical importance in locating, identifying and characterizing the genes that are responsible for specific traits. In particular, such mapping techniques allow for the identification of genes responsible for a variety of disease- or disorder-related traits, which may be used in the diagnosis and/or eventual treatment of those disorders. Given the size of the human genome, as well as those of other mammals, it would generally be desirable to provide methods of rapidly identifying and screening for polymorphic genetic markers. The present invention meets these and other needs.
One aspect of the invention is an array of oligonucleotide probes for detecting a polymorphism in a target nucleic acid sequence using principal component analysis, said array comprising at least one detection block of probes, said detection block including a first group of probes that are complementary to said target nucleic acid sequence except that the group of probes includes all possible monosubstitutions of positions in said sequence that are within n bases of a base in said sequence that is complementary to said polymorphism, wherein n is from 0 to 5, and a second and third group of probes complementary to marker-specific regions upstream and downstream of the target nucleic acid sequence, wherein the third group of probes differs from the second group of probes at single bases corresponding to known mismatch positions.
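As an illustration of the first group of probes, the sketch below enumerates every single-base substitution at positions within n bases of the polymorphic position. It is a simplified sketch under stated assumptions: the probe is handled as a plain DNA string, the function name `monosubstitution_probes` and its parameters are hypothetical, and the unsubstituted (fully complementary) probe is not included in the output.

```python
BASES = "ACGT"


def monosubstitution_probes(probe: str, poly_index: int, n: int):
    """Enumerate single-base substitutions near a polymorphic position.

    probe      -- probe sequence complementary to the target (assumed input)
    poly_index -- 0-based index of the base complementary to the polymorphism
    n          -- window half-width; the text restricts n to 0..5

    Returns a list of (position, substituted_base, probe_sequence) tuples.
    """
    if not 0 <= n <= 5:
        raise ValueError("n must be from 0 to 5")
    if not 0 <= poly_index < len(probe):
        raise ValueError("poly_index is outside the probe")

    # Clamp the window to the ends of the probe.
    start = max(0, poly_index - n)
    stop = min(len(probe), poly_index + n + 1)

    variants = []
    for pos in range(start, stop):
        for base in BASES:
            if base != probe[pos]:  # skip the unsubstituted base
                variants.append((pos, base, probe[:pos] + base + probe[pos + 1:]))
    return variants


# Example: a 9-mer with the polymorphic position at index 4 and n = 1.
example_probes = monosubstitution_probes("ACGTACGTA", poly_index=4, n=1)
```

With n = 0 only the polymorphic position itself is substituted (three variants); each additional base of window on either side adds three more variants, up to the ends of the probe.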
A further aspect of the invention is a method of identifying whether a target nucleic acid sequence includes a polymorphic variant using principal component analysis, comprising:
hybridizing said target nucleic acid sequence to said array comprising at least one detection block of probes, said detection block including a first group of probes that are complementary to said target nucleic acid sequence except that the group of probes includes all possible monosubstitutions of positions in said sequence that are within n bases of a base in said sequence that is complementary to said polymorphism, wherein n is from 0 to 5, and a second and third group of probes complementary to marker-specific regions upstream and downstream of the target nucleic acid sequence, wherein the third group of probes differs from the second group of probes at single bases corresponding to known mismatch positions; and
determining hybridization intensities of the target nucleic acid and the marker-specific regions to identify said polymorphic variant. In one embodiment of the invention, the step of determining comprises:
a) calculating the control difference as the average of the hybridization intensities of the second group of probes, which comprise the control perfect matches (PM), minus the average of the hybridization intensities of the third group of probes, which comprise the control single-base mismatches (MM);
b) calculating the possible perfect match intensity and a heteromismatch intensity from the hybridization intensities for each position of monosubstitutions of the first group of probes;
c) calculating the difference between the possible perfect match intensity and the heteromismatch intensity for each position of monosubstitutions of the first group of probes;
d) calculating a normalized difference (ND) by dividing the difference of step (c) by the control difference; and
e) using principal component analysis, identifying a polymorphism by comparing normalized differences between individuals in a population.
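Steps (a) through (e) can be sketched in code as follows. This is a minimal illustration and not the disclosed implementation. The text does not specify how the possible perfect match and heteromismatch intensities are derived in step (b); the sketch ASSUMES the possible perfect match is the larger of the two allele-base intensities at a position and the heteromismatch is the mean of the remaining bases. All function names are hypothetical, and the first principal component is found by plain power iteration rather than any particular library routine.

```python
from statistics import mean


def control_difference(control_pm, control_mm):
    """Step (a): mean control perfect-match minus mean control mismatch."""
    return mean(control_pm) - mean(control_mm)


def normalized_differences(position_intensities, alleles, control_diff):
    """Steps (b)-(d) for one individual.

    position_intensities -- one dict per interrogated position, base -> intensity
    alleles              -- the two allele bases, e.g. ("A", "G")
    """
    if control_diff == 0:
        raise ValueError("control difference is zero; cannot normalize")
    nds = []
    for intensities in position_intensities:
        # ASSUMPTION (not from the text): possible PM is the brighter allele
        # probe; heteromismatch is the mean of the non-allele probes.
        possible_pm = max(intensities[b] for b in alleles)
        hetero_mm = mean(v for b, v in intensities.items() if b not in alleles)
        nds.append((possible_pm - hetero_mm) / control_diff)  # steps (c), (d)
    return nds


def first_principal_component(rows, iterations=200):
    """Step (e), in part: project individuals onto the first principal axis.

    rows -- one list of normalized differences per individual (>= 2 rows).
    Returns (unit direction vector, per-individual scores).
    """
    n, d = len(rows), len(rows[0])
    col_means = [mean(r[j] for r in rows) for j in range(d)]
    centered = [[r[j] - col_means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Asymmetric start vector, to reduce the chance of starting exactly
    # orthogonal to the leading eigenvector.
    vec = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0:
            break  # no variance in the data
        vec = [x / norm for x in nxt]
    scores = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, scores
```

In use, `normalized_differences` would be called once per individual, the resulting rows stacked, and `first_principal_component` applied to the stack. Individuals whose scores fall into separate groups along the first component are candidates for different genotypes; how those groups are delimited is left open here, since the text only states that normalized differences are compared between individuals.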