1. Field of the Invention
The present invention relates to a method and a system for evaluating genotyping results for analytic work of determining genotypes that are believed to be involved in differences among individual organisms (e.g., differences in terms of appearance and susceptibility to diseases). In particular, the present invention relates to a method and a system for evaluating results of distinguishing genotype signals from noise signals generated by amplifying DNA fragments that contain genes to be analyzed by PCR and detecting them by electrophoresis.
2. Background Art
Sequence determination of whole genomes of a variety of organisms such as humans has been completed. For organisms such as humans whose genomes have been decoded, genetic analysis studies have been actively conducted on whole genomes and on relatively large regions of such genomes. In particular, in medical studies, techniques for automatically determining many genotypes have been gaining attention for the purpose of identifying genes related to, for example, the presence or absence of diseases and the presence or absence of favorable effects or adverse effects of medicines. In addition, in order to improve determination accuracy, a technique for evaluating automatically determined individual genotypes has been awaited.
Microsatellites
In general, many portions of the genomes of individual organisms belonging to the same species have completely identical nucleotide sequences. However, it is known that some portions of genomes have nucleotide sequences that differ among individuals. Such differences found in nucleotide sequences of individual genomes are referred to as polymorphisms. Several different types of polymorphisms are known to exist. In particular, the use of SNPs (single nucleotide polymorphisms) and microsatellites for analysis studies has been gaining attention.
The term “microsatellite” indicates a sequence in which several to several tens of repetitions of a short sequence pattern of 2 to 6 nucleotides appear. Human genomes contain more than several tens of thousands of microsatellites. FIG. 18 shows examples of microsatellites that appear in genomes. A set of nucleotides repeated in a microsatellite is referred to as a “unit.” The number of nucleotides contained in such a unit is referred to as a “unit length.” For instance, in the case of a microsatellite having a pattern “ATATATAT . . . ” as shown in FIG. 18, the unit consists of “AT” and the unit length is 2 nucleotides. As shown in FIG. 18, there are differences among microsatellites (polymorphisms) having the same unit and the same unit length in terms of the number of units that are repeated.
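The terms "unit," "unit length," and number of repetitions can be illustrated with a minimal Python sketch. The function name `longest_tandem_repeat` and the example sequence are hypothetical and chosen only for illustration; they do not appear in the cited documents.

```python
def longest_tandem_repeat(sequence: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    step = len(unit)  # the "unit length" in nucleotides
    best = 0
    for start in range(len(sequence)):
        count = 0
        pos = start
        # Count consecutive copies of the unit beginning at this position.
        while sequence.startswith(unit, pos):
            count += 1
            pos += step
        best = max(best, count)
    return best


# Hypothetical sequence: flanking nucleotides around four repeats of "AT".
example = "GGC" + "AT" * 4 + "CC"
repeats = longest_tandem_repeat(example, "AT")
```

Here the unit is "AT", the unit length is 2 nucleotides, and `repeats` is the number of units repeated in the example sequence.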
As described above, since SNPs and microsatellites are polymorphic, they are easily distinguishable from other nucleotide sequences in genomes and are readily detectable by experiment. For some biological species, approximate positions of SNPs and microsatellites in genomes are known. Thus, SNPs and microsatellites can be used as positional indicators in genomes. Because of these characteristics, SNPs and microsatellites are referred to as DNA markers. In particular, microsatellites contain a plurality of nucleotides and therefore have greater information content than SNPs. Thus, microsatellites have often been used as DNA markers in genome-wide analysis studies.
As shown in FIG. 18, individuals of many organisms have a diploid genome (homologous chromosomes) derived from a female gamete and a male gamete. Genes that exist at corresponding sites in the diploid genome are called alleles. Such a combination of alleles is referred to as a genotype. As described above, SNPs and microsatellites in genomes are portions having nucleotide sequences that differ among individuals. In general, two or three alleles are found in SNPs, whereas several to 20 or more types of alleles are found in microsatellites.
In the example shown in FIG. 18, individual A has an allele in which the unit “AT” is repeated 3 times and an allele in which the same unit is repeated 5 times, whereas individual B has an allele in which the unit “AT” is repeated 6 times and an allele in which the same unit is repeated 3 times. Individual C has 2 alleles in each of which the unit “AT” is repeated 4 times. The state in which an individual has two different alleles (e.g., individuals A and B) is referred to as heterozygosity. Meanwhile, the state in which an individual has two copies of the same allele (e.g., individual C) is referred to as homozygosity.
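As a minimal sketch, the genotypes of individuals A, B, and C in FIG. 18 can be written as pairs of repeat counts and classified as follows. The representation of a genotype as a pair of integers and the name `zygosity` are assumptions made only for this illustration.

```python
from typing import Dict, Tuple


def zygosity(genotype: Tuple[int, int]) -> str:
    """Classify a genotype given as the repeat counts of its two alleles."""
    first, second = genotype
    return "homozygous" if first == second else "heterozygous"


# Repeat counts of the unit "AT" for the individuals described for FIG. 18.
individuals: Dict[str, Tuple[int, int]] = {
    "A": (3, 5),
    "B": (6, 3),
    "C": (4, 4),
}

classification = {name: zygosity(alleles) for name, alleles in individuals.items()}
```

In this sketch, individuals A and B are classified as heterozygous and individual C as homozygous, in agreement with the description above.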
PCR and Electrophoresis Experimentation
When microsatellites are used as DNA markers, microsatellite portions in a genome are extracted and detected by an experiment involving PCR (polymerase chain reaction), electrophoresis, and the like. PCR is an experimental technique in which a pair of nucleotide sequences at both ends of a microsatellite, called primer sequences, is subjected to a reaction with a DNA replicase so that DNA fragments each comprising the microsatellite sandwiched between the pair of primer sequences are repeatedly replicated and amplified, whereby a sample can be obtained in a certain quantity. Electrophoresis, including gel electrophoresis and capillary electrophoresis, is an experimental technique in which amplified DNA fragments are allowed to migrate in a charged migration path such that DNA fragments having different lengths are separated by their different migration rates, which depend on molecular weight, charge, and the like. FIG. 19 schematically shows experimental procedures for amplifying DNA fragments that are microsatellite portions by PCR and detecting them by gel electrophoresis. First, a pair of primer sequences 1900 and 1901 that sandwich a microsatellite of interest is designated, and a genome region 1902 comprising the microsatellite and the primer sequences is amplified by PCR. FIG. 19 shows an example of a heterozygote, in which two homologous chromosomes differ in the number of repeat units in a microsatellite. Since the homologous chromosomes differ in microsatellite length, two types of PCR amplification products of different lengths, namely DNA fragments containing 52 nucleotides and 48 nucleotides, respectively, are obtained. When these fragments are subjected to gel electrophoresis for a given period of time, the two types of PCR amplification products are separated based on the difference in DNA fragment length. Each DNA fragment is labeled with a fluorescent dye in advance and then subjected to electrophoresis.
Then, the intensities and the positions of the fluorescence signals of the DNA fragments are detected. Thus, as shown in FIG. 19, a graph on which the DNA fragment length (fragment size) and the fluorescence signal intensity (i.e., abundance of DNA fragment) are plotted on the horizontal axis and the vertical axis, respectively, can be obtained. In addition, when PCR amplification products are subjected to electrophoresis simultaneously with DNA fragments with known lengths (size markers) so as to detect fluorescence signals, the length of each PCR amplification product can be obtained based on the position at which a size marker is detected.
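The use of size markers can be sketched as follows, under the assumption that a fragment length is estimated by linear interpolation between the two nearest markers. This interpolation scheme, the function name `estimate_size`, and the marker positions are hypothetical; actual analysis software may use a different sizing method.

```python
from bisect import bisect_right
from typing import Sequence


def estimate_size(position: float,
                  marker_positions: Sequence[float],
                  marker_sizes: Sequence[float]) -> float:
    """Estimate a fragment length from its detected position.

    `marker_positions` must be strictly increasing, and `marker_sizes` gives
    the known length (in nucleotides) of the marker at each position.
    """
    if len(marker_positions) != len(marker_sizes) or len(marker_positions) < 2:
        raise ValueError("need at least two markers with matching sizes")
    if not marker_positions[0] <= position <= marker_positions[-1]:
        raise ValueError("position lies outside the range of the size markers")
    # Find the pair of neighbouring markers that brackets the position.
    i = bisect_right(marker_positions, position)
    i = min(max(i, 1), len(marker_positions) - 1)
    x0, x1 = marker_positions[i - 1], marker_positions[i]
    y0, y1 = marker_sizes[i - 1], marker_sizes[i]
    # Linear interpolation between the two bracketing markers.
    return y0 + (y1 - y0) * (position - x0) / (x1 - x0)


# Hypothetical markers of 40, 50, and 60 nucleotides at arbitrary positions.
positions = [100.0, 200.0, 300.0]
sizes = [40.0, 50.0, 60.0]
longer_allele = estimate_size(220.0, positions, sizes)
shorter_allele = estimate_size(180.0, positions, sizes)
```

The marker values were chosen so that the two estimates correspond to the 52-nucleotide and 48-nucleotide fragments of the FIG. 19 example.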
Experimental techniques involving gel electrophoresis are described above. Such techniques can also be carried out using capillary electrophoresis, in which the length of a DNA fragment is examined by allowing a sample to migrate through a thin tube filled with gel and measuring the period of time required for the sample to migrate a certain distance (normally to the end of the capillary). In capillary electrophoresis, a sample is usually detected using a fluorescence signal detector installed at the end of the capillary, instead of by scanning a fluorescence signal from the sample in gel.
Noise Generated During PCR and Electrophoresis Experiments
The peak results shown in FIG. 19 are obtained when PCR and electrophoresis proceed ideally. In an actual experiment, a variety of noise peaks are generated in many cases. The major noise peaks that complicate interpretation of experimental results include stutter peaks and +A peaks.
As shown in FIG. 20, stutter peaks are generated by a phenomenon in which, upon PCR, a complementary strand of a template strand to be replicated is formed at a position where a continuous repetitive sequence of a microsatellite has slipped, resulting in formation of a hairpin-loop template strand (slipped-strand mispairing). As a result, a replicated DNA fragment has a microsatellite with an increased or decreased number of repeat units, and a noise peak is observed based on the fluorescence signal from the DNA fragment having the increased or decreased number of repeat units. In particular, it is known that such noise peaks tend to be generated when microsatellites having short unit lengths are amplified. In addition to a peak derived from a DNA fragment having the same length as the original DNA fragment, stutter peaks derived from DNA fragments whose lengths have increased or decreased by an integer multiple of the unit length of the microsatellite are observed.
+A peaks are generated by a phenomenon in which an excess nucleotide (normally “A”) is added to a DNA fragment by the action of the replicase upon replication of the DNA fragment by PCR. Thus, a +A peak is observed as a noise peak based on a fluorescence signal from a DNA fragment to which a single nucleotide has been added. Such addition of a single nucleotide occurs to each DNA fragment from which a stutter peak is generated as described above, as well as to the original DNA fragment subjected to replication. Thus, a +A peak is observed at a distance of 1 nucleotide to the right of the true allele peak and of each stutter peak.
FIG. 21 shows a schematic view of a situation in which stutter peaks and +A peaks as described above are observed. FIG. 21 shows a waveform of a heterozygote containing two alleles. The waveform contains two peaks, each of which corresponds to an allele size having the same length as an original DNA fragment subjected to replication (hereafter referred to as a “true peak”). The waveform consists of two sets of peaks, in each of which the center peak is a true peak. The first set of peaks contains stutter peaks that are located at distances of 2 units to the left, 1 unit to the left, and 1 unit to the right of the true peak. The set also contains +A peaks corresponding to the true peak and the stutter peaks. The second set of peaks contains stutter peaks that are located at distances of 1 unit to the left and 1 unit to the right of the true peak. The set also contains +A peaks corresponding to the true peak and the stutter peaks. Hereafter, a true peak or a stutter peak that corresponds to a DNA fragment to which a single nucleotide is not added and that is responsible for generation of a particular +A peak is referred to as an “original peak.”
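The positions at which the peaks of one such set would be expected can be sketched as follows. This is only an illustration of the geometry described above: the function name `expected_peaks`, the true peak size of 52 nucleotides, and the unit length of 2 nucleotides are assumed values, and the sketch says nothing about peak heights.

```python
from typing import Iterable, List, Tuple


def expected_peaks(true_size: int,
                   unit_length: int,
                   stutter_offsets: Iterable[int]) -> List[Tuple[int, str]]:
    """List expected peak sizes for one allele.

    `stutter_offsets` gives stutter positions in repeat units relative to the
    true peak (negative values lie to the left). Each original peak (the true
    peak or a stutter peak) is paired with a +A peak 1 nucleotide to its right.
    """
    peaks = []
    # Offset 0 is the true peak itself.
    for offset in sorted(set(stutter_offsets) | {0}):
        original = true_size + offset * unit_length
        kind = "true" if offset == 0 else "stutter"
        peaks.append((original, kind))
        peaks.append((original + 1, "+A"))
    return peaks


# First set of peaks described for FIG. 21: stutter peaks 2 units to the left,
# 1 unit to the left, and 1 unit to the right of the true peak.
first_set = expected_peaks(true_size=52, unit_length=2, stutter_offsets=(-2, -1, 1))
```

With these assumed values, the original peaks fall at 48, 50, 52, and 54 nucleotides, and the corresponding +A peaks at 49, 51, 53, and 55 nucleotides.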
Non-Patent Document 1 and the like teach methods for determining true peaks from among a plurality of peaks, including noise peaks, in the waveform of a fluorescence signal from a given individual, such signal being obtained during PCR and electrophoresis experiments.
Also, some methods for evaluating genotyping results have been reported and are disclosed in Patent Document 1, Non-Patent Document 1, and the like. In addition, the software “TrueAllele” from Cybergenetics and the software “GeneMapperID” from Applied Biosystems (ABI) are known to have functions for evaluating genotyping results.
[Patent Document 1] JP Patent Publication (Kokai) No. 2006-17461 A
[Non-Patent Document 1] Matsumoto T. et al., “Novel algorithm for automated genotyping of microsatellites,” Nucleic Acids Research, Vol. 32, No. 20 (2004) pp. 6069-6077