In order to discover a gene that causes a complex disease, such as a common disease, and to realize personalized medicine, it is necessary to estimate the haplotype of a human individual from experimental data such as genotype data.
Conventionally, the haplotype across multiple loci is estimated from the genotype data per locus. The genotype data is obtained independently for each locus, so the relationship between the alleles across the loci (the phase) is not known. FIG. 1 is an illustration of an example of the genotype data per locus. In FIG. 1, L and A represent a locus and an allele, respectively.
As shown in FIG. 1, the genotype data per locus contains count number data for each allele at each locus in each individual. The count number is the number of times the allele occurs at the locus in the individual. For example, in FIG. 1, the count number of an allele (A1) in a locus (L1) in an individual 1 is “1”, and the count number of the allele (A1) in a locus (L3) is “2”.
As explained above, the genotype data does not directly specify the phase, so the relationship of the alleles between the loci is unknown. In the example in FIG. 1 (for example, the individual 1), the phase between the loci L1 and L2 cannot be specified from the count number data, and the relationship between the alleles (A1/A2) in the locus L1 and the alleles (A1/A3) in the locus L2 is unknown. Therefore, a method of estimating the haplotype (specifying the phase) is required.
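The phase ambiguity described above can be made concrete with a minimal sketch. It assumes that the genotype of the individual 1 is A1/A2 in the locus L1, A1/A3 in the locus L2, and A1/A1 in the locus L3, as read from the count numbers quoted above. The function name `compatible_haplotype_pairs` and the data layout are hypothetical and are introduced only for illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate the unordered haplotype pairs consistent with an unphased genotype.

    genotype: one (allele, allele) tuple per locus (hypothetical layout).
    """
    pairs = set()
    # At each locus, choose which of the two alleles goes on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = tuple(g[c] for g, c in zip(genotype, choice))
        hap2 = tuple(g[1 - c] for g, c in zip(genotype, choice))
        # Sorting removes the distinction between (hap1, hap2) and (hap2, hap1).
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# Individual 1 as read from the count numbers: L1 = A1/A2, L2 = A1/A3, L3 = A1/A1.
individual_1 = [("A1", "A2"), ("A1", "A3"), ("A1", "A1")]
for hap1, hap2 in compatible_haplotype_pairs(individual_1):
    print(hap1, hap2)
```

Two haplotype pairs are consistent with the same count number data here, because two loci are heterozygous. The genotype data alone cannot distinguish between them.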
The haplotype estimating methods disclosed in Non-patent Documents 1 to 4 estimate the haplotype across the loci from the genotype data per locus. The haplotype across the loci means a combination of the alleles across the loci (a combination that specifies the phase). FIG. 2 is an illustration of an example of a combination of the haplotypes across the loci. In FIG. 2, A(L) represents the allele A that corresponds to the locus L.
As shown in FIG. 2, for example, it is specified that a haplotype 1 has the allele A1 in the locus L1, the allele A1 in the locus L2, and the allele A1 in the locus L3. In the conventional haplotype estimating method, two types of alleles are assumed in general, and the haplotype across the loci is estimated from the genotype data such as those of a single nucleotide polymorphism (abbreviated as “SNP”).
In addition to nucleotide polymorphisms such as the SNP, there is a polymorphism referred to as a copy number polymorphism (or copy number variation, sometimes abbreviated as “CNP” in this description). FIG. 3 is an illustration of an example of the copy number polymorphism and the nucleotide polymorphism. In FIG. 3, M corresponds to a sequence site (marker site) that does not differ between individuals and is identified by a label such as a fluorochrome probe, and F corresponds to a nucleotide (polymorphic nucleotide) that may differ between individuals (distinguished by different fluorochromes and the like).
As shown in FIG. 3, in the copy number polymorphism, the sequence of a certain section (referred to as a “copy unit”) may appear repeatedly, and the copy number differs among individuals. For example, as shown in FIG. 3, among the homologous chromosomes (chromosomes 1 to 4), the copy number is 1 in the chromosome 1, 0 in the chromosome 2, 2 in the chromosome 3, and 3 in the chromosome 4, so that the copy numbers differ from one another.
When a polymorphic nucleotide is present on the copy unit, the count number of the polymorphic nucleotide depends on the copy numbers of the two haplotypes (that is, the diplotype) of the individual, unlike a nucleotide polymorphism in a genomic region in which no copy number polymorphism is present. In such a region, the count number is basically 0, 1, or 2, since a sexually reproducing individual has its chromosomes in pairs as homologous chromosomes. When the nucleotide polymorphism lies on the copy unit, however, the count number of the polymorphic nucleotide varies from individual to individual depending on the copy number, for example, 0, 1, 2, 3, 4, 5, and so on. Consequently, the count number in this case does not directly correspond to the genotype per locus as it does in the conventional technique. The count number in the copy number polymorphism means the count number obtained by counting the polymorphic nucleotide associated with the marker site specified by the label on the copy unit.
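The following minimal sketch illustrates why such a count number does not determine the diplotype. It assumes that each haplotype is represented as a list holding one polymorphic nucleotide per copy of the copy unit, so that the length of the list is the copy number. The labels "F1" and "F2", the function name `count_numbers`, and the example diplotypes are hypothetical and do not come from the figures.

```python
from collections import Counter


def count_numbers(haplotype_a, haplotype_b):
    """Count each polymorphic nucleotide over both haplotypes of a diplotype.

    Each haplotype is a list with one nucleotide label per copy of the copy
    unit (hypothetical representation); an empty list means copy number 0.
    """
    return Counter(haplotype_a) + Counter(haplotype_b)


# Three hypothetical diplotypes with a total copy number of 3.
diplotypes = [
    (["F1"], ["F1", "F2"]),        # copy numbers 1 and 2
    (["F1", "F1"], ["F2"]),        # copy numbers 2 and 1
    ([], ["F1", "F1", "F2"]),      # copy numbers 0 and 3
]
for hap_a, hap_b in diplotypes:
    print(len(hap_a), len(hap_b), dict(count_numbers(hap_a, hap_b)))
```

All three diplotypes give the same count numbers (F1 twice, F2 once), although the copy numbers and the arrangement of the nucleotides on the two haplotypes differ.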
Non-patent Document 5 discloses a haplotype estimating method for the copy number polymorphism, in which the haplotype across the loci is estimated by classifying the alleles into two types: an allele whose copy number is large and an allele whose copy number is small.
As a more general haplotype estimating method for the copy number polymorphism, there is a method of associating the copy number with the type of the allele, assuming multiple types of alleles, and estimating the haplotype across the loci.
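As a minimal sketch of treating the copy number as the type of the allele, the following code lists the allele pairs that are consistent with a total copy number observed in one individual at one locus. It assumes that only the total copy number of the diplotype is observed and that an upper limit on the copy number per allele is given; the function name `copy_number_allele_pairs` and both parameters are hypothetical.

```python
def copy_number_allele_pairs(total_copies, max_copies_per_allele):
    """List unordered pairs of copy-number alleles that sum to the observed total.

    Each allele is identified by its copy number (0, 1, 2, ...).
    """
    return [
        (a, total_copies - a)
        for a in range(total_copies // 2 + 1)
        if total_copies - a <= max_copies_per_allele
    ]


# A total copy number of 3 with alleles of copy number 0 to 3.
print(copy_number_allele_pairs(3, 3))
```

With alleles of copy number 0 to 3, a total copy number of 3 is consistent with the pairs (0, 3) and (1, 2), so the alleles at a single locus are already ambiguous before the phase across the loci is considered.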
Patent Document 1 discloses the use of the Expectation-Maximization (EM) algorithm as a method for calculating the frequency of the haplotype.
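For orientation, a generic EM estimation of haplotype frequencies from unphased genotype data can be sketched as follows. This is not the procedure of Patent Document 1; it is a textbook-style sketch that assumes Hardy-Weinberg equilibrium and exactly two alleles per individual at each locus. The function names, the data layout, and the toy genotypes are hypothetical.

```python
from collections import defaultdict
from itertools import product


def phases(genotype):
    """Unordered haplotype pairs consistent with one unphased genotype."""
    seen = set()
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = tuple(g[c] for g, c in zip(genotype, choice))
        hap2 = tuple(g[1 - c] for g, c in zip(genotype, choice))
        seen.add(tuple(sorted((hap1, hap2))))
    return sorted(seen)


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM, assuming Hardy-Weinberg equilibrium."""
    candidates = [phases(g) for g in genotypes]
    haps = sorted({h for cand in candidates for pair in cand for h in pair})
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting point
    n_chromosomes = 2 * len(genotypes)
    for _ in range(n_iter):
        expected = defaultdict(float)
        for cand in candidates:
            # E-step: posterior probability of each compatible haplotype pair.
            weights = [(1 if h1 == h2 else 2) * freq[h1] * freq[h2]
                       for h1, h2 in cand]
            total = sum(weights)
            for (h1, h2), w in zip(cand, weights):
                expected[h1] += w / total
                expected[h2] += w / total
        # M-step: new frequency = expected haplotype count / chromosome count.
        freq = {h: expected[h] / n_chromosomes for h in haps}
    return freq


# Toy data (hypothetical): two homozygous individuals and one double heterozygote.
genotypes = [
    [("A", "A"), ("B", "B")],
    [("A", "a"), ("B", "b")],
    [("a", "a"), ("b", "b")],
]
for hap, f in em_haplotype_frequencies(genotypes).items():
    print(hap, round(f, 4))
```

In this toy data the double heterozygote is the only ambiguous individual, and the two homozygous individuals pull the estimate toward the phase that reuses their haplotypes.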
Patent Document 1: JP-A-2004-192018
Non-Patent Document 1: Tianhua Niu “Algorithms for inferring haplotypes” Genet Epidemiol., 2004 Dec., 27(4):334-347
Non-Patent Document 2: Zhaohui S. Qin, Tianhua Niu, Jun S. Liu “Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms” Am J Hum Genet., 2002 Nov., 71(5):1242-1247
Non-Patent Document 3: Laurent Excoffier, Montgomery Slatkin “Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population” Mol Biol Evol., 1995 Sep., 12(5):921-927
Non-Patent Document 4: M. E. Hawley, K. K. Kidd “HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes” J Hered., 1995 Sep.-Oct., 86(5):409-411
Non-Patent Document 5: Richard Redon, Shumpei Ishikawa, Karen R. Fitch, et al. “Global variation in copy number in the human genome” Nature, 2006 Nov. 23, 444(7118):444-454