Artificial selection programs are mainly concerned with increasing genetic gain by virtue of the contributions of more genes from “good” ancestors. The traditional means for determining genetic gain expresses gain as the product of selection intensity, accuracy, and genetic standard deviation defined in a single generation. Woolliams et al., Genetics 153, 1009-1020 (1999) showed that the process of contributing genes to a population involves more than a single generation and that sustained gain depends on Mendelian sampling variation entering the population in each generation. Put simply, genetic gain from artificial selection will be related to the genetic long-term contribution of an ancestor to the population as well as the marginal breeding value of an individual, thereby linking genetic gain to pedigree development.
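As a minimal sketch of the traditional single-generation expression described above (gain as the product of selection intensity, accuracy and genetic standard deviation), the following may help. The function name and the numerical inputs are hypothetical and chosen only for illustration; they are not taken from any cited study.

```python
def genetic_gain_per_generation(intensity, accuracy, genetic_sd):
    """Expected single-generation gain: intensity * accuracy * genetic SD."""
    return intensity * accuracy * genetic_sd


# Hypothetical inputs (assumptions, not values from the cited literature):
# selection intensity 1.4, accuracy of selection 0.8, and a genetic
# standard deviation of 500 trait units.
gain = genetic_gain_per_generation(intensity=1.4, accuracy=0.8, genetic_sd=500.0)
print(f"Expected gain per generation: {gain:.1f} trait units")
```

This expression covers only one generation; it does not capture the long-term contributions emphasised by Woolliams et al.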
For centuries, artificial selection has been entirely based on phenotype. Whilst this has proven useful, it is time-consuming and expensive. In particular, artificial selection based on phenotype may use progeny testing wherein the estimated breeding value of an individual is determined by performing multiple matings of the individual and determining the performance of the progeny for a particular trait or phenotypic character. For example, Schaeffer, J. Anim. Breed. Genet. 123, 218-223 (2006) estimated that proving one Holstein bull takes approximately 64 months from conception to first proof, assuming a 9 month gestation period and that young bulls are test mated at one year of age and females are mated at 15 months of age. In this example, the total cost of proving one bull was estimated at about US $40,000, including the cost of housing and feeding the bull, collection and storage of semen, test matings and classification of daughters. However, the cost to an artificial insemination company that bulk purchases young bull calves for stud would be much greater, albeit offset by the return to service of any young bull.
Genomics has provided the prospect of artificial selection based on genotype. A complete genome sequence for a species enables the construction of any number of DNA chips or microarrays of about 10,000 or more nucleic acids, each of which comprises a polymorphic marker. Knowledge of informative alleles, genes, polymorphisms, haplotypes or haplogroups etc. for a particular QTL or trait facilitates the screening of individuals or germplasm and permits their estimated breeding values (EBV) to be determined. This is because genotypic selection relies upon the ability to genotype individuals for specific genes or markers that are either in linkage equilibrium (sparse markers) or linkage disequilibrium (dense markers) with a particular QTL or other locus of interest such that the breeding value of an individual can be estimated using marker haplotypes associated with the QTL or other locus. Genotypic selection is especially powerful where selection is desirably or necessarily independent of expression e.g., in the case of selection on milk production traits in male animals. Genotypic selection need not be pedigree-based when the genotypic associations on which it is based are derived from a current population or, in the case of sparse marker maps, when the genotypic associations are derived from large half-sib family data or limited crosses.
Genotypic selection of “best” individuals can be based upon a score assigned to an informative allele, gene, polymorphism, haplotype or haplogroup etc of the individual alone, or in tandem with phenotype-based EBV or genotype-based EBV. Multiple bases for selection are preferred to minimize the loss in response to polygenes or other QTL. Walsh Theor. Population Biol 59, 175-184 (2001) also suggested that phenotype should remain a component in selection, to capture variation arising from new mutations and to prevent drastic reductions in effective population size, accumulated mutational variance from random genetic drift and the long term rate of response to selection that would otherwise arise from selection targeting specific genotypes.
Genotypic selection is facilitated by computational means, including resampling approaches e.g., randomisation tests and bootstrapping, which allow for the construction of confidence intervals and proper tests of significance e.g., Best Linear Unbiased Predictors (BLUP; Henderson In: “Applications of Linear Models in Animal Breeding”, University of Guelph, Guelph, Ontario, Canada; Lynch and Walsh, In: “Genetics and Analysis of Quantitative Traits”, Sinauer Associates, Sunderland Mass., USA, 1998); the Markov Chain Monte Carlo (MCMC) approach (Geyer et al., Stat. Sci. 7, 473-511, 1992; Tierney et al., Ann. Statist. 21, 1701-1762, 1994; Tanner et al., In: “Tools for Statistical Analysis”, Springer-Verlag, Berlin/New York, 1996); the Gibbs sampler (Geman et al., IEEE Trans. Pattern Anal. Mach. Intell. 6, 721-741, 1984); and the Bayesian posterior distribution (e.g., Smith et al., J. Royal Statist. Soc. Ser. B55, 3-23, 1993). Under Bayesian analysis, semi-subjective probabilities as to a population parameter are assigned to uncertainties and then analyzed and refined with experience, thereby permitting a prior belief about a population parameter to be updated to a posterior belief. For example, resampling-based Bayesian methods for multiple QTL mapping have been proposed by Sillanpaa and Arjas, Genetics 148, 1373-1388 (1998); Sillanpaa and Arjas, Genetics 151, 1605-1619 (1999); and Stephens and Fisch, Biometrics 54, 1334-1347 (1998).
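The updating of a prior belief to a posterior belief can be illustrated with a deliberately simple conjugate model. The sketch below assumes a Beta prior on the frequency of a marker allele and binomially sampled allele counts; this model, the function names and the counts are assumptions made for illustration and are not the MCMC or Gibbs sampling procedures of the works cited above.

```python
def beta_binomial_update(prior_a, prior_b, successes, trials):
    """Return the Beta posterior parameters after observing binomial data."""
    post_a = prior_a + successes
    post_b = prior_b + (trials - successes)
    return post_a, post_b


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical data: a flat Beta(1, 1) prior on an allele frequency, then
# 30 copies of the allele observed among 100 sampled chromosomes.
prior = (1.0, 1.0)
posterior = beta_binomial_update(*prior, successes=30, trials=100)

print("prior mean:    ", beta_mean(*prior))
print("posterior mean:", round(beta_mean(*posterior), 4))
```

With more data the posterior concentrates around the observed frequency, which is the sense in which a belief is "refined with experience".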
Meuwissen et al., Genetics 157, 1819-1829 (2001) simulated a genome of 1000 cM with markers assumed to be in linkage disequilibrium spaced 1 cM apart throughout the genome such that the markers were combined into haplotype pairs surrounding every 1 cM region, and compared least squares, BLUP and Bayesian approaches for estimating the effects of each haplotype pair simultaneously (50,000 haplotype effects in total) i.e., for the whole population and not specific to any one individual; the authors showed that the aggregate EBV could be determined for progeny provided that those animals were genotyped and the marker haplotypes were determined at an accuracy of 0.75-0.85 for all approaches. In this simulation, the effective population size was assumed to be constant.
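Once haplotype effects have been estimated for the population, an aggregate EBV for a genotyped animal can be pictured as the sum of the effects of the haplotypes it carries at each 1 cM segment. The sketch below shows only that summation step. The effect values are random placeholders, and the layout of 50 haplotypes per segment is an assumption chosen so that 1000 segments give 50,000 effects; none of it reproduces the estimation procedures compared by Meuwissen et al.

```python
import random


def aggregate_ebv(paternal, maternal, effects):
    """Sum haplotype effects over segments for both inherited haplotypes.

    paternal / maternal: one haplotype index per genome segment.
    effects: effects[segment][haplotype_index] -> estimated effect.
    """
    return sum(
        effects[seg][p] + effects[seg][m]
        for seg, (p, m) in enumerate(zip(paternal, maternal))
    )


random.seed(0)
N_SEGMENTS = 1000            # one segment per cM of a 1000 cM genome
HAPLOTYPES_PER_SEGMENT = 50  # assumption: 1000 * 50 = 50,000 effects

# Placeholder effects standing in for population-wide estimates.
effects = [
    [random.gauss(0.0, 0.01) for _ in range(HAPLOTYPES_PER_SEGMENT)]
    for _ in range(N_SEGMENTS)
]

# A hypothetical genotyped animal: one paternal and one maternal haplotype
# index at every segment.
paternal = [random.randrange(HAPLOTYPES_PER_SEGMENT) for _ in range(N_SEGMENTS)]
maternal = [random.randrange(HAPLOTYPES_PER_SEGMENT) for _ in range(N_SEGMENTS)]

print("number of haplotype effects:", sum(len(row) for row in effects))
print("aggregate EBV:", aggregate_ebv(paternal, maternal, effects))
```

The effects are shared across the whole population, so only the haplotype indices differ from one animal to the next.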
Sparse marker maps can be constructed using markers in linkage equilibrium and spaced about 20 cM apart based upon large half-sib family data or limited crosses. For example, Georges et al., Genetics 139, 907-929 (1995) prepared a sparse genetic map of genetic markers that resulted in the detection of some QTL for milk production, and the inclusion of marker information into BLUP breeding values predicted a gain of 8-38% (Meuwissen and Goddard, Genet. Sel. Evol. 28, 161-176 (1996)). However, the utility of such information is limited in outbreeding populations because the linkage phase between a marker and QTL must be established for each and every family in which the marker is to be used for selection. Accordingly, there are significant implementation problems with known sparse mapping approaches.
Dense marker maps, generally constructed from single nucleotide polymorphisms (SNPs) and/or microsatellites, provide for mapping of quantitative trait loci (QTL), association studies, and estimates of relatedness between individuals in a sample of a population. With dense marker maps, markers are more likely to be in linkage disequilibrium with a QTL and so more positively associated with a quantitative trait of interest than for a sparse map, such that selection does not require linkage phase to be established for each family. Markers in linkage disequilibrium are generally within about 1 cM to 5 cM of a locus of interest. Moreover, the identification of linkage disequilibrium markers requires candidate genes (Rothschild and Soller, Probe 8, p13, 1997) or fine mapping approaches (Anderson et al., Nature Reviews Genet. 2, 130-138, 2001). Thus, for a genome of about 3000 cM, at least about 3001 markers at 1 cM intervals are needed.
Notwithstanding the ability in principle to produce dense marker maps that cover whole genomes, there are several constraints on the application of such technology. Because there is an absolute requirement for the markers in such maps to be informative, the actual numbers of markers required are much larger than a theoretical minimum. Moreover, there is a need to construct haplotypes inherited from the parent(s) for each contiguous pair of bi-allelic markers; one of four possible informative haplotypes will be linked to a single QTL on average, and the frequencies of each haplotype will vary depending on the frequency of each contributing allele as well as the distance between the markers. This means that sufficient animals must be genotyped to ensure that all haplotypes are represented and their effects determined. The requirement for dense markers means that the number of animals required will also increase depending on genome size. Finally, dense marker maps do not exist for all species.
The high cost of genotyping renders it infeasible to implement all available markers across the genomes of most species. Such costs arise from the initial association of haplotype effects, which is correlated with the constraint referred to in the preceding paragraph, and the unit cost of genotyping an individual to estimate its breeding value. For example, in the case of cattle, Schaeffer, J. Anim. Breed. Genet. 123, 218-223 (2006) has estimated that a minimum of about 10,000 markers in a genome-wide dense marker map would be required, and that the approximate unit cost of genotyping one animal for this number of SNP markers is about US $400. The actual unit cost compares unfavourably with what would be acceptable to industry i.e., about US $20-200 per animal. Moreover, if we assume that the haplotype effects are derived from 50 sire families with 50 sons each (2,500 animals at about US $400 each), the initial cost is closer to US $1,000,000. This cost will naturally increase if additional individuals are genotyped e.g., daughters of the sons in the proofs, in accordance with standard practice. Thus, a genome-wide scheme using dense marker maps is costly to initialize, because of the large numbers of individuals that need to be genotyped to estimate haplotype effects and because of high unit costs. Such high costs hinder industry uptake of the technology. Methods for the cost-effective implementation of genome-wide selection using dense marker maps are not routinely available.
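The arithmetic behind the cost figure above can be checked with a short sketch. The family structure and unit costs are those stated in the paragraph; the function name is hypothetical.

```python
def initial_genotyping_cost(n_sire_families, sons_per_family, unit_cost_usd):
    """Return (animals genotyped, total cost in USD) for the initial scheme."""
    n_animals = n_sire_families * sons_per_family
    return n_animals, n_animals * unit_cost_usd


# 50 sire families with 50 sons each at about US $400 per animal.
n_animals, total_cost = initial_genotyping_cost(50, 50, 400)
print(n_animals, total_cost)  # 2500 1000000

# The same scheme at the unit costs said to be acceptable to industry.
for unit_cost in (20, 200):
    _, cost = initial_genotyping_cost(50, 50, unit_cost)
    print(f"US ${unit_cost} per animal -> US ${cost:,} in total")
```

Genotyping daughters as well would raise the number of animals and therefore the total in direct proportion.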
Several authors have proposed the identification of minimum informative subsets of SNPs that would permit reconstruction of haplotypes inferred by genotyping all other previously-known SNPs in a current population i.e., independent of pedigree, especially with reference to the human genome i.e., “tagging SNPs” (e.g., Avi-Itzhak et al., Proc. Pacific Symposium Biocomputing 8, 466-477, 2003; Hampe et al., Hum. Genet. 114, 36-43, 2003; Ke et al., Bioinformatics 19, 287-288, 2003; Meng et al., Am. J. Hum. Genet. 73, 115-130, 2003; Sebastiani et al., Proc. Natl Acad. Sci USA 100, 9900-9905, 2003; Stram et al., Hum. Heredity 55, 179-190, 2003; Thompson et al., Hum. Heredity 56, 48-55, 2003; Wang et al., Hum. Mol. Genet. 12, 3145-3149, 2003; Weale et al., Am. J. Hum. Genet. 73, 551-565, 2003; Halldórsson et al., Genome Res. 14, 1633-1640, 2006). Such methods require the determination of neighbourhoods of linkage disequilibrium in the genome to thereby determine those SNPs (“tagged SNPs”) that can be used to infer each other (because they are linked). Such neighbourhoods may be haplotype blocks for which two SNPs are considered to be correlated if they occur in the same haplotype block with little evidence of recombination between them (e.g., Johnson et al., Nature Genetics 29, 233-237, 2001; Zhang et al., Am. J. Hum. Genet. 73, 63-73, 2003), or a union of possible haplotype blocks that contain particular SNPs (e.g., Halldórsson et al., Genome Res. 14, 1633-1640, 2006). Alternatively, neighbourhoods are deemed to consist of only those SNPs within a distance of less than 1 LD unit of each other based on metric LD maps (e.g., Maniatis et al., Proc. Natl Acad. Sci USA 99, 2228-2233, 2002).
However, until recently there was no means of defining informativeness of tagged SNPs within the neighbourhoods of linkage disequilibrium i.e., determining how well any tagged SNP would characterize the genetic diversity or variance observed for the neighbourhood, because the models used assumed that the genome regions dealt with were small and not many SNPs were involved. Zhang et al., Am. J. Hum. Genet. 73, 63-73 (2003) proposed a method for dealing with large data sets wherein chromosomes are partitioned into haplotype blocks and a set of tagging SNPs is selected within each block by imposing a cost for not tagging a given SNP in terms of the loss in haplotype diversity. Halldórsson et al., Genome Res. 14, 1633-1640 (2006) suggested an algorithmic framework for defining the informativeness of large SNP datasets in human chromosome 22, using a block-free method for determining neighbourhoods in linkage disequilibrium, which requires haplotype phase data to be available. Basically, the informativeness measure of Halldórsson et al. is calculated by examining haplotype patterns for a set of neighbours of a target SNP, determining those pairs of haplotypes having different alleles at the target SNP, and then determining the proportion of those pairs of haplotypes that do not have the same set of alleles on all SNPs in the set of neighbours. Notwithstanding the advantages of tagging SNPs, such methods still require large numbers of SNPs to be genotyped.
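The informativeness measure as summarised above can be sketched in a few lines, working only from that verbal description. The function name, the encoding of phased haplotypes as strings of 0/1 alleles, the toy data and the convention of returning 1.0 when no pair of haplotypes differs at the target are all assumptions made for illustration; this is not the published implementation.

```python
from itertools import combinations


def informativeness(haplotypes, target, neighbours):
    """Proportion of haplotype pairs differing at the target SNP that also
    differ at one or more SNPs in the neighbour set."""
    differing_at_target = 0
    distinguished = 0
    for h1, h2 in combinations(haplotypes, 2):
        if h1[target] == h2[target]:
            continue  # this pair carries no information about the target
        differing_at_target += 1
        if any(h1[k] != h2[k] for k in neighbours):
            distinguished += 1
    if differing_at_target == 0:
        return 1.0  # assumed convention: nothing needs to be distinguished
    return distinguished / differing_at_target


# Toy phased haplotypes over three bi-allelic SNPs (positions 0, 1, 2).
haplotypes = ["000", "011", "101", "110"]

print(informativeness(haplotypes, target=2, neighbours=[0, 1]))  # 1.0
print(informativeness(haplotypes, target=2, neighbours=[0]))     # 0.5
```

In this toy example the two neighbours together separate every pair of haplotypes that differs at the target SNP, whereas the first neighbour alone separates only half of them.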
Accordingly, there remains a need for informative and cost-effective methods of performing artificial selection using a genomics-based approach.