One of the main objectives of plant and animal improvement is to obtain new cultivars or breeds that are superior in terms of desirable target features such as yield, grain oil content, disease resistance, and resistance to abiotic stresses.
A traditional approach to plant and animal improvement is to select individual plants or animals on the basis of their phenotypes, or the phenotypes of their offspring. The selected individuals can then, for example, be subjected to further testing or become parents of future generations. It is beneficial for some breeding programs to have predictions of performance before phenotypes are generated for a certain individual or when only a few phenotypic records have been obtained for that individual.
Key limitations of methods for plant and animal improvement that rely only on phenotypic selection are the cost and time required to generate phenotypic data, and the strong impact of the environment (e.g., temperature, management, soil conditions, daylight, irrigation conditions) on the expression of the target features.
Recently, the development of molecular genetic markers has opened the possibility of using DNA-based features of plants or animals in addition to their phenotypes, environmental information, and other types of features to accomplish many tasks, including the tasks described above.
Some important considerations for a data analysis method for this type of dataset are the ability to mine historical data, to be robust to multicollinearity, and to account for interactions between the features included in these datasets (e.g., epistatic effects and genotype by environment interactions). The ability to mine historical data removes the need for highly structured data. Methods that require highly structured data from planned experiments are usually resource intensive in terms of human resources, money, and time. The strong environmental effect on the expression of many of the most important traits in economically important plants and animals requires that such experiments be large, carefully designed, and carefully controlled. Multicollinearity refers to a situation in which two or more features (or feature subsets) are linearly correlated with one another. Multicollinearity may lead to a less precise estimation of the impact of a feature (or feature subset) on a target feature and, consequently, to biased predictions.
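By way of illustration only, the loss of precision caused by multicollinearity can be quantified with the variance inflation factor (VIF), a standard regression diagnostic that the present text does not itself name. For a least-squares model with two predictors, the VIF of either predictor is 1 / (1 - r^2), where r is their Pearson correlation. The marker names and genotype codes below are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def vif_two_predictors(x, y):
    """VIF for a two-predictor least-squares model: 1 / (1 - r^2)."""
    r = pearson(x, y)
    return 1.0 / (1.0 - r * r)

# Hypothetical genotype codes (copies of one allele) for eight individuals.
marker_a = [0, 1, 2, 2, 1, 0, 2, 1]
marker_b = [0, 1, 2, 2, 1, 0, 1, 1]  # differs from marker_a in one individual
marker_c = [2, 0, 1, 0, 2, 1, 1, 0]  # only weakly related to marker_a

vif_ab = vif_two_predictors(marker_a, marker_b)
vif_ac = vif_two_predictors(marker_a, marker_c)
print(f"VIF(marker_a, marker_b) = {vif_ab:.2f}")
print(f"VIF(marker_a, marker_c) = {vif_ac:.2f}")
```

In this toy data the nearly redundant pair yields a much larger VIF than the weakly related pair, meaning the estimated effect of either redundant marker would carry a correspondingly larger sampling variance.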
A framework based on mining association rules, and on using features created from these rules to improve prediction or classification, is suitable to address the three considerations mentioned above. Preferred methods for classification or prediction are machine learning methods. Association rules can therefore be used for classification or prediction of one or more target features.
The approach described in the present disclosure relies on implementing one or more machine learning-based association rule mining algorithms to mine datasets containing at least one plant or animal molecular genetic marker, create features based on the association rules found, and use these features for classification or prediction of target features.
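The general flow described above (mine rules, derive features from them, pass the features to a classifier) can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of the present disclosure: the records, item names, and thresholds are hypothetical, and the search is a plain exhaustive enumeration rather than any particular association rule mining algorithm.

```python
from itertools import combinations

# Hypothetical toy records: two marker genotypes, one environment item,
# and one target class per individual.
records = [
    {"M1=AA", "M2=BB", "env=dry", "yield=high"},
    {"M1=AA", "M2=BB", "env=wet", "yield=high"},
    {"M1=AA", "M2=AB", "env=dry", "yield=low"},
    {"M1=AB", "M2=BB", "env=dry", "yield=low"},
    {"M1=AA", "M2=BB", "env=dry", "yield=high"},
    {"M1=AB", "M2=AB", "env=wet", "yield=low"},
]

def mine_rules(records, target, min_support=0.3, min_confidence=0.8, max_len=2):
    """Enumerate rules `antecedent -> target` by brute force (no pruning)."""
    n = len(records)
    target_prefix = target.split("=")[0] + "="
    # Candidate antecedent items: everything except values of the target feature.
    items = sorted({i for r in records for i in r if not i.startswith(target_prefix)})
    rules = []
    for k in range(1, max_len + 1):
        for antecedent in combinations(items, k):
            covered = [r for r in records if set(antecedent) <= r]
            if not covered:
                continue
            hits = sum(1 for r in covered if target in r)
            support = hits / n                # share of all records matching the rule
            confidence = hits / len(covered)  # share of covered records with the target
            if support >= min_support and confidence >= min_confidence:
                rules.append((antecedent, support, confidence))
    return rules

def rule_features(record, rules):
    """One binary feature per rule: 1 if the record satisfies the antecedent."""
    return [int(set(antecedent) <= record) for antecedent, _, _ in rules]

rules = mine_rules(records, "yield=high")
features = [rule_features(r, rules) for r in records]
print(rules)
print(features)
```

With these thresholds, the only rule retained in the toy data is the two-marker antecedent ("M1=AA", "M2=BB"), since neither marker alone reaches the confidence threshold. This is the kind of feature interaction such a framework is meant to capture. The resulting binary feature columns could then be supplied, alone or alongside the original features, to any classifier or predictor.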