1. Field of the Invention
The present invention relates to a process of assessing the performance of crop varieties based on wide-area performance testing data. A process of the present invention compares varietal performance using spatial estimation and spatial prediction based on a statistical mixed effects model.
2. Description of Related Art
In the development of a new crop variety, performance data are collected on the variety and on other competing varieties. These performance data include measurements on various agronomic traits relevant to the given crop; e.g., for Zea mays, measurements taken on grain yield, grain moisture, and plant lodging.
In assessing the potential commercial value of a new crop variety (hereafter referred to as "variety"), its agronomic performance is compared to the agronomic performance of other varieties. The other comparison varieties include commercial and pre-commercial varieties from the company developing the variety and commercial varieties from competitor companies. Note that this same type of assessment is also performed on existing commercial varieties, to determine if they should remain on the market or be replaced by newer varieties in development.
Agronomic performance data for the new variety and for the comparison varieties come from multiple testing locations. The testing locations are usually widely distributed over the area of adaptation of the varieties included in the test. The area of adaptation covered by these testing locations is typically quite large, on the scale of hundreds of square miles. For example, a new Zea mays cultivar may be tested from western Iowa to eastern Michigan and from central Wisconsin to southern Illinois.
Due to variation in testing programs, the data for a given variety and its competitors tend to be quite 'unbalanced' in the sense that not all of the given set of varieties appear at all testing locations. Considering the testing data for a single pair of varieties, i.e., the new variety and a single competitor, some of the testing locations will have both of the varieties, while the rest will have only one of the two varieties.
These performance data are analyzed in order to determine the geographic regions over which the new variety has large enough performance advantages relative to the comparison varieties to justify its introduction to the market in those regions. Ideally, the variety under consideration will have a significant performance advantage relative to all of the comparison varieties over its entire area of testing. However, in some cases a variety may have performance advantages only on a regional basis, but it could still serve a significant market need within that region. Thus, it is important to characterize the performance of the given variety relative to other comparison varieties not only over the entire area where it was tested, but also within the various regions.
The variations in relative performance (performance difference) of two varieties in different geographic locations or regions arise from what is referred to as 'genotype by environment interaction' (Sprague and Eberhart, 1976). Genotype by environment interaction is caused by differential responses of varieties to environmental conditions. These environmental conditions may include, for example, day length, temperature, soil moisture, disease and insect pressure. Note that the term 'environments' can refer to different locations in a given year or different years for a given location or some combination of locations and years.
Methods involving traditional statistical analyses for varietal performance assessment are described in Bradley et al. (1988). These traditional methods are usually based on "location-matched" data, i.e., for a given variety relative to a comparison variety, data from only the testing locations where both varieties co-occur are used in the analysis. A paired t-test is used to test the hypothesis of no difference in performance of the two varieties. Moreover, for inferences regarding relative performance in a given geographic region, data from the testing locations only within that region are used in the analysis.
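The location-matched analysis described above can be illustrated with a minimal sketch. The location names and yield values below are hypothetical and serve only to show how unmatched locations drop out of a paired t-test; the sketch is not taken from Bradley et al.

```python
import math
import statistics

# Hypothetical yields keyed by testing location (illustrative values only).
head = {"loc_a": 182.0, "loc_b": 175.5, "loc_c": 190.2, "loc_d": 168.4, "loc_e": 171.0}
competitor = {"loc_a": 178.1, "loc_b": 176.0, "loc_c": 185.9, "loc_f": 169.3}

def paired_t(head, competitor):
    """Paired t statistic using only locations where both varieties occur."""
    matched = sorted(set(head) & set(competitor))
    diffs = [head[loc] - competitor[loc] for loc in matched]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return matched, mean_diff, mean_diff / std_err, n - 1

matched, mean_diff, t_stat, dof = paired_t(head, competitor)
# Locations carrying only one of the two varieties contribute nothing to the test.
dropped = sorted((set(head) | set(competitor)) - set(matched))
print(f"matched={matched} dropped={dropped}")
print(f"mean difference={mean_diff:.2f}, t={t_stat:.2f}, df={dof}")
```

In this toy data set three of the six locations are discarded, which is the loss of information that the traditional analysis incurs with unbalanced data.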
Traditional analysis using the t-test for location-matched data is inefficient for at least five reasons. First, it does not use all of the data; it only uses data from testing locations where both of the varieties co-occur. Second, for regional performance comparisons, it does not use data from nearby areas outside the region of interest. Third, it does not make use of covariates related to the performance trait of interest (e.g., irrigation or soil series) to help explain and predict the differences. Fourth, it only uses within-year data in the analysis model; more robust inferences can be accomplished by having a model that uses data from multiple years. Fifth, it is based on an incorrect assumption that the observations from one testing location are independent of those from other locations.
There are two broad reasons why the inefficiencies, listed above, are limitations in wide-area crop assessment. First is efficient use of the data. It is natural for the experimenter to want to use all of the data when making inferences. In the present scenario, this includes data on a variety at a location where the other variety does not occur, it includes data from areas that are in proximity to the region of interest, it includes data on covariates, and it includes data across years. Statistical methods should strive to make full use of all available information. The second reason that the traditional analysis is limited is that it is based on the classical assumption that data are independent. Varietal performance data almost invariably violate the assumption of independence and render the statistical inference invalid, typically causing one to infer that variety differences exist when the data do not really justify it.
A brief review of current literature also highlights underlying deficiencies in the art, which the present invention strives to solve.
One of the most relevant papers and probably the most noteworthy work in the area of application of spatial statistics to field plot experiments is by Zimmerman and Harville (1991), where the authors have introduced the so-called random field linear model (RFLM) by considering the observations as a realization of a random field. In this model the trend is modeled by a mean structure and the small scale dependence is modeled through spatial autocorrelation structures. The parameter estimation is done through a likelihood approach. Through real data analysis, the authors have tried to demonstrate the superiority of their model over the nearest neighbor analysis (NNA) approach. Note that their study is exclusively in the context of small area estimation where the range of spatial dependence is confined to a testing location.
Another noteworthy paper in the context of the use of covariate in spatial prediction is by Gotway and Hartford (1996), where the authors have presented the use of auxiliary or secondary variable(s) in spatial prediction by applying cokriging to predict soil nitrate level with data on grain yield as a covariate. Through an application of their method to data from a test site, they have demonstrated the benefit of their method over the more traditional external drift method. Again the scope of their study is limited to intra-site prediction.
One of the recent papers that deal with multi-location yield trials is by Cullis et al. (1998). In this paper the authors have proposed a method for spatial analysis for multi-environment early generation variety trials. The method uses best linear unbiased prediction (BLUP) for genotype effect and genotype by environment interaction effect and REML for the spatial parameters and variance components. However, the proposed method is based on separately modeling the covariance structure for each trial, i.e., no across-trial correlation is taken into consideration.
Yost, Uehara and Fox (1982a and 1982b) were among the first researchers in agricultural sciences to publish in the area of application of geostatistics to soil chemistry over large land area. In two consecutive publications they reported results of studies on spatial prediction of soil chemistry across the island of Hawaii. However, their study was limited to kriging only and did not consider any aspect of cokriging. Moreover, no variables other than selected soil chemistry variables were used in their studies. Another similar but relatively more recent work in the area of soil science that is worth mentioning is by Ovalles and Collins (1988). The authors used universal kriging to study spatial variation of selected soil properties in the entire northwest Florida covering an approximate area of 380 km by 100 km with a reported auto-correlation range of approximately 40 km. No spatial estimation or cokriging was attempted in their study.
One of the papers that deal with spatial prediction of crop yield is by Bhatti, Mulla and Frazier (1991). The authors used an experimental field with approximate dimension of 655 meters by 366 meters to study wheat yield along with soil organic matter and soil phosphorus content. Kriging and cokriging were used to predict yield. However, the study was limited to a single experimental field, and therefore it did not contain any aspect of modeling of large scale trend that usually exists in wide-area testing. Another paper in the area of spatial analysis of crop yield is by Brownie, Bowman, and Burton (1993). In this paper, the authors have compared three alternative spatial methods: trend analysis, nearest neighbor Papadakis analysis, and correlated error analysis to study spatial variation in yield data on corn (Zea mays) and soybean (Glycine max). As in the case of other existing studies on crop yield, their study is also intra-site, i.e., it does not consider spatial correlation across experimental sites. In fact, the authors, as a concluding remark in their paper, have noted that no across-sites spatial analysis, where data from multiple locations are combined for analysis, exists in the literature with respect to data on crop yield.
Other papers related to spatial analysis of crop yield also exist in the literature. Bhatti, Mulla, Kooehler and Gurmani (1991) have used the semi-variogram to identify spatial autocorrelation in crop yield. They show the effectiveness of NNA in removing spatial variability by studying the semi-variogram before and after the application of NNA. The scope of their study is again limited to intra-site analysis with the maximum range of spatial autocorrelation being approximately 20 meters. Moreover, no aspect of spatial estimation or prediction is covered in their study. Wu et al. (1997) have compared the so-called first difference with errors in variables (FD-EV) method (Besag and Kempton, 1986) to the more traditional Papadakis nearest neighbor method and classical randomized complete block (RCB) analysis in terms of elimination of spatial variation in yield data from cereal breeding trials. However, their approach does not require a pre-specified model for the trend and the spatial autocorrelation structure. Moreover, their study is confined to only intra-site spatial variation.
Stroup, Baenziger and Mulitze (1994) have used data from breeding nurseries to compare the traditional RCB analysis, NNA, and the random field linear model analysis (Zimmerman and Harville, 1991) in terms of comparison of treatment effects through effective removal of noise due to spatial variability. Naturally, their study of spatial correlation is limited to data within each nursery.
A paper by Gotway and Stroup (1997) is unique in that the authors have extended the theory of generalized linear model to include spatial estimation and prediction of discrete and categorical spatial variables. They have applied their extensions to two data sets, one on plant damage due to insects and the other on weed count. However, as in the case of all other studies, the scope of their study and its application are limited to data within each experimental site.
In the above paragraphs, a review of the literature that currently exists in areas relevant to the present invention has been presented. The ensemble of research work in these areas can be broadly classified into two categories: (a) application of geostatistics and spatial statistics to areas of soil sciences in wide and small area testing, and (b) application of geostatistics and spatial statistics to crop response analysis to intra-field small area testing. The existing literature lacks the presence of research in the area of crop response analysis (e.g., Zea mays grain yield) in the context of wide area testing where the range of spatial correlation extends beyond individual experimental sites.
In the assessment of performance of crop varieties, it is essential that the conclusion be drawn across environments, i.e., across broad geographic regions covering multiple test sites. In the literature review, it should also be noticed that the existing literature does not address the use of both spatial estimation and spatial prediction in any study. In contrast, and as will be discussed in detail below, the current invention concerns a novel approach to the problem of variety assessment in that not only is it based on multi-environment wide area testing, it also has two components to answer two distinct questions: (a) an estimation component to answer the question of long term average performance of a variety or performance difference of two varieties, and (b) a prediction component to answer the question of performance of a variety or performance difference of two varieties at given points in time (year), and/or at given points in space (geography). The methodology behind the present invention takes into account large scale trend through universal kriging and universal block kriging and readily incorporates use of covariates through cokriging. The current literature lacks any work that combines all of the above features into a unified approach for the study of performance of crop varieties.
Problems enumerated above are not intended to be exhaustive, but rather are among many that tend to impair the effectiveness of previously known techniques concerning crop performance analysis. Other noteworthy problems may also exist; however, those presented above should be sufficient to demonstrate that methodologies appearing in the art have not been altogether satisfactory.
Embodiments of the present invention employ a statistical model called a linear mixed model along with geostatistical methods to assess the performance of a crop variety from wide-area testing data. Crop performance may be assessed by measuring commercially important traits such as (but not limited to) yield, grain moisture, and plant lodging. In addition to the presence of variety main effect and the variety specific trend components as fixed effects, the mixed model that is employed in the present invention also allows the use of covariates such as year, soil type, irrigation, etc., as fixed effects in the model. Furthermore, it allows the use of random effects such as testing location, that can help explain the variation in crop performance. The residual variation that is not explained by the fixed and the random effects is modeled using geostatistical methods. The geostatistical models take into account the spatial auto-correlation in the data and allow valid confidence intervals to be obtained to assess uncertainty in the estimates and predictions.
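For orientation, a linear mixed model of the kind described above is commonly written in the following generic form. The symbols are assumptions introduced here for illustration and are not notation taken from this disclosure: y is the vector of performance trait observations, X carries the fixed effects (variety main effect, variety specific trend terms, and covariates), Z carries the random effects such as testing location, and R(θ) is a spatial covariance matrix built from the distances between testing locations.

```latex
\begin{aligned}
y &= X\beta + Zu + \varepsilon, \\
u &\sim N(0,\, G), \qquad \varepsilon \sim N\bigl(0,\, R(\theta)\bigr), \\
\operatorname{Var}(y) &= V(\theta) = Z G Z^{\top} + R(\theta).
\end{aligned}
```

Under this reading, the geostatistical component enters through R(θ): residuals at nearby locations are allowed to be correlated rather than being treated as independent.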
Embodiments of the present invention have two distinct components: spatial estimation, and spatial prediction. The estimation component may be used as follows.
(1) Point estimation: estimate the long-term expected performance of a variety or performance difference between varieties and compute the associated standard errors at each of a plurality of spatial locations. Use these point estimates to construct a surface for performance or performance difference over a wide geographical area (e.g., collection of counties or states).
(2) Block estimation: estimate the long-term expected performance of a variety or performance difference between varieties and compute the associated standard errors over each of one or more given geographical areas such as market districts.
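The relationship between point estimation and block estimation can be sketched as follows. This is a minimal illustration under stated assumptions: the trend is taken to be linear in the two spatial coordinates, the block is approximated by a grid of points, and the coefficient values, covariance matrix, and function names are all hypothetical.

```python
import numpy as np

def trend_row(x, y):
    """Design row for an assumed linear trend surface: intercept, easting, northing."""
    return np.array([1.0, x, y])

def point_estimate(beta, cov_beta, x, y):
    """Estimate and standard error of the trend surface at one location."""
    row = trend_row(x, y)
    return float(row @ beta), float(np.sqrt(row @ cov_beta @ row))

def block_estimate(beta, cov_beta, points):
    """Average the design rows over the block, then apply the same linear estimator."""
    mean_row = np.mean([trend_row(x, y) for x, y in points], axis=0)
    return float(mean_row @ beta), float(np.sqrt(mean_row @ cov_beta @ mean_row))

# Hypothetical fitted coefficients and their covariance matrix (illustrative numbers).
beta = np.array([5.0, 0.10, -0.05])
cov_beta = np.diag([0.04, 0.0001, 0.0001])

# A hypothetical market district discretized as a 3 x 3 grid of points.
district = [(x, y) for x in (10.0, 20.0, 30.0) for y in (0.0, 10.0, 20.0)]

block_value, block_se = block_estimate(beta, cov_beta, district)
point_values = [point_estimate(beta, cov_beta, x, y)[0] for x, y in district]
print(block_value, block_se, float(np.mean(point_values)))
```

Because the estimator is linear, the block estimate equals the average of the point estimates, while the block standard error is computed from the averaged design row so that the correlation among the point estimates is respected.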
The prediction component may be used as follows.
(1) Point prediction: predict the average performance of a variety or performance difference between varieties and compute the associated standard errors at each of a plurality of spatial locations and at each of one or more given time periods (years). Use these point predictions to construct a surface for performance or performance difference over a wide geographical area (e.g., collection of counties or states).
(2) Block prediction: predict the average performance of a variety or performance difference between varieties and compute the associated standard errors over each of one or more given geographical areas such as market districts and at each of one or more given time periods (years).
Estimation and prediction of varietal performance, as described above, are required in decision making for selection of newly developed varieties which have the best performance compared to other varieties. Decisions need to be made about whether to bring a candidate variety to commercial status, and if so, to position it in appropriate geographies where it will perform well against its competitors.
From a marketing standpoint, quantitative assessment of relative performance of a variety is required for two distinct reasons: first, for introducing a new variety into the market by advancing the variety to commercial status, and second, for decision making on replacement of an existing commercial variety by a new variety that shows better relative performance.
One of the most important criteria on which the above decisions are based is the expected long-term performance advantage of a variety. This is required for assessing the commercial value of the variety in the marketplace where the value of a variety is measured over its entire lifetime on the market. The estimation component of the present invention provides an answer to this question through an assessment of the long-term relative performance of a variety at a given location or over a given geographic region.
The prediction component of the present invention allows assessment of the performance of a variety and the performance difference between varieties in a given year and at a given location or over a given geographic region. This space- and/or time-specific performance assessment is necessary in determining the consistency of performance or relative performance of a variety across space and/or time. Consistency of performance, which is called "variety stability" in plant breeding terminology, is a very desirable attribute for a commercial variety.
Advantages of the present invention over traditional methods of wide-area testing include the following. First, it does not require that the data be location-matched (each location in the data set contains observations on both varieties); instead, it uses data from all testing locations having at least one of the varieties under consideration. Second, for regional performance assessment, it is not restricted to data from within the region. Instead, it also uses data from test locations in surrounding regions. Third, it readily incorporates information on covariates, which are related to the primary trait of interest (e.g., data on soil series when yield is the primary trait of interest). Fourth, it uses a model that can accommodate multiple year data by incorporating year as a model factor. Fifth, it does not assume independence of observations coming from different testing locations. Instead, it utilizes the spatial dependence between the testing locations to provide improved statistical inferences.
Steps in one embodiment of the present invention may be summarized as follows:
1. Construct a database of wide-area performance of crops that includes names and spatial coordinates of testing locations, geographic areas in which the testing locations reside, names and performance trait values of a number of varieties, and names and values of the covariates.
2. Select two varieties for comparison, e.g., a 'head' variety under consideration for advancement to commercial status and a 'competitor' variety, which is already in the market.
3. Remove outliers from the data on the two varieties by visual inspection and by using statistical tests based on the hat matrix and Cook's D.
4. Choose a spatial covariance model for the linear mixed model.
5. Estimate the spatial covariance parameters and other fixed and random effect parameters of the linear mixed model.
6. Use the estimated parameters to
(a) estimate the across-years average or predict the yearly,
(b) continuous surface, or block averages over geographical regions,
(c) for a performance trait of each variety, or the difference in the performance trait between varieties, and
(d) obtain standard errors of estimates or predictions.
7. Use the estimates and the predictions, along with their standard errors, for assessment of variety performance, e.g., for the assessment of the relative performance of the head variety when deciding on its advancement to commercial status.
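The outlier screening of step 3 can be sketched as follows. This is a minimal illustration that assumes an ordinary least squares fit of a straight-line trend; the data values, the function name, and the 4/n cutoff (a common rule of thumb) are assumptions for illustration and are not specified in this disclosure.

```python
import numpy as np

def leverage_and_cooks_d(X, y):
    """Return the hat-matrix diagonal and Cook's distance for a least squares fit."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    # Diagonal of the hat matrix H = X (X'X)^-1 X'.
    hat_diag = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)  # residual mean square
    cooks_d = resid**2 / (p * s2) * hat_diag / (1.0 - hat_diag) ** 2
    return hat_diag, cooks_d

# Hypothetical performance differences along one spatial coordinate;
# the last observation is a planted outlier.
coord = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
diff = np.array([1.0, 1.2, 1.1, 1.4, 1.3, 1.6, 1.5, 6.0])
X = np.column_stack([np.ones_like(coord), coord])

hat_diag, cooks_d = leverage_and_cooks_d(X, diff)
threshold = 4.0 / len(diff)  # rule-of-thumb cutoff, an assumption
flagged = np.where(cooks_d > threshold)[0]
print("flagged observations:", flagged)
```

Observations flagged in this way would still be reviewed by visual inspection, as step 3 indicates, before being removed.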
In one respect, the invention is a method for assessing wide-area performance of a crop variety using a linear mixed model that incorporates geostatistical components and includes parameters for fixed effects, random effects and covariances. By "wide area," it is simply meant the putative area of a tested variety. For instance, by "wide area," it may be meant the area of adaptation of a tested variety, which may include many testing locations. By "crop variety," it is meant a cultivar of a given plant species and any other usage as is known in the art. A wide-area database is constructed that includes spatial coordinates of testing locations of one or more crop varieties and performance trait values of one or more crop varieties. By "database," it is meant any collection of data. For instance, a "database" may refer to an electronic collection of data that is searchable. By "performance trait value," it is meant any agronomic trait of interest associated with a particular variety. For instance, "performance trait value" may refer to any number of traits known in the art. For example, it may refer to grain yield, grain moisture, or plant lodging of, for instance, Zea mays. The parameters for the fixed effects, random effects and covariances are estimated by fitting the linear mixed model with data in the wide-area database. Long-term expected performance of the crop variety is estimated for each of one or more given spatial locations using the parameter estimates. By "estimating long-term expected performance," it is meant estimating a value of an agronomic trait of a variety as that phrase is known in the art. This estimate, by definition not being dependent on any given time period, can still be used to characterize expected performance over a commercially-relevant time period, e.g., the time the variety is on the market.
'Time period' may refer to, for example, a year, a collection of years, or any other period.
In other respects, the database may also include covariate data. The estimating long-term expected performance may include estimating long-term expected performance differences between the crop variety and another crop variety. The estimating the parameters may include a method of restricted maximum likelihood. The estimating long-term expected performance may include a method of generalized least squares. The method may also include removing data from the database using a method of leverage or Cook's Distance prior to estimating the parameters. The method may also include calculating a standard error associated with the long-term expected performance. The method may also include forming an output of the long-term expected performance. The output may include text output. The output may include graphical output. The graphical output may include a contour plot representing a continuous surface of long-term expected performance.
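As a point of reference, the textbook generalized least squares estimator that such an estimation step may rely on is shown below. The notation is assumed for illustration: X is the fixed-effects design matrix, y is the vector of observed trait values, V-hat is the covariance matrix of y evaluated at covariance parameters estimated by restricted maximum likelihood, and x_0 is the design row for the spatial location of interest.

```latex
\begin{aligned}
\hat{\beta} &= \bigl(X^{\top}\hat{V}^{-1}X\bigr)^{-1} X^{\top}\hat{V}^{-1} y, \\
\widehat{\operatorname{Var}}(\hat{\beta}) &= \bigl(X^{\top}\hat{V}^{-1}X\bigr)^{-1}, \\
\hat{\mu}(s_0) &= x_0^{\top}\hat{\beta}, \qquad
\operatorname{se}\bigl(\hat{\mu}(s_0)\bigr) = \sqrt{x_0^{\top}\bigl(X^{\top}\hat{V}^{-1}X\bigr)^{-1}x_0}.
\end{aligned}
```

A performance difference between two varieties can be expressed in the same way by letting x_0 be the difference of the two varieties' design rows at the location.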
In another respect, the invention is a method for assessing wide-area performance of a crop variety using a linear mixed model that incorporates geostatistical components and includes parameters for fixed effects, random effects and covariances. A wide-area database is constructed that includes spatial coordinates of testing locations of one or more crop varieties, geographic areas in which the testing locations reside, and performance trait values of one or more crop varieties. The parameters for the fixed effects, random effects and covariances are estimated by fitting the linear mixed model with data in the wide-area database. Long-term expected performance of the crop variety is estimated for each of one or more given geographic areas using the parameter estimates.
In another respect, the invention is a method for assessing wide-area performance of a crop variety using a linear mixed model that incorporates geostatistical components and includes parameters for fixed effects, random effects and covariances. A wide-area database is constructed that includes spatial coordinates of testing locations of one or more crop varieties and performance trait values of one or more crop varieties. The parameters for the fixed effects, random effects and covariances are estimated by fitting the linear mixed model with data in the wide-area database. Average performance of the crop variety is predicted for each of one or more given spatial locations and for each of one or more given time periods using the parameter estimates. By "predicting average performance," it is simply meant predicting a value of an agronomic trait of a variety for a given geographic location and for a given time period and the normal usage of that phrase as known in the art.
In other respects, the estimating the parameters may include a method of restricted maximum likelihood. The predicting average performance may include the method of universal kriging. The database may also include covariate data. The covariate data may include one or more response variables. The predicting average performance may include the method of universal cokriging. The covariate data may include only one or more fixed effects. The predicting average performance may include the method of universal kriging. The predicting average performance may include predicting average performance differences between the crop variety and another crop variety. The method may also include removing data from the database using a method of leverage or Cook's Distance prior to estimating the parameters. The method may also include calculating a standard error associated with the predicted average performance. The method may also include forming an output of the predicted average performance.
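For reference, the universal kriging predictor is usually stated in the textbook form below. The notation is assumed for illustration: Σ is the covariance matrix of the observations y, c_0 is the vector of covariances between the observations and the value at the prediction location s_0, C(0) is the variance at zero distance, x_0 is the trend design row at s_0, and β-hat is the generalized least squares estimate of the trend coefficients.

```latex
\begin{aligned}
\hat{y}(s_0) &= x_0^{\top}\hat{\beta} + c_0^{\top}\Sigma^{-1}\bigl(y - X\hat{\beta}\bigr), \\
\sigma^{2}(s_0) &= C(0) - c_0^{\top}\Sigma^{-1}c_0
  + \bigl(x_0 - X^{\top}\Sigma^{-1}c_0\bigr)^{\top}
    \bigl(X^{\top}\Sigma^{-1}X\bigr)^{-1}
    \bigl(x_0 - X^{\top}\Sigma^{-1}c_0\bigr).
\end{aligned}
```

The first term of the predictor is the estimated trend, and the second term adjusts it using spatially correlated residuals from nearby testing locations; the square root of the prediction variance gives the standard error of the prediction.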
In another respect, the invention is a method for assessing wide-area performance of a crop variety using a linear mixed model that incorporates geostatistical components and includes parameters for fixed effects, random effects and covariances. A wide-area database is constructed that includes spatial coordinates of testing locations of one or more crop varieties, geographic areas in which the testing locations reside, and performance trait values of one or more crop varieties. The parameters for the fixed effects, random effects and covariances are estimated by fitting the linear mixed model with data in the wide-area database. Average performance of the crop variety for each of one or more given geographic areas and for each of one or more given time periods is predicted using the parameter estimates.
In another respect, the invention is a method of hybrid development. A hybrid is developed. Performance data for the hybrid and a comparison hybrid are obtained. A cubic polynomial surface is fitted to the performance data for each hybrid using the method of generalized least squares and modeling the residual variance using a spherical variogram. The performance of the new and comparison hybrids is compared.
In another respect, the invention is a system including a computer and a program. The program executes on the computer and includes program code for: fitting a cubic polynomial surface to the performance data for each hybrid using the method of generalized least squares; modeling the residual variance using a spherical variogram; and comparing the performance of the new and comparison hybrid.
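The program steps named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program of this disclosure: coordinates are scaled to the unit square, the nugget, partial sill, and range are treated as known rather than estimated, the simulated data use independent noise for simplicity, and all function names and numeric values are hypothetical.

```python
import numpy as np

def cubic_design(xy):
    """Ten-column design matrix for a full cubic polynomial surface in (x, y)."""
    x, y = xy[:, 0], xy[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2,
                            x**3, x**2 * y, x * y**2, y**3])

def spherical_cov(h, nugget, sill, range_par):
    """Covariance implied by a spherical variogram (nugget, partial sill, range)."""
    r = np.minimum(h / range_par, 1.0)
    cov = sill * (1.0 - 1.5 * r + 0.5 * r**3)  # zero at and beyond the range
    return np.where(h == 0.0, nugget + sill, cov)

def gls_fit(xy, y, nugget, sill, range_par):
    """Generalized least squares fit of the cubic surface under spherical covariance."""
    X = cubic_design(xy)
    h = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    V = spherical_cov(h, nugget, sill, range_par)
    cov_beta = np.linalg.inv(X.T @ np.linalg.solve(V, X))
    beta = cov_beta @ (X.T @ np.linalg.solve(V, y))
    return beta, cov_beta

# Simulated data for a new hybrid and a comparison hybrid at different locations.
rng = np.random.default_rng(0)
beta_new = np.array([10.0, 1.0, -0.5, 0.3, 0.0, 0.2, -0.1, 0.0, 0.0, 0.05])
beta_cmp = np.array([9.5, 0.8, -0.5, 0.3, 0.0, 0.2, -0.1, 0.0, 0.0, 0.05])
xy_new = rng.uniform(0.0, 1.0, size=(40, 2))
xy_cmp = rng.uniform(0.0, 1.0, size=(35, 2))
y_new = cubic_design(xy_new) @ beta_new + rng.normal(0.0, 0.3, 40)
y_cmp = cubic_design(xy_cmp) @ beta_cmp + rng.normal(0.0, 0.3, 35)

fit_new, cov_new = gls_fit(xy_new, y_new, nugget=0.2, sill=1.0, range_par=0.3)
fit_cmp, cov_cmp = gls_fit(xy_cmp, y_cmp, nugget=0.2, sill=1.0, range_par=0.3)

# Compare the two fitted surfaces at a location of interest.
site = np.array([[0.5, 0.5]])
diff_at_site = float((cubic_design(site) @ (fit_new - fit_cmp))[0])
print(f"estimated performance difference at site: {diff_at_site:.2f}")
```

Note that the two hybrids need not share testing locations in this sketch, since each surface is fitted to its own data before the surfaces are compared.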
As will be understood with the benefit of this disclosure, point prediction, block prediction, point estimation, and block estimation may be combined in any number of different permutations to obtain valuable performance assessments. All such combinations fall within the scope of this invention.