There are numerous methods of estimating population proportions for polls. In 1961, James and Stein derived estimators of population means that are more efficient than the corresponding traditional estimators. Each estimator is a linear combination of the mean of an individual sample and the overall mean of that sample aggregated with two or more other samples from possibly different populations. The weight applied to each individual sample mean lies in the interval from 0 to 1 and is called a shrinkage coefficient.
Commenting on the empirical Bayesian treatment of James-Stein estimators by Efron and Morris (1973), Stigler (1983, 1990) showed that the shrinkage coefficient was an estimator of the squared correlation coefficient in the regression of population on sample means. Fienberg and Holland (1973) extended the empirical Bayesian treatment of James-Stein estimators to single-sample population proportions, with the expected increase in efficiency.
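The shrinkage idea described above can be illustrated with a minimal sketch. It assumes the commonly used positive-part form that shrinks toward the grand mean with a known, common sampling variance. That form needs at least four sample means, and it is not necessarily the exact estimator used in any of the cited works. The function name, argument names, and example values are hypothetical.

```python
def james_stein_toward_grand_mean(sample_means, sampling_variance):
    """Return (shrinkage_coefficient, shrunken_estimates).

    Assumes every sample mean has the same known sampling variance.
    """
    k = len(sample_means)
    if k < 4:
        raise ValueError("this grand-mean form needs at least four sample means")

    grand_mean = sum(sample_means) / k
    # Sum of squared deviations of the sample means from the grand mean.
    spread = sum((m - grand_mean) ** 2 for m in sample_means)
    if spread == 0.0:
        # All means coincide, so every estimate equals the grand mean.
        return 0.0, [grand_mean] * k

    # Weight on each individual sample mean, clipped at 0 so it stays in [0, 1].
    c = max(0.0, 1.0 - (k - 3) * sampling_variance / spread)
    estimates = [grand_mean + c * (m - grand_mean) for m in sample_means]
    return c, estimates


if __name__ == "__main__":
    # Hypothetical poll proportions from five samples.
    c, shrunk = james_stein_toward_grand_mean([0.2, 0.4, 0.5, 0.6, 0.8], 0.01)
    print("shrinkage coefficient:", round(c, 3))
    print("shrunken estimates:", [round(s, 3) for s in shrunk])
```

A coefficient near 1 leaves each sample mean almost unchanged, while a coefficient near 0 pulls every estimate to the grand mean.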
Likewise, there are numerous methods of testing the ability of a subject and/or the difficulty of a task. Testing methods originally tended to focus on the total test score. Over time testing methods have developed to include a focus on individual responses.
As the focus on total test scores in classical test theory shifted to individual item responses in modern test theory, the models underlying the theories changed correspondingly from measurement and estimation to probabilistic models. In the measurement model of classical test theory, an observed test score (X) differs from a true test score (T) by error (E): X=T+E. Measurement error (E) formally disappears in modern test theory, where the concept of uncertainty expressed by response probability replaces the concept of imprecision expressed by measurement error. In modern test theory, the probability of a correct response to an item is a function of an examinee's ability (θ) and the item's difficulty (b), as well as possibly other item parameters such as the discrimination parameter (a) and, for multiple-choice items, the guessing-rate parameter (c): P(θ, a, b, c). While probabilistic models involving all three item parameters are popular because of their promise of optimal fit, many test developers use the statistically simpler single-parameter logistic model introduced by Rasch (1960):
$$P(\theta; b) = \frac{1}{1 + e^{-(\theta - b)}} \qquad (1)$$
the graph of which is an ogive curve centered at b on the θ scale.
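As a minimal sketch, Equation (1) can be evaluated directly. The function name `rasch_probability` and the sample ability values are hypothetical and chosen only for illustration. The sketch does not guard against overflow for extremely negative values of θ - b.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response under Equation (1).

    theta: examinee ability; b: item difficulty (same scale as theta).
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


if __name__ == "__main__":
    difficulty = 1.0  # hypothetical item difficulty
    for ability in (-1.0, 0.0, 1.0, 2.0, 3.0):
        p = rasch_probability(ability, difficulty)
        print(f"theta={ability:+.1f}  P={p:.3f}")
```

When ability equals difficulty the probability is exactly one half, which is the sense in which the ogive is centered at b.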
Prior to Rasch (1960), Birnbaum (1958) introduced logistic item response models, his two-parameter version involving both the location (difficulty) parameter b and the slope (discrimination) parameter a:
$$P(\theta; a, b) = \frac{1}{1 + e^{-a(\theta - b)}} \qquad (2)$$
Like Equation (1), the graph of Equation (2) is an ogive on the θ scale centered at b; but, different from Equation (1), the slope of the ogive may vary depending on the value of the parameter a, which, in the context of the relationship between measurement error and response probability, is the focus of this disclosure.
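The effect of the parameter a on the slope can be checked numerically. The sketch below evaluates Equation (2) and approximates the slope of the ogive at θ = b with a central difference. Differentiating Equation (2) gives a slope of a·P·(1 − P), which equals a/4 at the center, where P = 1/2. The function names and the sample values of a are hypothetical.

```python
import math


def two_parameter_probability(theta, a, b):
    """Probability of a correct response under Equation (2)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def slope_at_center(a, b, step=1e-5):
    """Central-difference approximation of dP/dtheta at theta = b."""
    upper = two_parameter_probability(b + step, a, b)
    lower = two_parameter_probability(b - step, a, b)
    return (upper - lower) / (2.0 * step)


if __name__ == "__main__":
    difficulty = 0.0  # hypothetical item difficulty
    for discrimination in (0.5, 1.0, 2.0):
        slope = slope_at_center(discrimination, difficulty)
        print(f"a={discrimination:.1f}  slope at b ~ {slope:.4f}  a/4={discrimination / 4:.4f}")
```

With a = 1 the function reduces to Equation (1), so the single-parameter model is the special case in which every item has the same slope.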
Some test developers may consider the single-parameter model described by Equation (1) as unnecessarily limited in its data-fitting ability in contrast to alternatively available two- or three-parameter models. Yet, studies show that the Rasch model may fit data at least as well as its multiple-parameter counterparts (e.g., Forsyth, Saisangjan, & Gilmer, 1981). Thissen (1982), in particular, showed that the addition of the parameter a to the Rasch model may fail to improve model fit significantly. Because other studies may show otherwise (e.g., DeMars, 2001, and Stone & Yumoto, 2004), some test developers who favor the Rasch model might still wish that a single-parameter logistic model could accommodate differences in item discrimination, as well as item difficulty. At the same time, the allowance of differences in item discrimination to affect the estimation of θ values may disturb other supporters of the Rasch model because they believe it unfair to weight responses to items differently, at least without informing test-takers. That concern, however, cannot justify counting clearly less and more discriminating items equally in scoring, particularly when the results of equal and appropriate unequal weighting of item responses differ substantially.
Different from the development here, involving estimation of item discrimination from data, Verhelst and Glas (1995) introduced the discrimination parameter a_i into the Rasch model as an unestimated constant to account for varying item discrimination. Because it explicitly lacked the unweighted-scores property of the Rasch model, they also referred to their model simply as a single-parameter logistic model. Weitzman (1996) used an adjustment of fxiq like piq (see below, Detailed Description of Invention) to enable the Rasch model to account for guessing, but that adjustment required the assumption that the guessing rate was constant over items. Weitzman (2009) provides the original account of the invention described here.
Generally, single-parameter models do not tend to account for item discrimination, which is how well an item measures what it is supposed to measure. Single-parameter models do, however, lead to accurate equating of different test forms. Two- and three-parameter models tend to account for item discrimination but lead to inaccurate equating of different test forms. Accordingly, a need exists for improved test modeling.