Traditional test calibration methods are limited by the amount of information shared between task authors and psychometricians. Typically, psychometricians have access to very little data about how the tasks were authored and provide very little feedback to the authors about which tasks are most effective.
When an educational assessment is given in multiple forms—that is, when different examinees receive different sets of tasks, yet the scores for each set are expected to be comparable—a mechanism for equating those scores is required. Conventional assessment techniques calibrate the statistical model used to score each task so that all tasks are scaled to a common set of dimensions. Although expert opinion often closely approximates relative difficulty, pretesting the tasks is essential to discovering how difficult the tasks are in practice. Pretesting not only corrects inaccurate expert estimates but also reveals surprising tasks that do not align with the statistical model.
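By way of illustration only, the following sketch shows one simplified, hypothetical way pretest data might be used to place tasks on a common scale: each task's difficulty is taken as the negative log-odds of its pretest success rate, centered so that all tasks share a common zero point. This is a minimal proportion-correct calibration for exposition, not the calibration method of any particular assessment system.

```python
import math

def estimate_difficulties(responses):
    """Place pretested tasks on a common logit scale (illustrative sketch).

    responses[i][j] is 1 if pretester i answered task j correctly, else 0.
    Difficulty is the negative log-odds of the pretest success rate,
    centered so the set of tasks has mean difficulty zero.
    """
    n_examinees = len(responses)
    n_tasks = len(responses[0])
    difficulties = []
    for j in range(n_tasks):
        p = sum(r[j] for r in responses) / n_examinees
        p = min(max(p, 1e-6), 1 - 1e-6)  # guard against 0% or 100% rates
        difficulties.append(-math.log(p / (1 - p)))
    mean_b = sum(difficulties) / n_tasks
    return [b - mean_b for b in difficulties]
```

Because the difficulties are expressed on a shared, centered scale, estimates for tasks drawn from different forms can be compared directly, which is the essence of equating.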
Four sources of information are typically available when calibrating tasks: (i) expert opinions as to a task's difficulty and the extent to which the task draws on various knowledge, skills, and abilities of an examinee; (ii) pretest data from pretesters exposed to the task; (iii) the similarity of the task to other tasks with known or partially known parameters; and (iv) features of the task that are known to affect difficulty (known as radicals). While pretest data come only from field testing of a task, the other three sources of information can be gathered during the assessment design process.
Calibration is also important in checking a theoretical measurement model. For example, it may be discovered during calibration that a particular task does not perform as expected. The task may be harder or easier than predicted, may have a different evidentiary focus (i.e., test a different set of skills), or may have undesirable characteristics, such as being non-monotonic in one or more skills or having markedly different evidential properties for different sub-populations. Traditional testing procedures use pretesting to expose such tasks, but they do not analyze the characteristics of those tasks to improve future task design.
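One such model check, non-monotonicity, can be sketched as follows: pretesters are grouped by total score, and a well-behaved task should be answered correctly at a non-decreasing rate as overall ability rises. The function name and equal-size grouping here are illustrative assumptions, not part of any existing procedure.

```python
def flag_nonmonotonic(responses, task_index, n_groups=3):
    """Flag a task whose success rate does not rise with overall ability.

    responses[i][j] is 1 if pretester i answered task j correctly, else 0.
    Pretesters are sorted by total score and split into n_groups ability
    groups; the task is flagged if its per-group success rate ever drops
    from one ability group to the next (an illustrative misfit check).
    """
    scored = sorted(responses, key=lambda r: sum(r))
    size = len(scored) // n_groups
    rates = []
    for g in range(n_groups):
        # last group absorbs any remainder so every pretester is counted
        group = scored[g * size:(g + 1) * size] if g < n_groups - 1 else scored[g * size:]
        rates.append(sum(r[task_index] for r in group) / len(group))
    return any(rates[g] > rates[g + 1] for g in range(n_groups - 1))
```

A flagged task is a candidate for the kind of follow-up analysis described above: its radicals and evidentiary focus can be examined to explain the misfit and to inform future task design.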
What is needed is a method of using previously authored tasks to determine the likely performance of future-designed tasks.
A further need exists for a method of calibrating differing sets of tasks to a common score range.
A further need exists for a method of assessing the difficulty of a particular task across sub-populations to determine a fairness level for the task or a difficulty level for each sub-population.
A further need exists for a method of improving the design of future tasks based on information contained in current tasks.
The present invention is directed towards solving one or more of the problems listed above.