Learning often happens incrementally. At first, students may be able to recall, recognize, or name concepts. As mastery increases, they may be able to describe concepts, the properties of concepts, or relationships among concepts. Eventually, students may be able to apply concepts to novel situations, use learned material to generate new insights, or synthesize learned material. This progression is often referred to as “depth of knowledge,” meaning the depth with which students understand the material that they are taught. The specific stages and levels of depth vary across taxonomies, but the general idea is that knowledge becomes deeper and more internalized with additional mastery, which in turn allows more robust application of the knowledge.
When assessing student mastery, it is often desirable to evaluate depth of knowledge. For test developers, it can be quite difficult to write selected response items that measure deeper levels of knowledge. A selected response item is a test question, such as a multiple choice question, in which the examinee selects the correct answer from a collection of choices.
Many testing programs use constructed response items to measure content at deeper levels of knowledge. A constructed response item does not offer answer options from which to choose; instead, the examinee must construct a response.
In a typical system, each student's response is evaluated against a scoring rubric, which describes the characteristics of a response that should receive full credit. When partial credit is to be awarded, the characteristics of responses that receive some portion of the total overall score are also enumerated. For example, an item might award three points for full credit, and individually enumerate characteristics of imperfect responses that would warrant the award of two points, one point, and zero points.
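As an illustration of the rubric structure just described, the following is a minimal sketch of how a three-point rubric with enumerated partial-credit levels might be represented. The class names, the item identifier, and the descriptor wording are all hypothetical and chosen only for this example; they are not taken from any particular testing program.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ScoreLevel:
    """One score point and the response characteristics that earn it."""
    points: int
    description: str


@dataclass(frozen=True)
class Rubric:
    """A scoring rubric: one ScoreLevel per score point that can be awarded."""
    item_id: str
    levels: Tuple[ScoreLevel, ...]

    @property
    def max_points(self) -> int:
        # Full credit is the highest score point enumerated in the rubric.
        return max(level.points for level in self.levels)

    def describe(self, points: int) -> str:
        # Look up the characteristics associated with a given score point.
        for level in self.levels:
            if level.points == points:
                return level.description
        raise ValueError(f"Rubric {self.item_id} has no level for {points} points")


# Hypothetical rubric for a three-point item (descriptors are illustrative only).
example_rubric = Rubric(
    item_id="ITEM-001",
    levels=(
        ScoreLevel(3, "Correct answer with complete supporting explanation"),
        ScoreLevel(2, "Correct answer with partial explanation"),
        ScoreLevel(1, "Relevant work shown but answer incorrect"),
        ScoreLevel(0, "No relevant work or blank response"),
    ),
)
```

Under this representation, `example_rubric.max_points` gives the full-credit value, and `example_rubric.describe(2)` returns the characteristics of a two-point response.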
The scoring rubric usually goes through a refinement process called rangefinding. In this process, samples of student responses (usually from a field test) are evaluated by a committee of subject matter experts with the goal of selecting sample responses exemplifying each score point to be awarded. It is not uncommon for the scoring rubrics to be refined during this process.
Using the refined rubric, human scorers apply the scoring criteria to each examinee's response to the item. Typically, this process is monitored and managed in two ways: each scorer periodically receives a number of pre-scored papers to check whether the scorer continues to apply the rubric correctly, and a proportion of scored papers is independently scored by a second scorer to monitor the reliability with which scorers apply the rubric.
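The two monitoring checks described above can be sketched as simple agreement statistics. The example below is a sketch under stated assumptions: it uses exact agreement against pre-scored papers and Cohen's kappa for double-scored papers, which are common choices but are not named in the text. The function names and the 0.8 threshold are hypothetical.

```python
from collections import Counter
from typing import Sequence


def exact_agreement(scores_a: Sequence[int], scores_b: Sequence[int]) -> float:
    """Fraction of papers on which two sets of scores match exactly."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("Score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


def cohens_kappa(scores_a: Sequence[int], scores_b: Sequence[int]) -> float:
    """Agreement between two scorers, corrected for chance agreement."""
    n = len(scores_a)
    observed = exact_agreement(scores_a, scores_b)
    counts_a, counts_b = Counter(scores_a), Counter(scores_b)
    # Chance agreement: probability both scorers assign the same score point
    # if each scored independently according to their own score distribution.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both scorers used a single identical score point.
    return (observed - expected) / (1.0 - expected)


def passes_validity_check(scorer_scores: Sequence[int],
                          prescored_key: Sequence[int],
                          threshold: float = 0.8) -> bool:
    """Check a scorer against pre-scored papers (threshold is an assumed value)."""
    return exact_agreement(scorer_scores, prescored_key) >= threshold


# Illustrative data only: eight papers scored by a first and second scorer.
first_read = [3, 2, 2, 1, 0, 3, 1, 2]
second_read = [3, 2, 1, 1, 0, 3, 1, 3]
agreement = exact_agreement(first_read, second_read)
kappa = cohens_kappa(first_read, second_read)
```

Programs that score on an ordered scale often prefer a weighted kappa that penalizes large disagreements more than adjacent ones; the unweighted form is used here only to keep the sketch short.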
The current process has several limitations. First, it is very expensive to score constructed response items by hand, because each response must be read by one or more qualified scorers. Second, the process by which scoring rubrics are refined offers no opportunity for large-scale evaluation of the consequences of the refinements, which risks unintended consequences. Third, the process necessarily takes time, limiting the usefulness of constructed response items in online tests. For example, adaptive online tests use the scores on items administered early in the test to select the best items to administer later. Because human scoring cannot return scores while the test is in progress, constructed response items currently cannot be used to support adaptive testing.
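To make the dependence of adaptive testing on immediate scores concrete, the following is a simplified sketch of adaptive item selection. It assumes a Rasch (one-parameter logistic) model, dichotomous scores, and a fixed-step ability update; these are assumptions made for illustration, and operational adaptive tests typically use more sophisticated estimation. All names and difficulty values are hypothetical.

```python
import math
from typing import Dict


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_information(ability: float, difficulty: float) -> float:
    """Fisher information of a Rasch item at the given ability level."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


def select_next_item(ability: float, pool: Dict[str, float]) -> str:
    """Pick the unadministered item that is most informative at this ability."""
    return max(pool, key=lambda item_id: item_information(ability, pool[item_id]))


def update_ability(ability: float, score: int, step: float = 0.5) -> float:
    """Crude fixed-step update (an assumed simplification, not a real estimator)."""
    return ability + step if score == 1 else ability - step


# Hypothetical item pool mapping item identifiers to Rasch difficulties.
pool = {"A": -1.0, "B": 0.0, "C": 1.2}
ability = 0.0

first_item = select_next_item(ability, pool)
del pool[first_item]

# The next selection cannot be made until this score exists. With human
# scoring, the score for a constructed response item arrives too late.
score_on_first_item = 1
ability = update_ability(ability, score_on_first_item)
second_item = select_next_item(ability, pool)
```

The sketch shows where the loop stalls: `select_next_item` for the second item needs the updated ability, which in turn needs a score for the first item.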