The present invention generally relates to systems and methods for assessment of constructed responses. More specifically, the invention relates to the presentation of constructed responses for human evaluation and the analysis of human evaluators' assessment.
Many tests require examinees to provide answers, or constructed responses, that include written words and essays or figural responses which can be scanned in as images. Other tests may require that examinees enter their responses in electronic format, using a computer application directly, such as the Computer Based Testing System disclosed in U.S. Pat. No. 5,565,316, assigned to Educational Testing Service and incorporated herein by reference. Automated computer-based systems have been developed to permit human evaluation of textual or figural responses on-line. However, other tests require review of responses in other, more complicated forms. For example, a test question, or prompt, could require an examinee to provide an oral response (Test of Spoken English, foreign language examinations, etc.) or to videotape a performance. Other test questions may require that an examinee create a diagram or drawing which is too complex for scanning to provide an appropriate representation for evaluation. The National Council of Architectural Registration Boards (NCARB) administers a licensing exam for architects in which an examinee's response is created through a specially designed computer application and may have multiple overlapping layers. The analysis of the responses to the NCARB exam requires human evaluators to precisely measure each line and angle to determine the appropriate score for the examinee. Therefore, a drawing application is a more appropriate environment for presentation of the constructed response to the human evaluator.
At present, a separate dedicated computer-based assessment system is required for each of these constructed response types to permit human evaluation on-line. Thus, there exists a need for one assessment system to dynamically determine which computer application will provide the optimum presentation capabilities for constructed responses in a variety of forms. It is further desired for a single assessment system to automatically initialize the chosen computer application and to present the constructed response to the human evaluators through the chosen computer application.
Furthermore, the need to monitor human evaluators to assure accuracy of assessment has been recognized. Presently, this has been accomplished only through presentation of monitoring papers, which have a predetermined score associated with them, or repeated presentation of the same constructed responses to ensure consistency. This is inefficient since it requires that the human evaluators take time to review and assess constructed responses which do not really require scores. Furthermore, repeated presentation of the same constructed responses is frustrating to the evaluators and does not provide for accurate assessment. Thus, there further exists a need for an assessment system capable of evaluating and monitoring the human evaluators to guarantee consistency and accuracy of grading without utilizing constructed responses which do not need assessment and, thus, wasting time and other resources.
Finally, the need to minimize the influence of extraneous factors on a human evaluator's assessment has been well documented. For example, the time of day that a constructed response is presented to a human evaluator may influence the score awarded. Thus, safeguards are required to ensure consistency and fairness when human evaluators are assessing constructed responses.
Test developers are also concerned with assessing the difficulty of test questions. To promote fairness, test questions presented to different examinees that are intended to be of the same difficulty should have highly consistent difficulty levels to prevent variations in difficulty of the test questions from affecting scores of the examinees.
Complex manual grading designs and methods have been used in the past to investigate the difficulty of test questions and the effect of outside influences on human evaluators. However, there exists a need for a computer-based assessment system which can be used as a tool in test and scoring development. There further exists a need for methods of presenting constructed responses to various human evaluators in a controlled manner so that the extraneous factors may be minimized. Finally, there exists a need for presenting constructed responses to human evaluators so that test question difficulty and human evaluator scoring may be assessed without the need for excessive repetition.
The present invention provides systems and methods for use in presenting constructed responses through various computer applications to human evaluators in a controlled manner to allow for monitoring and evaluation of both the human evaluators and the test questions. The systems and methods overcome the problems of the prior art systems described above and provide a more efficient and controllable monitoring and test development tool.
The systems of the present invention utilize a relational database for storing data related to the constructed responses, the human evaluators and the computer applications. The constructed responses can be categorized based on many things, including descriptive characteristics of the constructed response that are of interest to a particular research scientist; most frequently, they are categorized based on the prompt which elicited the response. Groups of related prompts, or the individual prompts, by which the constructed responses are categorized are referred to herein as constructed response categories. The database, or memory, generally holds the data so that each human evaluator is assigned to a plurality of constructed responses (via assignments to constructed response categories) which he will assess. Furthermore, in the database, each constructed response is stored in relation to at least one computer application which is capable of presenting the constructed response to the human evaluator so that a meaningful assessment may be made.
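By way of illustration only, the relationships described above could be laid out as in the following minimal sketch. The table names, column names, application names and sample rows are hypothetical and are not taken from the invention; for brevity each constructed response is related to exactly one computer application, although the description permits more than one.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption
# made for illustration, not the layout used by the invention.
SCHEMA = """
CREATE TABLE application (app_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE category (category_id INTEGER PRIMARY KEY, prompt TEXT NOT NULL);
CREATE TABLE evaluator (evaluator_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE response (
    response_id INTEGER PRIMARY KEY,
    category_id INTEGER NOT NULL REFERENCES category(category_id),
    app_id INTEGER NOT NULL REFERENCES application(app_id),
    location TEXT NOT NULL
);
CREATE TABLE assignment (
    evaluator_id INTEGER NOT NULL REFERENCES evaluator(evaluator_id),
    category_id INTEGER NOT NULL REFERENCES category(category_id),
    PRIMARY KEY (evaluator_id, category_id)
);
"""


def build_demo_db():
    """Create an in-memory database populated with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO application VALUES (?, ?)",
        [(1, "audio_player"), (2, "drawing_tool")],
    )
    conn.executemany(
        "INSERT INTO category VALUES (?, ?)",
        [(1, "spoken prompt"), (2, "site plan prompt")],
    )
    conn.execute("INSERT INTO evaluator VALUES (1, 'evaluator_a')")
    conn.executemany(
        "INSERT INTO response VALUES (?, ?, ?, ?)",
        [(10, 1, 1, "r10.wav"), (11, 2, 2, "r11.dwg"), (12, 2, 2, "r12.dwg")],
    )
    # The evaluator is assigned to a category, not to individual responses.
    conn.execute("INSERT INTO assignment VALUES (1, 2)")
    return conn


def responses_for(conn, evaluator_id):
    """Return (response_id, application name) pairs assigned to an evaluator."""
    return conn.execute(
        """SELECT r.response_id, a.name
           FROM assignment s
           JOIN response r ON r.category_id = s.category_id
           JOIN application a ON a.app_id = r.app_id
           WHERE s.evaluator_id = ?
           ORDER BY r.response_id""",
        (evaluator_id,),
    ).fetchall()
```

In this sketch the evaluator reaches individual constructed responses only through the category assignment, which mirrors the assignment via constructed response categories described above.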
The systems for presenting the constructed responses to human evaluators utilizing a related computer application further comprise at least one assessment station for the human evaluator to review the constructed responses and award a score. Furthermore, the systems utilize a processor for accessing the data in the database, for enabling an applicable computer application for use with the constructed response to be presented to the human evaluator and for presenting the constructed responses to the human evaluator. The system may further comprise a database, which could be the same relational database described above, for storing the scores awarded by the human evaluators to the constructed responses such that the score is stored in relation to both the constructed response and the human evaluator. In addition, the system of the present invention can utilize a plurality of assessment stations, wherein a human evaluator is assigned to each assessment station. In that case, a communication link between the processing means and the assessment stations may be used for transmitting the constructed responses from the database to the assessment stations and for transmitting scores from the assessment stations to the database.
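The selection of an applicable computer application and the storage of a score in relation to both the constructed response and the human evaluator might be sketched as follows. This is an illustrative assumption only: the presenter functions, the `PRESENTERS` registry and the `AssessmentStation` class are hypothetical names, and a real presenter would launch the application rather than return a string.

```python
from typing import Callable, Dict, Tuple


def present_audio(location: str) -> str:
    # Stand-in for launching an audio application (hypothetical).
    return f"playing {location}"


def present_drawing(location: str) -> str:
    # Stand-in for launching a drawing application (hypothetical).
    return f"opening {location} in drawing tool"


# Registry mapping an application name to the callable that presents with it.
PRESENTERS: Dict[str, Callable[[str], str]] = {
    "audio_player": present_audio,
    "drawing_tool": present_drawing,
}


class AssessmentStation:
    """One station used by one human evaluator (illustrative only)."""

    def __init__(self, evaluator_id: int) -> None:
        self.evaluator_id = evaluator_id
        # Scores are keyed by (response_id, evaluator_id).
        self.scores: Dict[Tuple[int, int], int] = {}

    def present(self, app_name: str, location: str) -> str:
        """Enable the application related to the response and present it."""
        if app_name not in PRESENTERS:
            raise ValueError(f"no presenter registered for {app_name!r}")
        return PRESENTERS[app_name](location)

    def record_score(self, response_id: int, score: int) -> None:
        """Store the score in relation to both response and evaluator."""
        self.scores[(response_id, self.evaluator_id)] = score
```

Keying the stored score by both identifiers is one simple way to keep the score related to the constructed response and to the human evaluator at the same time.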
The methods of the present invention of analyzing human evaluator assessments and difficulty of constructed response categories or individual test items also utilize a database as described above. The methods further comprise the steps of electronically transmitting a plurality of constructed responses assigned to two or more constructed response categories to a first human evaluator and a plurality of constructed responses assigned to two or more constructed response categories to a second human evaluator, wherein at least one of the constructed response categories is the same for the first and second human evaluator. The methods further provide for electronically receiving scores awarded by the first and second human evaluator for each of the constructed responses and storing the scores in a database. Based on the information to be obtained, the methods provide for comparing the scores awarded by the first and second human evaluators and the scores awarded to the constructed responses whose constructed response category was the same for both human evaluators to analyze the human evaluators' assessments and the difficulty of the question types. Preferably, a statistical computer application such as SAS or SPSS uses the data collected during the method described above to perform more complex analysis.
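A simple form of the comparison described above can be sketched as follows, assuming two hypothetical evaluators, three hypothetical categories and made-up scores. The use of plain means is an assumption for illustration; the description leaves the more complex analysis to a statistical package.

```python
from collections import defaultdict
from statistics import mean

# Made-up records: (evaluator, category, response_id, score).
SCORES = [
    ("E1", "A", 1, 4), ("E1", "A", 2, 3), ("E1", "B", 3, 2),
    ("E2", "A", 1, 3), ("E2", "A", 2, 3), ("E2", "C", 4, 5),
]


def shared_categories(scores, first, second):
    """Return the categories that both evaluators assessed."""
    seen = defaultdict(set)
    for evaluator, category, _, _ in scores:
        seen[evaluator].add(category)
    return seen[first] & seen[second]


def mean_by_evaluator(scores, category):
    """Mean score each evaluator awarded within one category."""
    grouped = defaultdict(list)
    for evaluator, cat, _, score in scores:
        if cat == category:
            grouped[evaluator].append(score)
    return {evaluator: mean(vals) for evaluator, vals in grouped.items()}


def mean_by_category(scores):
    """Mean score per category, pooled over evaluators."""
    grouped = defaultdict(list)
    for _, category, _, score in scores:
        grouped[category].append(score)
    return {category: mean(vals) for category, vals in grouped.items()}
```

Under these assumptions, a gap between the evaluators' means on the shared category hints at a difference in severity, while the pooled category means give a rough indication of relative difficulty.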
The methods of the present invention of controlling the presentation of the constructed responses to the human evaluators during an assessment session to control psychometric effects in the scoring process also utilize a database as described above. The methods further comprise the steps of assigning each constructed response to be assessed by at least two human evaluators, assigning each human evaluator to at least two constructed response categories and ordering the constructed responses to be presented to the human evaluators such that the human evaluators receive the constructed responses in a different order during the assessment session. This method may further comprise the steps of time shifting the constructed response categories to be assessed by each human evaluator during an assessment session. Furthermore, the constructed responses assigned to a particular human evaluator within a constructed response category may be selectively ordered.
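One possible way to give each human evaluator a different presentation order is sketched below. The rotation of categories by evaluator index stands in for time shifting, and the seeded shuffle stands in for selective ordering within a category; both choices, and the function and parameter names, are assumptions made for illustration rather than the ordering scheme of the invention.

```python
import random


def presentation_order(responses_by_category, evaluator_index, seed=0):
    """Build one evaluator's presentation order for an assessment session.

    responses_by_category maps a category name to a list of response ids.
    """
    categories = sorted(responses_by_category)
    # Time shift: rotate the category sequence by the evaluator's index.
    shift = evaluator_index % len(categories)
    rotated = categories[shift:] + categories[:shift]
    # A per-evaluator generator keeps the shuffle reproducible.
    rng = random.Random(seed * 1000 + evaluator_index)
    order = []
    for category in rotated:
        items = list(responses_by_category[category])
        rng.shuffle(items)  # order within the category
        order.extend(items)
    return order
```

For example, with two categories, evaluator 0 begins the session with the first category while evaluator 1 begins with the second, so no category is always assessed at the same point in the session.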