For many years, standardized tests have been administered to examinees for purposes such as educational testing and the evaluation of particular skills. For instance, academic skills tests, e.g., SATs, LSATs, GMATs, etc., are typically administered to a large number of students. Results of these tests are used by colleges, universities and other educational institutions as a factor in determining whether an examinee should be admitted to study at that particular institution. Other standardized testing is carried out to determine whether or not an individual has attained a specified level of knowledge, or mastery, of a given subject. Such testing is referred to as mastery testing, e.g., achievement tests offered to students in a variety of subjects, the results of which are used for college credit in those subjects.
Many of these standardized tests have essay sections. The essay portions of an exam typically require human graders to read each wholly unique essay answer. As one might expect, essay grading requires a significant number of work-hours, especially compared to machine-graded multiple choice questions. Essay questions, however, often provide a more well-rounded assessment of a particular test taker's abilities. It is, therefore, desirable to provide a computer-based automatic scoring system.
Typically, graders grade essays based on scoring rubrics, i.e., descriptions of essay quality or writing competency at each score level. For example, the scoring guide for a scoring range from 0 to 6 specifically states that a "6" essay "develops ideas cogently, organizes them logically, and connects them with clear transitions." A human grader simply tries to evaluate the essay based on descriptions in the scoring rubric. This technique, however, is subjective and can lead to inconsistent results. It is, therefore, desirable to provide an automatic scoring system that is accurate, reliable and yields consistent results.
Literature in the field of discourse analysis points out that lexical (word) and structural (syntactic) features of discourse can be identified (Mann, William C. and Sandra A. Thompson (1988): Rhetorical Structure Theory: Toward a functional theory of text organization, Text 8(3), 243-281) and represented in a machine for computer-based analysis (Cohen, Robin: A computational theory of the function of clue words in argument understanding, in "Proceedings of 1984 International Computational Linguistics Conference," California, 251-255 (1984); Hovy, Eduard, Julia Lavid, Elisabeth Maier, Vibhu Mittal and Cecile Paris: Employing Knowledge Resources in a New Text Planner Architecture, in "Aspects of Automated NL Generation," Dale, Hovy, Rosner and Stock (Eds), Springer-Verlag Lecture Notes in AI no. 587, 57-72 (1992); Hirschberg, Julia and Diane Litman: Empirical Studies on the Disambiguation of Cue Phrases, in "Computational Linguistics" 19(3), 501-530 (1993); and Vander Linden, Keith and James H. Martin: Expressing Rhetorical Relations in Instructional Text: A Case Study of the Purpose Relation, in "Computational Linguistics" 21(1), 29-57 (1995)).
Previous work in automated essay scoring, such as by Page, E. B. and N. Petersen: The computer moves into essay grading: updating the ancient test, Phi Delta Kappan, March, 561-565 (1995), reports that predicting essay scores using surface feature variables, e.g., the fourth root of the length of an essay, yields correlations as high as 0.78 between a single human rater (grader) score and machine-based scores for a set of PRAXIS essays. Using grammar checker variables in addition to word counts based on essay length yields up to 99% agreement, where agreement means that the machine-based score matches the human rater score within 1 point on a 6-point holistic rubric. The grammar checker variables add value because they may carry substantive information about writing competency that reflects rubric criteria such as "essay is free from errors in mechanics, usage and sentence structure."
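By way of illustration only, the surface feature and the two evaluation measures mentioned above (correlation with a human rater, and agreement within 1 point) can be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the method of the cited work: the function names `fourth_root_length`, `pearson` and `adjacent_agreement` are hypothetical, essay length is assumed to be a whitespace-delimited word count, and the scores in the example are invented toy values rather than reported results.

```python
import math


def fourth_root_length(essay: str) -> float:
    """Surface feature: fourth root of the essay length.

    Assumption: length is measured as a whitespace-delimited word count.
    """
    return len(essay.split()) ** 0.25


def pearson(xs, ys):
    """Pearson correlation between two equal-length score sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def adjacent_agreement(machine, human, tolerance=1):
    """Fraction of essays whose machine score is within `tolerance`
    points of the human rater score."""
    hits = sum(1 for m, h in zip(machine, human) if abs(m - h) <= tolerance)
    return hits / len(machine)


if __name__ == "__main__":
    # Invented toy scores on a 0-6 scale, for illustration only.
    human_scores = [2, 3, 3, 6]
    machine_scores = [1, 3, 5, 6]
    print(pearson(machine_scores, human_scores))
    print(adjacent_agreement(machine_scores, human_scores))
```

In such a sketch, the fourth-root feature would serve as one predictor variable in a regression against human scores, and the two measures would then be computed on the resulting machine-based scores.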