For many years, standardized tests have been administered to examinees for various reasons such as for educational testing or for evaluating particular skills. For instance, academic skills tests, e.g., SATs, LSATs, GMATs, etc., are typically administered to a large number of students. Results of these tests are used by colleges, universities and other educational institutions as a factor in determining whether an examinee should be admitted to study at that particular institution. Other standardized testing is carried out to determine whether or not an individual has attained a specified level of knowledge, or mastery, of a given subject. Such testing is referred to as mastery testing, e.g., achievement tests offered to students in a variety of subjects, and the results are used for college credit in such subjects.
Many of these standardized tests have essay sections. Essay questions are commonly looked upon as providing a more well-rounded assessment of a particular test taker's abilities. These essay portions of an exam, however, typically require human graders to read the wholly unique essay answers. As one might expect, essay grading requires a significant number of work-hours, especially compared to machine-graded multiple choice questions. It is, therefore, desirable to provide a computer-based automatic scoring system to evaluate written student essays more efficiently.
Typically, essays are graded based on scoring rubrics, i.e., descriptions of essay quality or writing competency at each score level. For example, the scoring rubric for a scoring range from 0 to 6 specifically states that a “6” essay “develops ideas cogently, organizes them logically, and connects them with clear transitions.” A human grader simply tries to evaluate the essay based on the descriptions in the scoring rubric. This technique, however, is subjective and can lead to inconsistent results. It is, therefore, desirable to provide an automatic scoring system that is accurate, reliable and yields consistent results.
Literature in the field of discourse analysis points out that lexical (word) and structural (syntactic) features of discourse can be identified (Mann, William C. and Sandra A. Thompson (1988): Rhetorical Structure Theory: Toward a functional theory of text organization, Text 8 (3), 243-281) and represented in a machine for computer-based analysis (Cohen, Robin: A computational theory of the function of clue words in argument understanding, in “Proceedings of 1984 International Computational Linguistics Conference,” California, 251-255 (1984); Hovy, Eduard, Julia Lavid, Elisabeth Maier, Vibhu Mittal and Cecile Paris: Employing Knowledge Resources in a New Text Planner Architecture, in “Aspects of Automated NL Generation,” Dale, Hovy, Rosner and Stock (Eds), Springer-Verlag Lecture Notes in AI no. 587, 57-72 (1992); Hirschberg, Julia and Diane Litman: Empirical Studies on the Disambiguation of Cue Phrases, in “Computational Linguistics,” 501-530 (1993); and Vander Linden, Keith and James H. Martin: Expressing Rhetorical Relations in Instructional Text: A Case Study in Purpose Relation, in “Computational Linguistics” 21(1), 29-57 (1995)).
Previous work in automated essay scoring, such as Page, E. B. and N. Petersen: The computer moves into essay grading: updating the ancient test, Phi Delta Kappan, March, 561-565 (1995), reports that predicting essay scores using surface feature variables, e.g., the fourth root of the length of an essay, yields correlations as high as 0.78 between a single human rater (grader) score and machine-based scores for a set of PRAXIS essays. Using grammar checker variables in addition to word counts based on essay length yields up to 99% agreement between machine-based scores and human rater scores, where agreement means a match within 1 point on a 6-point holistic rubric. These results using grammar checker variables have added value, since grammar checker variables may carry substantive information about writing competency that might reflect rubric criteria, such as whether the essay is free from errors in mechanics, and whether proper usage and sentence structure are present.
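The two statistics mentioned above (correlation between human and machine scores, and agreement within 1 point) can be illustrated with a minimal sketch. It is not the method of the cited work: the function names are hypothetical, the scores are made up for illustration, and measuring essay length in words is an assumption, since the text says only "the length of an essay."

```python
import math

def fourth_root_length(essay: str) -> float:
    """Surface feature: fourth root of the essay length.
    Assumption: length is counted in whitespace-separated words."""
    return len(essay.split()) ** 0.25

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def adjacent_agreement(machine, human, tolerance=1):
    """Fraction of essays whose two scores differ by at most `tolerance` points."""
    hits = sum(abs(m - h) <= tolerance for m, h in zip(machine, human))
    return hits / len(machine)

# Made-up scores on a 6-point scale, for illustration only.
human_scores = [1, 2, 3, 4, 5, 6]
machine_scores = [1, 3, 3, 4, 6, 4]

print(pearson(machine_scores, human_scores))
print(adjacent_agreement(machine_scores, human_scores))
```

Agreement within 1 point is a much looser criterion than exact agreement, which is why it can reach very high values even when the correlation is moderate.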
U.S. Pat. Nos. 6,181,909 and 6,366,759, both of which are assigned to Educational Testing Service, the assignee of the present application, and are herein incorporated by reference in their entirety, provide automated essay grading systems. For example, in U.S. Pat. No. 6,181,909, a method includes the automated steps of (a) parsing the essay to produce parsed text, wherein the parsed text is a syntactic representation of the essay, (b) using the parsed text to create a vector of syntactic features derived from the essay, (c) using the parsed text to create a vector of rhetorical features derived from the essay, (d) creating a first score feature derived from the essay, (e) creating a second score feature derived from the essay, and (f) processing the vector of syntactic features, the vector of rhetorical features, the first score feature, and the second score feature to generate a score for the essay. In U.S. Pat. No. 6,181,909, the essay is graded in reference to prompt-specific human-graded essays, wherein the human-graded essays are written in response to a specific essay prompt and are analyzed according to the same features as the essay to be graded. The essay scoring system includes several feature analysis programs which may evaluate essays based on syntactic features, rhetorical features, content features, and development/organizational features. The essay is graded based on a holistic grading scale, e.g., 1-6 scoring categories.
In known essay scoring engines, a set of four critical feature variables, referred to as predictor variables, is used to build a final linear regression model for predicting scores. All predictor variables and counts of predictor variables are automatically generated by several independent computer programs. In these scoring engines, all relevant information about the variables is introduced into a stepwise linear regression in order to evaluate the predictive variables, i.e., the variables that account for most of the variation between essays at different score intervals.
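The general idea of stepwise linear regression can be sketched as follows. This is a minimal forward-selection illustration under stated assumptions, not the procedure of any particular scoring engine: the feature names, the toy data, and the stopping rule (stop when a candidate improves the explained share of variation by less than `min_gain`) are all hypothetical, and the solver does no handling of collinear features.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting (no singularity handling in this sketch)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_sse(columns, y):
    """Ordinary least squares with an intercept, via the normal equations.
    Returns (coefficients, sum of squared errors)."""
    rows = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    sse = sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, t in zip(rows, y))
    return beta, sse

def forward_stepwise(features, y, min_gain=0.01):
    """Greedily add the feature that most reduces the squared error.
    Stop when the best candidate reduces the error by no more than
    `min_gain` times the total variation of y."""
    _, total = fit_sse([], y)  # intercept-only model: total variation
    selected, best = [], total
    remaining = sorted(features)
    while remaining:
        trials = {name: fit_sse([features[n] for n in selected + [name]], y)[1]
                  for name in remaining}
        name = min(remaining, key=trials.get)
        if best - trials[name] <= min_gain * total:
            break
        selected.append(name)
        remaining.remove(name)
        best = trials[name]
    beta, _ = fit_sse([features[n] for n in selected], y)
    return selected, beta

# Toy data: six essays, one informative feature and one unrelated feature.
features = {
    "syntactic_count": [1, 2, 3, 4, 5, 6],
    "unrelated_count": [3, 1, 4, 1, 5, 9],
}
scores = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]

selected, beta = forward_stepwise(features, scores)
print(selected, beta)
```

On this toy data the procedure keeps the informative feature and rejects the unrelated one, which mirrors the goal stated above: retaining the variables that account for most of the variation between essays at different score levels.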
U.S. Pat. No. 6,366,759 describes another method of grading an essay using an automated essay scoring system, the essay being a response to a test question. The steps include (a) deriving a vector of syntactic features from the essay; (b) deriving a vector of rhetorical features from the essay; (c) deriving a first score feature from the essay; (d) deriving a second score feature from the essay; and (e) processing the vector of syntactic features, the vector of rhetorical features, the first score feature, and the second score feature to generate a score for the essay. In U.S. Pat. No. 6,366,759, the essay is graded in reference to prompt-specific human-graded essays, wherein the human-graded essays are written in response to a specific essay prompt and are analyzed according to the same features as the essay to be graded. The essay scoring system includes several feature analysis programs which may evaluate essays based on syntactic features, rhetorical features, content features, and development/organizational features. The essay is graded based on a holistic grading scale, e.g., 1-6 scoring categories.
There is a need to develop systems and methods to automatically evaluate and grade essays and texts, wherein the score of the automatic analysis corresponds closely with human-based scoring, wherein the scoring does not require voluminous sample data in order to complete the automatic grading, wherein a set of features is developed to accurately evaluate an essay, wherein the feature set may be standardized, wherein the scoring model used to evaluate an essay is re-usable across multiple essay prompts, and wherein grading may be more standardized across all essay prompts.