This invention relates generally to document processing and to automated identification of discourse elements, such as thesis statements, in an essay.
Given the success of automated essay scoring technology, such applications have been integrated into current standardized writing assessments. The writing community has expressed an interest in the development of essay evaluation systems that include feedback about essay characteristics to facilitate the essay revision process.
Many factors contribute to the overall improvement of developing writers. These factors include, for example, refined sentence structure, variety of appropriate word usage, and organizational structure. The improvement of organizational structure is believed to be critical in the essay revision process toward overall essay quality. Therefore, it would be desirable to have a system that could identify the discourse elements in students' essays and indicate them to the students as feedback.
The invention facilitates the automatic analysis, identification and classification of discourse elements in a sample of text.
In one respect, the invention is a method for automated analysis of an essay. The method comprises the steps of accepting an essay; determining whether each of a predetermined set of features is present or absent in each sentence of the essay; for each sentence in the essay, calculating a probability that the sentence is a member of a certain discourse element category, wherein the probability is based on the determinations of whether each feature in the set of features is present or absent; and choosing a sentence as the choice for the discourse element category, based on the calculated probabilities. The preferred discourse element category is the thesis statement. The essay is preferably in the form of an electronic document, such as an ASCII file. The predetermined set of features preferably comprises the following: a feature based on the position of the sentence within the essay; a feature based on the presence or absence of certain words, wherein the certain words comprise words of belief that are empirically associated with thesis statements; and a feature based on the presence or absence of certain words, wherein the certain words comprise words that have been determined to have a rhetorical relation based on the output of a rhetorical structure parser. The calculation of the probabilities is preferably done in the form of a multivariate Bernoulli model.
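By way of illustration only, the probability calculation and sentence selection described above might be sketched as follows. The sketch assumes a two-category setting (thesis versus other) and treats each sentence as the set of features found present in it. The feature names, the probability values, and the function names are hypothetical and are chosen only for the example; they are not taken from any embodiment.

```python
import math

# Hypothetical feature names, for illustration only.
FEATURES = ("early_position", "belief_word", "rhetorical_relation")

# Hypothetical P(feature present | category); a real system would
# estimate these values from annotated essays.
P_FEATURE = {
    "thesis": {"early_position": 0.8, "belief_word": 0.6, "rhetorical_relation": 0.5},
    "other": {"early_position": 0.2, "belief_word": 0.1, "rhetorical_relation": 0.3},
}

# Hypothetical prior probability of each category.
P_CATEGORY = {"thesis": 0.1, "other": 0.9}


def log_joint(present, category):
    """Log of P(category) times the product, over all features, of
    P(present | category) or P(absent | category), as appropriate."""
    total = math.log(P_CATEGORY[category])
    for feature in FEATURES:
        p = P_FEATURE[category][feature]
        total += math.log(p if feature in present else 1.0 - p)
    return total


def thesis_probability(present):
    """Posterior probability that a sentence is a thesis statement."""
    log_thesis = log_joint(present, "thesis")
    log_other = log_joint(present, "other")
    peak = max(log_thesis, log_other)  # subtract the peak for numerical stability
    num = math.exp(log_thesis - peak)
    return num / (num + math.exp(log_other - peak))


def choose_thesis(sentence_features):
    """Return the index of the most probable thesis sentence and all probabilities."""
    probs = [thesis_probability(present) for present in sentence_features]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


if __name__ == "__main__":
    essay = [{"early_position", "belief_word"}, {"rhetorical_relation"}, set()]
    best, probs = choose_thesis(essay)
    print(best, [round(p, 3) for p in probs])
```

Note that, in a multivariate Bernoulli model, an absent feature contributes its own factor (one minus the probability of presence) rather than being ignored, which is why the loop runs over every feature in the set and not only over the features found in the sentence.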
In another respect, the invention is a process of training an automated essay analyzer. The training process accepts a plurality of essays and manual annotations demarking discourse elements in the plurality of essays. The training process accepts a set of features that purportedly correlate with whether a sentence in an essay is a particular type of discourse element. The training process calculates empirical probabilities relating to the frequency of the features and relating features in the set of features to discourse elements.
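As a rough sketch of how such empirical probabilities might be calculated, the following example counts how often each feature is present in sentences of each annotated category. The function name, the category labels, and the input layout (pairs of a feature set and a manual label) are hypothetical. The add-one smoothing controlled by the alpha parameter is an assumption made for the example, so that a feature never observed in a category does not receive a probability of exactly zero; it is not stated to be part of the training process.

```python
from collections import Counter


def train(annotated_sentences, features, categories=("thesis", "other"), alpha=1.0):
    """Estimate category priors and P(feature present | category).

    annotated_sentences: iterable of (present_features, category) pairs,
    where present_features is the set of features found in the sentence
    and category is the manually assigned label.
    alpha: smoothing constant (an assumption of this sketch).
    """
    category_counts = Counter()
    feature_counts = {category: Counter() for category in categories}

    for present, category in annotated_sentences:
        category_counts[category] += 1
        for feature in features:
            if feature in present:
                feature_counts[category][feature] += 1

    total = sum(category_counts.values())
    priors = {c: category_counts[c] / total for c in categories}

    # Each feature is binary (present or absent), hence 2 * alpha below.
    likelihoods = {
        c: {
            f: (feature_counts[c][f] + alpha) / (category_counts[c] + 2 * alpha)
            for f in features
        }
        for c in categories
    }
    return priors, likelihoods


if __name__ == "__main__":
    data = [
        ({"belief_word"}, "thesis"),
        (set(), "other"),
        ({"belief_word"}, "other"),
        (set(), "other"),
    ]
    priors, likelihoods = train(data, features=("belief_word",))
    print(priors, likelihoods)
```

The resulting tables have the form that a multivariate Bernoulli calculation needs: one prior per category and, for each category, one probability of presence per feature.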
In yet other respects, the invention is computer readable media on which are embedded computer programs that perform the above method and process.