Much work has been performed in the area of text document classification. For example, e-mail sorting has been proposed in Sahami, Dumais, Heckerman & Horvitz, “A Bayesian Approach to Filtering Junk E-Mail,” Learning for Text Categorization: Papers from the 1998 Workshop, AAAI Technical Report WS-98-05 (1998) and Cohen, Carvalho & Mitchell, “Learning to Classify Email into ‘Speech Acts,’” EMNLP 2004, each of which is incorporated herein by reference in its entirety.
Text-document classification is also performed by Internet-based search engines, such as are described in Joachims, “Optimizing Search Engines Using Clickthrough Data,” Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (2002) and McCallum, Nigam, Rennie & Seymore, “Building Domain-Specific Search Engines with Machine Learning Techniques,” AAAI-99 Spring Symposium, each of which is incorporated herein by reference in its entirety.
Other work teaches the classification of news articles, such as Allan, Carbonell, Doddington, Yamron & Yang, “Topic Detection and Tracking Pilot Study: Final Report,” Proceedings of the Broadcast News Transcription and Understanding Workshop, pp 194-218 (1998) and Billsus & Pazzani, “A Hybrid User Model for News Story Classification,” Proceedings of the Seventh International Conference on User Modeling (UM '99), Banff Canada (Jun. 20-24, 1999), each of which is incorporated herein by reference in its entirety.
Moreover, information in medical reports can be classified by text document classifiers, such as those taught by Hripcsak, Friedman, Alderson, DuMouchel, Johnson & Clayton, “Unlocking Clinical Data from Narrative Reports: A Study of Natural Language Processing,” Ann Intern Med 122(9): 681-88 (1995); and Wilcox & Hripcsak, “The Role of Domain Knowledge in Automating Medical Text Report Classification,” Journal of the American Medical Informatics Association 10:330-38 (2003), each of which is incorporated herein by reference in its entirety.
In addition, research has been performed in the area of automated essay scoring, such as by Page, “The Imminence of Grading Essays by Computer,” Phi Delta Kappan 48:238-43 (1966); Burstein et al., “Automated Scoring Using a Hybrid Feature Identification Technique,” Proceedings of 36th Annual Meeting of the Association for Computational Linguistics, pp 206-10 (1998); Foltz, Kintsch & Landauer, “The Measurement of Text Coherence Using Latent Semantic Analysis,” Discourse Processes 25(2-3): 285-307 (1998); Larkey, “Automatic Essay Grading Using Text Categorization Techniques,” Proceedings of the 21st ACM-SIGIR Conference on Research and Development in Information Retrieval, pp 90-95 (1998); and Elliott, “Intellimetric: From Here to Validity,” in Shermis & Burstein, eds., “Automated Essay Scoring: A Cross-Disciplinary Perspective” (2003), each of which is incorporated herein by reference in its entirety.
In the area of automated essay evaluation and scoring, systems have been developed that perform one or more natural language processing (NLP) methods. For example, a first NLP method includes a scoring application that extracts linguistic features from an essay and uses a statistical model of how these features are related to overall writing quality in order to assign a ranking or score to the essay. A second NLP method includes an error evaluation application that evaluates errors in grammar, usage and mechanics, identifies an essay's discourse structure, and recognizes undesirable stylistic features.
Additional NLP methods can provide feedback to essay writers regarding whether an essay appears to be off-topic. In this context, an off-topic essay is an essay that pertains to a different subject than other essays in a training corpus, as determined by word usage. Such methods presently require the analysis of a significant number of essays that are written to a particular test question (topic) and have been previously scored by a human reader to be used for training purposes.
One such method for determining whether an essay is off-topic computes two values from the vocabulary used in the essay. In the method, a “z-score” is calculated for each essay for each of two variables: a) a relationship between the words in the essay response and the words in a set of training essays written in response to the essay question to which the essay responds, and b) a relationship between the words in the essay response and the words in the text of the essay question. A z-score value indicates an essay's relationship to the mean and standard deviation values of a particular variable based on a training corpus of human-scored essay data from which off-topic essays are excluded. A z-score value is computed using the mean value and the corresponding standard deviation for the maximum cosine value or the essay question (prompt) cosine value based on the human-scored training essays for a particular test question. The z-score for a particular essay is computed as
\[ \text{z-score} = \frac{\text{value} - \text{mean}}{\text{std. dev.}} \]
In order to identify off-topic essays, z-scores are computed for: a) the maximum cosine value, which is the highest cosine value among all cosines between an essay and all human-scored training essays, and b) the essay question cosine value, which is the cosine value between an essay and the text of the essay question. When a z-score exceeds a pre-defined threshold, the essay is likely to be anomalous (i.e., off-topic), since the threshold is typically set to a value representing an acceptable distance from the mean.
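The two z-scores described above can be illustrated with a minimal Python sketch. It is not the implementation of any cited system. The whitespace tokenizer, the raw term-frequency weighting, the leave-one-out computation of the reference maximum cosine values, and all function names (`bag_of_words`, `cosine`, `z_score`, `off_topic_z_scores`) are assumptions made for illustration only. The sketch needs at least three training essays and a non-zero standard deviation for each variable.

```python
import math
from collections import Counter
from statistics import mean, stdev


def bag_of_words(text):
    """Term-frequency vector from lowercased whitespace tokens (an assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def z_score(value, reference_values):
    """(value - mean) / std. dev., with statistics taken from the training essays."""
    return (value - mean(reference_values)) / stdev(reference_values)


def off_topic_z_scores(essay, training_essays, question):
    """Return (z-score of maximum cosine, z-score of essay question cosine)."""
    train_vecs = [bag_of_words(t) for t in training_essays]
    q_vec = bag_of_words(question)

    # Reference distribution for the maximum cosine: each training essay is
    # compared against all the others (leave-one-out, an assumption).
    ref_max = [
        max(cosine(v, w) for j, w in enumerate(train_vecs) if j != i)
        for i, v in enumerate(train_vecs)
    ]
    # Reference distribution for the essay question cosine.
    ref_q = [cosine(v, q_vec) for v in train_vecs]

    e_vec = bag_of_words(essay)
    max_cos = max(cosine(e_vec, w) for w in train_vecs)
    q_cos = cosine(e_vec, q_vec)
    return z_score(max_cos, ref_max), z_score(q_cos, ref_q)
```

The function returns only the two z-scores. Comparing them against pre-defined thresholds is left to the caller, because the threshold values are specific to each test question.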
The accuracy of such an approach can be determined by examining the false positive rate and the false negative rate. The false positive rate is the percentage of appropriately written, on-topic essays that have been incorrectly identified as off-topic. The false negative rate is the percentage of off-topic essays not identified as off-topic. Typically, it is preferable to have a lower false positive rate so that a student is not incorrectly admonished for writing an off-topic essay.
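As a hedged illustration of these two definitions, the following Python sketch computes both rates from parallel lists of ground-truth labels and classifier decisions. The function name `error_rates` and its argument names are hypothetical. A rate is reported as NaN when its denominator is empty, for example when the evaluated set contains no on-topic essays.

```python
def error_rates(is_off_topic, flagged):
    """Return (false positive rate, false negative rate) as percentages.

    is_off_topic: ground truth per essay, True if the essay is off-topic.
    flagged: classifier decision per essay, True if identified as off-topic.
    """
    on_topic = [f for truth, f in zip(is_off_topic, flagged) if not truth]
    off_topic = [f for truth, f in zip(is_off_topic, flagged) if truth]

    # False positives: on-topic essays incorrectly identified as off-topic.
    fp_rate = 100.0 * sum(on_topic) / len(on_topic) if on_topic else float("nan")
    # False negatives: off-topic essays not identified as off-topic.
    fn_rate = (
        100.0 * sum(not f for f in off_topic) / len(off_topic)
        if off_topic
        else float("nan")
    )
    return fp_rate, fn_rate
```

For example, with four on-topic essays of which one is flagged, and two off-topic essays of which one is missed, the sketch returns a false positive rate of 25% and a false negative rate of 50%.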
For a particular essay set, the false positive rate using this method is approximately 5%, and the false negative rate is approximately 37%, when the z-scores of both the maximum cosine and essay question cosine measures exceed the thresholds. For bad faith essays, the average false negative rate is approximately 26%. A false positive rate is not meaningful for bad faith essays, since such essays are not written to any essay topic.
Such a method requiring a training corpus is particularly limiting where users of the method, such as teachers providing essay questions to students, require the ability to spontaneously generate new topics for their students. A training corpus is similarly limiting where content developers periodically desire to add new topics to a system embodying such methods. In either case, if a significant number of essays are to be scored, it would be preferable to automatically determine whether each essay is directed to the particular topic question.
None of the above methods provides a way to automatically evaluate whether an essay is off-topic without utilizing a training corpus of essays on a particular topic.
The present invention is directed to solving one or more of the above-listed problems.