One of the striking developments in computational linguistics in recent years has been the rapid progress in the automatic analysis of text, especially where the extraction of semantic content is concerned. This progress has largely been driven by the adoption of statistical, corpus-based techniques within natural language processing, the continued development of information extraction techniques, and the emergence of more effective algorithms for extracting particular aspects of linguistic and discourse structure, such as topic chains and rhetorical argument structures.
As a result of this progress in automated text analysis, effective applications have become a reality in a variety of fields, such as machine translation and automatic summarization. However, current automated text analysis applications tend to rely almost solely on lexical cooccurrence, a simple form of linguistic evidence, with very little analysis beyond the application of straightforward statistical techniques.
For example, the E-Rater™ essay scoring system, described in U.S. Pat. Nos. 6,181,909 and 6,366,759 to Burstein et al., which are incorporated herein by reference in their entireties, identifies aspects of the content, style and rhetorical structure of essays by using content vectors to induce simple measures of how closely the vocabulary in target essays matches the vocabulary usage of essays in a training set. The Criterion™ essay feedback system provides feedback regarding potential grammatical errors to student writers by identifying word bigrams with low mutual information (i.e., identifying word cooccurrences with unexpectedly low probability). The C-Rater™ short answer scoring system, described in U.S. Patent Publication No. 2003/0200077 by Leacock et al., which is incorporated herein by reference in its entirety, automatically scores short-answer questions by matching answers against an instructor rubric, using word similarity scores derived from corpus cooccurrence frequencies to support detection of paraphrase. Each of the E-Rater™, Criterion™ and C-Rater™ systems is the property of the Educational Testing Service.
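By way of illustration only, the notion of a word bigram with low mutual information can be sketched as follows. The sketch computes pointwise mutual information (PMI) for adjacent word pairs from simple maximum-likelihood estimates. It is not a description of how any of the above systems is implemented; the function name `bigram_pmi`, the use of PMI as the mutual information measure, and the toy corpus are assumptions made for the example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair in tokens.

    Hypothetical helper for illustration. Probabilities are plain
    maximum-likelihood estimates with no smoothing.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())

    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_joint = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        pmi[(w1, w2)] = math.log2(p_joint / (p_w1 * p_w2))
    return pmi


# Toy corpus (an assumption); a real system would train on a large corpus.
corpus = "the cat sat on the mat the cat ran".split()
scores = bigram_pmi(corpus)

# Pairs with the lowest PMI cooccur less often than their individual
# frequencies would predict, and are candidates for closer inspection.
lowest_first = sorted(scores.items(), key=lambda item: item[1])
```

In a sketch of this kind, a threshold on the PMI value would decide which bigrams to flag; the choice of threshold depends on the corpus and is left open here.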
An instance in which cooccurrence data is used independently of linguistic structure is Latent Semantic Analysis (LSA), which makes use only of word cooccurrence within the same document to produce calculations of semantic similarity. LSA similarity scores are generated by applying singular value decomposition to matrices representing the log of raw word frequency by document. The resulting matrices can be used to generate cosine similarity scores indicating how similar two words are in their distribution across documents, or how similar two documents are in their choice of vocabulary.
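The LSA computation described above can be sketched, under stated assumptions, as a log-weighted word-by-document matrix followed by a truncated singular value decomposition and a cosine comparison. The function names, the use of log(1 + count) so that zero counts remain zero, the number of retained dimensions `k`, and the toy matrix are all assumptions made for illustration; practical LSA implementations commonly apply additional weighting.

```python
import numpy as np


def lsa_vectors(counts, k):
    """Return (word_vectors, doc_vectors) in a k-dimensional latent space.

    counts: words x documents matrix of raw frequencies.
    Hypothetical helper for illustration only.
    """
    # Assumption: log(1 + count) stands in for "the log of raw word
    # frequency", because log(0) is undefined for absent words.
    weighted = np.log1p(np.asarray(counts, dtype=float))
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    # Keep the k largest singular values and scale each axis by them.
    word_vectors = u[:, :k] * s[:k]
    doc_vectors = vt[:k, :].T * s[:k]
    return word_vectors, doc_vectors


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


# Toy data (an assumption): rows are the words cat, dog, stock, bond;
# columns are two documents about animals and two about finance.
counts = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
]
words, docs = lsa_vectors(counts, k=2)
same_topic = cosine(words[0], words[1])    # cat vs. dog
cross_topic = cosine(words[0], words[2])   # cat vs. stock
```

In this toy matrix the two animal words never share a document with the two finance words, so the same-topic cosine comes out near 1 and the cross-topic cosine near 0. Note that the scores depend only on which documents the words appear in, not on any linguistic structure within those documents.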
Generally, cooccurrence-based similarity metrics seem to correlate with various psycholinguistic measures. However, when cooccurrence-based methods, such as LSA, fail, their failures are generally unlike degraded human performance (e.g., LSA judgments of semantic similarity can assign high scores to word pairs in which no reviewer can perceive a connection), and the correlations with human judgments are sometimes relatively weak.
While cooccurrence data alone can provide approximately 70 to 90 percent accuracy in some applications, such as parsing, and in complex applied tasks, such as essay scoring, improvement beyond such accuracy is unlikely to be achieved without resort to additional linguistic measures. This is so because, for example, the addition or subtraction of a single word can completely change the interpretation of an entire expression. Accordingly, the limitations of systems depending solely on cooccurrence data are evident.
Extensive literature addresses systems that use cooccurrence data to measure the distributional similarity of words. Such systems typically collect cooccurrence statistics, such as bigram and trigram frequency counts, word-by-document frequency counts or frequencies of word-word relationships from a grammatically analyzed corpus. Some systems then apply an analytical step, such as singular value decomposition, to improve the quality of the data. A similarity or dissimilarity metric, such as cosine similarity, the Kullback-Leibler divergence or the like, is then applied to yield a ranking which estimates the degree to which any pair of words has similar or dissimilar distributions.
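The general pipeline described above (collect cooccurrence statistics, then rank words by a similarity or dissimilarity metric) can be sketched as follows, using bigram counts and the Kullback-Leibler divergence. The optional analytical step, such as singular value decomposition, is omitted. The function names, the choice of the immediately following word as the context, the add-one smoothing used to avoid zero probabilities, and the toy corpus are assumptions made for illustration and do not describe any particular published system.

```python
import math
from collections import Counter, defaultdict


def context_counts(tokens):
    """Map each word to a Counter of the words that immediately follow it."""
    contexts = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        contexts[w1][w2] += 1
    return contexts


def smoothed_distribution(counter, vocab, alpha=1.0):
    """Turn context counts into probabilities over vocab.

    Assumption: additive smoothing with constant alpha, so that the
    divergence below never divides by zero.
    """
    total = sum(counter.values()) + alpha * len(vocab)
    return [(counter[w] + alpha) / total for w in vocab]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rank_by_divergence(target, tokens, alpha=1.0):
    """Rank all other words by divergence from target, most similar first."""
    contexts = context_counts(tokens)
    vocab = sorted(set(tokens))
    p = smoothed_distribution(contexts[target], vocab, alpha)
    scores = {
        w: kl_divergence(p, smoothed_distribution(contexts[w], vocab, alpha))
        for w in vocab
        if w != target
    }
    return sorted(scores.items(), key=lambda item: item[1])


# Toy corpus (an assumption): "cat" and "dog" occur in identical contexts.
corpus = ("the cat eats fish the dog eats fish "
          "the cat drinks milk the dog drinks milk").split()
ranking = rank_by_divergence("cat", corpus)
```

Because "cat" and "dog" are followed by exactly the same words in this toy corpus, "dog" is ranked first with a divergence of zero. The Kullback-Leibler divergence is asymmetric, so ranking from "dog" toward "cat" is a separate computation, although it yields the same value in this symmetric example.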
Such systems have well-known limitations and problems. First, the results are only as good as the corpus used for training. Second, the results are far more reliable for common words than for rarer words, for which data are scarce. Finally, these systems ignore important linguistic distinctions, such as the difference between different senses of the same word. Accordingly, the outputs of such systems are typically noisy (i.e., words or phrases having low similarity often appear in result lists).
What are needed are methods and systems for improving the accuracy of text analysis over methods and systems solely using lexical cooccurrence.
A need exists for methods and systems of automatically analyzing text using measurements of lexical structure.
A further need exists for methods and systems for determining the fundamental organizational properties of grammar.
The present invention is directed to solving one or more of the above-listed problems.