1. Field of the Invention
The present invention relates to techniques for comparing textual information. More specifically, the present invention relates to a method and an apparatus for representing the semantic content of textual information within a document to facilitate document comparison operations.
2. Related Art
A wide range of computer-based applications involve comparison of textual data. For example, applications that perform information retrieval, information filtering and text classification typically involve a large number of comparisons of textual data. In existing systems, a vector-space representation of the documents is typically computed, and the inner product or cosine between the document vectors is then used as a measure of similarity between the documents. The input data for such systems typically consists of a static document collection, which can be used to compute a vector-space representation of the documents. It is also desirable for such systems to include a method for computing a representation for out-of-collection documents, such as user queries for information retrieval operations.
Finding a meaningful vector-space representation for textual data is a difficult problem. One of the most widely used approaches is to represent documents as vectors in a multidimensional space in which each dimension corresponds to one vocabulary term (see G. Salton and M. J. McGill, “Introduction to Modern Information Retrieval,” McGraw-Hill, New York, 1983). The measure of similarity between two documents used under this representation is based on matching terms that occur in both documents.
Unfortunately, this representation has drawbacks. It loses all word order information by treating documents as “bags of words.” More importantly, this representation treats all terms as orthogonal dimensions and thus disregards any semantic connections between terms within a document. Consequently, this representation cannot directly handle language phenomena such as ambiguity, synonymy and polysemy.
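By way of illustration only, the term-based representation and its drawbacks can be sketched as follows. This is a minimal sketch and not an implementation from any cited system: the function names `term_vector` and `cosine`, the whitespace tokenization, the use of raw term counts as weights, and the sample phrases are all assumptions made for the example.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a document to term counts; each distinct term is one dimension.

    Assumption: lowercase whitespace tokenization and raw counts as weights.
    """
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    # Only terms present in both documents contribute to the inner product.
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Word order is lost: both phrases map to the same bag of words.
a = term_vector("dog bites man")
b = term_vector("man bites dog")
order_similarity = cosine(a, b)  # approximately 1.0

# Terms are orthogonal: related phrases with no shared term do not match.
c = term_vector("car repair")
d = term_vector("automobile maintenance")
synonym_similarity = cosine(c, d)  # 0.0, no terms in common
```

In this sketch the two phrases with opposite meanings are indistinguishable, while the two phrases with closely related meanings are scored as entirely dissimilar, which is the behavior described above.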
Hence, what is needed is a method and an apparatus for representing textual data to facilitate comparison operations without the above-described problems.