Much work has been done to develop systems in which representations of documents are used to facilitate information retrieval. Many of these systems rely on large numbers of training texts that have been pre-labeled by a human. Other systems utilize clustering or latent semantic indexing to categorize documents but are limited to a single category per document. Still other systems, such as the present invention, are capable of assigning multiple topics to a single document without relying on pre-labeled training texts. These systems typically represent the topical content of a document in a structured format, such as a vector or matrix.
Compact, structured representations of documents have long been used in information retrieval systems [Salton, G. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Inc., Englewood Cliffs, N.J., 1971]. These usually take the form of multidimensional vectors in which the dimensions are words or phrases taken from the document, and the magnitude of a dimension is a measure of the frequency of occurrence of the word or phrase in the document.
In the simplest case, a document's vector is computed by making an entry in the vector for each unique word occurring in the document, and setting the magnitude of that word equal to the number of times it occurs in the document.
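As an illustration of this simplest case, the following is a minimal sketch, not taken from any cited system. The function name `term_frequency_vector` and the regular-expression tokenizer are assumptions made for the example; the vector is held as a mapping from each unique word to its count.

```python
import re
from collections import Counter


def term_frequency_vector(text):
    """Map each unique word in `text` to the number of times it occurs."""
    # Assumed tokenizer: lowercase the text and keep runs of letters.
    words = re.findall(r"[a-z]+", text.lower())
    # One entry per unique word; the magnitude is the raw occurrence count.
    return Counter(words)


vector = term_frequency_vector("The cat sat on the mat. The mat was red.")
print(vector["the"], vector["mat"], vector["cat"])
```

A sparse mapping is used here rather than a fixed-length array because most words of a vocabulary do not occur in any one document.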
Variations of this simple case include filters on the words selected for inclusion in the vector (for example, eliminating words that do not contribute to a specific topical representation of the document, such as conjunctions, or eliminating words not present in a predetermined lexicon); stemming of words prior to computing frequency measures; and variations of the frequency measure itself, such as weighting by inverse document frequency.
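Two of these variations, word filtering and inverse document frequency weighting, can be sketched as follows. This is an illustrative example only: the stop-word list, the function names, and the particular weighting (raw count multiplied by log(N / document frequency)) are assumptions, and stemming is omitted for brevity.

```python
import math
import re
from collections import Counter

# Illustrative filter list (an assumption); a real system would use a fuller one.
STOP_WORDS = {"and", "or", "but", "the", "a", "an", "of", "on", "in", "is"}


def tokenize(text):
    """Lowercase, split into words, and drop filtered words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tf_idf_vectors(documents):
    """Return one {word: weight} vector per document."""
    token_lists = [tokenize(d) for d in documents]
    n_docs = len(token_lists)
    # Document frequency: the number of documents containing each word.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Assumed weighting: raw count times log(N / document frequency).
        vectors.append(
            {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return vectors


docs = ["The cat sat on the mat", "The dog sat on the log", "A cat and a dog"]
vectors = tf_idf_vectors(docs)
```

Under this weighting, a word found in only one document of the collection (here "mat") receives a larger magnitude than an equally frequent word found in several documents (here "cat").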
Practitioners of vector-based information retrieval also use vectors whose elements are not single words or phrases but pairs or, more generally, groups of words or phrases. These groups are usually selected on the basis of co-occurrence: words or phrases are grouped as a vector element if they occur together, i.e., consecutively, in the document. The measure used to assign magnitudes to the elements of a vector formed in this manner is a relation on individual eligible words or phrases from the document. The value of this relation for a particular group of words is taken as the magnitude of the corresponding vector element.
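A minimal sketch of such a vector, restricted to pairs of consecutive words, is given below. The text does not specify the relation used to assign magnitudes, so this example assumes the simplest choice, namely the number of times the pair occurs consecutively; the function name `pair_vector` and the optional `eligible` word set are likewise hypothetical.

```python
import re
from collections import Counter


def pair_vector(text, eligible=None):
    """Map each pair of consecutive eligible words to its occurrence count."""
    words = re.findall(r"[a-z]+", text.lower())
    # Pair each word with its immediate successor, then keep only pairs in
    # which both words are eligible (all words are eligible if no set is given).
    pairs = (
        (a, b)
        for a, b in zip(words, words[1:])
        if eligible is None or (a in eligible and b in eligible)
    )
    # Assumed relation: count of consecutive co-occurrences of the pair.
    return Counter(pairs)


vector = pair_vector("Information retrieval systems support information retrieval")
print(vector[("information", "retrieval")])
```

Eligibility is checked after pairing so that removing an ineligible word does not make two otherwise non-adjacent words appear consecutive.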