The present invention relates generally to determining and discerning items with multiple meanings, and more particularly, but not by way of limitation, to an unsupervised and non-dictionary based system, method, and recording medium for determining “senses” in which a word appears and associating the words with particular occurrences in the input sequence.
Conventional techniques for determining and discerning the multiple meanings of words have proven useful in applications that require discerning multiple meanings. Generally, their success depends on using external resources such as dictionaries, thesauri, and user input. To date, distributed representations have not been used in discerning meanings, although they have proved useful in other domains such as analogy determination. However, despite the many uses suggested for word vectors, their internal structure remains opaque. That is, conventional techniques based on distributed representations do not make it possible to discern between words or items with multiple meanings with any reliability, simply because it is unknown how to decode the vector entries in distributed representations.
Conventional systems for determining and discerning items with multiple meanings cannot discern such items with high reliability without external resources. For example, the word “party” may appear in a sequence. By examining the distributed representation of words in the sequence, conventional systems cannot discern between the multiple meanings of the word “party”, which could mean a celebration, a group of people, a political association, etc.
Discerning the different meanings of words in a corpus has been the subject of much research. Conventional approaches include an unsupervised method based on word vectors that discriminates between senses of a word without labeling (or explaining) these senses. The method involves no learning, although it suggests using singular value decomposition (SVD) to reduce the vectors' dimensionality. Each word w is associated with a word vector (different from the ones used in this patent) whose dimensions are words occurring in a window around the word w. The vector entry in a dimension is the number of occurrences of the word of this dimension in the windows around occurrences of w in the training text. Given a particular word w to be disambiguated, one forms context vectors for w over the training text. A context vector is simply the normalized average vector of the word vectors of words in a window around an occurrence of w. The collection of context vectors for w is partitioned into clusters, and the average vector of each cluster represents the cluster. Given an occurrence of w in some test text, a context vector c for this occurrence of w is constructed. The cluster whose representative vector is cosine-closest to c defines the sense of this “new” occurrence of w in the test text.
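The conventional context-clustering method described above can be illustrated by a minimal Python sketch. It is offered only as an illustration under stated assumptions and is not taken from the cited method or from the present invention. The description does not say how the context vectors are partitioned, so a k-means-style loop with cosine similarity and farthest-first initialization is assumed. The optional SVD step is omitted, windows are assumed not to cross sentence boundaries, and the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window):
    """Word vector of w: counts of the words seen within `window` positions of w."""
    vecs = defaultdict(Counter)
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    vecs[w][s[j]] += 1
    return vecs

def mean(vectors):
    total = Counter()
    for v in vectors:
        for key, x in v.items():
            total[key] += x
    return {key: x / len(vectors) for key, x in total.items()}

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {key: x / norm for key, x in vec.items()} if norm else dict(vec)

def cosine(a, b):
    dot = sum(x * b.get(key, 0.0) for key, x in a.items())
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def context_vector(sentence, i, vecs, window):
    """Normalized average of the word vectors of the neighbours of sentence[i]."""
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    near = [vecs[sentence[j]] for j in range(lo, hi)
            if j != i and sentence[j] in vecs]
    return normalize(mean(near)) if near else {}

def contexts_of(word, sentences, vecs, window):
    return [context_vector(s, i, vecs, window)
            for s in sentences for i, w in enumerate(s) if w == word]

def nearest(c, centroids):
    """Index of the cluster whose representative vector is cosine-closest to c."""
    return max(range(len(centroids)), key=lambda k: cosine(c, centroids[k]))

def cluster(contexts, k, iterations=10):
    # Assumption: farthest-first seeding followed by k-means-style updates.
    centroids = [contexts[0]]
    while len(centroids) < k:
        centroids.append(min(
            contexts, key=lambda c: max(cosine(c, m) for m in centroids)))
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for c in contexts:
            groups[nearest(c, centroids)].append(c)
        centroids = [mean(g) if g else m for g, m in zip(groups, centroids)]
    return centroids

# Hypothetical toy training text with two senses of "party".
train = [["birthday", "party", "cake", "balloons"],
         ["political", "party", "election", "votes"],
         ["balloons", "party", "cake", "birthday"],
         ["votes", "party", "election", "political"]]
WINDOW = 2
vecs = cooccurrence_vectors(train, WINDOW)
centroids = cluster(contexts_of("party", train, vecs, WINDOW), k=2)

# Sense indices assigned to two new occurrences of "party" in test text.
festive = nearest(context_vector(["cake", "party", "balloons"], 1, vecs, WINDOW), centroids)
political = nearest(context_vector(["election", "party", "votes"], 1, vecs, WINDOW), centroids)
```

In this sketch the two test occurrences receive different cluster indices, but the indices are bare numbers: nothing in the output labels or explains what each sense means, which is the limitation noted above.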
Thus, there is a technical problem in the conventional systems that use distributed representations to determine and discern items with multiple meanings, as they have no capability to discern between words having multiple meanings together with an explanation of said meanings. More specifically, the conventional methods have the technical problem that they do not perform two trainings, one over the original text and one over the modified text in which each word is replaced by its “sense”, so as to explain the “senses” using the text itself. Further, the conventional methods do not use class-average context vectors as trained in this application, as such vectors are not present in the conventional techniques. Accordingly, the conventional systems cannot help a user to comprehend the sequence without a dictionary.