The present invention relates generally to vector representation of words (and, more generally, of items) in a language, and more particularly, but not by way of limitation, to an unsupervised and non-dictionary based vector representation system, method, and recording medium for aiding in understanding the meaning of word vector entries.
Conventional distributed vector representations of words have proven useful in applications such as solving analogy problems. Despite the many uses suggested for word vectors, their internal structure remains opaque. That is, the conventional distributed vector representations do not make it possible to assign meaning to the dimensions of the vectors.
For example, one conventional distributed vector representation has attempted to train and obtain 200-dimensional vectors for English words. Then, the conventional technique has tried to solve an analogy problem such as “king to man is like what to woman” by finding the vector closest to Vking−Vman+Vwoman (i.e., Vqueen).
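The analogy computation described above can be sketched as follows. This is a minimal illustration only: the three-dimensional vectors are hand-made toy values (not trained 200-dimensional vectors), cosine similarity is assumed as the closeness measure, and the names `VECTORS`, `cosine`, and `solve_analogy` are hypothetical.

```python
import math

# Toy, hand-made vectors for illustration only; real trained
# vectors would have, e.g., 200 dimensions.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.2],
    "woman": [0.1, 0.1, 0.2],
    "queen": [0.9, 0.1, 0.1],
    "apple": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(a, b, c, vectors):
    """Return the word whose vector is closest to V_a - V_b + V_c.

    The three query words are excluded from the candidates
    (an assumption commonly made when solving analogies).
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = solve_analogy("king", "man", "woman", VECTORS)
print(answer)
```

With these toy values, the target vector Vking−Vman+Vwoman lies closest to the vector for "queen". Nothing in the computation requires that any individual dimension carry an interpretable meaning.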
However, the conventional distributed vector representations do not associate each of the 200 dimensions of Vking with a clear property such as status, gender, nationality, age, weight, hunting ability, historical period, etc.
That is, there is a technical problem in the conventional distributed vector representation systems in that they provide no capability to decode the properties of words, because the vector dimensions of the words have no clear semantic meaning. Accordingly, even when a word vector has a dominant dimension (one with a large absolute value), the intensity of the word on that dominant dimension has no apparent meaning.
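To make the notion of a dominant dimension concrete, the following minimal sketch locates the entry with the largest absolute value in a word vector. The vector values and the name `dominant_dimension` are hypothetical and chosen only for illustration.

```python
def dominant_dimension(vec):
    """Return (index, value) of the entry with the largest absolute value."""
    index = max(range(len(vec)), key=lambda i: abs(vec[i]))
    return index, vec[index]

# Hypothetical word vector (shortened to 4 dimensions for readability).
v_word = [0.12, -0.87, 0.33, 0.05]

index, value = dominant_dimension(v_word)
print(index, value)
```

In a conventional representation, the result is only an index and a number: nothing indicates which property (status, gender, age, and so on), if any, that index corresponds to, or what the magnitude of the value signifies.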