A bag-of-words (BOW) representation for text documents is known as one of the most effective vector representations for text classification and clustering. In this representation, each word corresponds to one dimension in an n-dimensional weight vector, where n is the number of words in the vocabulary. The i-th dimension of the weight vector represents the weight of the i-th (i=1, . . . , n) word in the vocabulary. As the word weight, the number of times the word occurs in a text document (frequency count) can be used, or another weighting scheme such as tf-idf (Term Frequency, Inverse Document Frequency).
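As an illustration only, the frequency-count variant of the BOW representation might be sketched as follows. The function name bow_vector, the five-word vocabulary, and the simple lowercase whitespace tokenization (with no stemming) are assumptions made for this sketch and are not taken from the cited literature.

```python
from collections import Counter


def bow_vector(text, vocabulary):
    """Return a frequency-count BOW vector for `text`.

    The i-th entry is the number of times the i-th vocabulary word
    occurs in the text. Tokenization is a deliberately simple
    assumption: lowercase, strip periods, split on whitespace.
    """
    tokens = text.lower().replace(".", " ").split()
    counts = Counter(tokens)
    # Counter returns 0 for words that do not occur in the text.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary chosen for this sketch (n = 5).
vocabulary = ["love", "playing", "soccer", "park", "meadows"]
vector = bow_vector("I love playing soccer in the park.", vocabulary)
print(vector)
```

Words outside the vocabulary, such as “I” and “the”, are ignored in this sketch, and a word of the vocabulary that does not occur in the text receives the weight 0.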
However, one major drawback of the BOW representation is that it does not model semantic relationships between words. For example, a text that contains the word “park” but not the word “meadow” may in fact be similar to a text that contains the word “meadow” but not the word “park”. Because the BOW representation assigns each word its own independent dimension, it cannot take into account that “meadow” and “park” can be synonymous, and thus fails to detect that two texts like “I love playing soccer in the park.” and “I like playing in the meadows.” are similar.
One solution to this problem is to map the text representation to a lower-dimensional semantic space using, for example, Latent Semantic Indexing (LSI) or Latent Dirichlet Allocation (LDA), as described in NPL 1. However, these methods cannot make use of word similarity information that is known a priori.
As a different solution that is able to use such word similarity information, a bag-of-clusters (BOC) representation is disclosed in NPL 2.
The BOC representation uses given word similarities to create word clusters, and then represents a text using these clusters.
FIG. 10 is a diagram illustrating an example of the BOC representation. In the example of FIG. 10, four clusters C1:={“meadows”, “park”}, C2:={“soccer”, “baseball”}, C3:={“play”}, and C4:={“love”} have been created for the vocabulary {“love”, “play”, “soccer”, “park”, “baseball”, “meadows”}. For the text “I love playing soccer in the park”, a BOC representation {f_C1=1, f_C2=1, f_C3=1, f_C4=1} is generated, wherein f_Cx indicates the weight for the cluster Cx. For the text “I love playing baseball in the meadows”, the same BOC representation is generated. As shown in FIG. 10, the BOC representation can be regarded as a BOW model in which all words that are grouped into the same cluster have the same weight.
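The example of FIG. 10 might be reproduced with the following minimal sketch. It assumes that the weight of a cluster is the number of occurrences in the text of the words belonging to that cluster, which is one possible weighting and not necessarily the one used in NPL 2. The function name boc_vector and the NORMALIZE table, which stands in for a real stemming step by mapping “playing” to “play”, are hypothetical.

```python
# Clusters as given in the example of FIG. 10.
clusters = {
    "C1": {"meadows", "park"},
    "C2": {"soccer", "baseball"},
    "C3": {"play"},
    "C4": {"love"},
}

# Hypothetical stand-in for stemming, assumed for this sketch only.
NORMALIZE = {"playing": "play"}


def boc_vector(text, clusters):
    """Return a dict mapping each cluster name to its weight.

    Assumption: the weight of a cluster is the total number of tokens
    in the text that belong to that cluster.
    """
    tokens = [
        NORMALIZE.get(token, token)
        for token in text.lower().replace(".", " ").split()
    ]
    return {
        name: sum(1 for token in tokens if token in words)
        for name, words in clusters.items()
    }


boc_a = boc_vector("I love playing soccer in the park", clusters)
boc_b = boc_vector("I love playing baseball in the meadows", clusters)
print(boc_a)
print(boc_a == boc_b)
```

Under these assumptions, both texts yield the weight 1 for each of the four clusters, so the two representations coincide even though the texts share neither “soccer” nor “park”.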
Note that, as related technologies, PTL 1 discloses a method for reducing the appearance frequency of a word in document data based on the style to which the document data belongs. PTL 2 discloses a technology for complementing evaluation expressions missing from a sentence by using feature data of sentences that include the evaluation expressions and of sentences that exclude them. PTL 3 discloses a method of speech recognition using continuous Mahalanobis DP (Dynamic Programming).