Data representation methods are known which represent an object as a collection of independent items disregarding the relationship and structure between the items. An example of such a method is the so-called “bag-of-words” (BoW) method.
The bag-of-words method has been widely used in natural language processing (NLP), where the data object is a text document and the items are an unordered collection of the words which occur in the document. It is also used in the computer vision area, where the data object is an image and the items are codewords from a codebook.
The BoW method expresses the document as a set of individual words. However, the bag of words may not sufficiently express the semantic information of the document. For example, if a document contains the date “September 11”, neither “September” nor “11” taken alone correctly conveys the real information of the document. If the BoW method is used to select similar documents, many unrelated documents, such as those mentioning “Ocean's Eleven” or “Black September”, may be selected. Moreover, because of polysemy, a single word may be semantically ambiguous. For example, “curry” can be a kind of food or the name of an IT shop. This kind of ambiguity degrades the performance of BoW-type analysis.
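The limitation described above can be illustrated with a minimal Python sketch. The tokenizer, the overlap score, and the sample sentences are all assumptions introduced purely for illustration; they are not taken from any particular BoW implementation. The sketch counts word occurrences while discarding word order, and then scores each document by the number of word occurrences it shares with the query.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into alphanumeric tokens and count them.

    Word order and any relationship between the words are discarded.
    (Hypothetical helper; the tokenization rule is an assumption.)
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query, document):
    """Count the word occurrences shared by the two bags (multiset intersection)."""
    shared = bag_of_words(query) & bag_of_words(document)
    return sum(shared.values())


query = "September 11"

# Illustrative sentences only; they are not drawn from any real corpus.
documents = [
    "the events of September 11",
    "11 films were released in September",
    "a film about Black September",
    "the weather report for today",
]

scores = [overlap(query, d) for d in documents]

# Reordering the words of the query leaves its bag unchanged.
same_bag = bag_of_words("September 11") == bag_of_words("11 September")

for doc, score in zip(documents, scores):
    print(score, doc)
```

In this sketch the first two documents receive the same score although only the first refers to the date, and the third document still matches on “September” alone. Because the bag discards order, “September 11” and “11 September” also produce identical representations.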