1. Technical Field
The present invention relates to document classification and, more particularly, to document classification using supervised weighted n-gram embedding.
2. Description of the Related Art
The task of document classification is defined as automatic assignment of one or more categorical labels to a given document. Examples of document classification include topic categorization, sentiment analysis, and formality studies. A document may include a sentence, paragraph, or any snippet of text—the term is defined herein to encompass all such objects.
Previous techniques applied to this task are either generative or discriminative supervised methods. Discriminative document classification techniques commonly rely on the so-called “bag-of-words” (BoW) representation, which maps text articles of variable length into a fixed-dimensional vector space parameterized by a finite vocabulary. The BoW model treats a document as an unordered collection of word-features and uses the distribution of the words as the primary evidence for its classification. The “bag-of-unigrams” is the most common form of BoW representation and uses a word dictionary as its vocabulary.
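By way of illustration only, the bag-of-unigrams mapping described above may be sketched as follows. The function name, the whitespace tokenization, and the four-word vocabulary are assumptions made for this example and are not taken from any particular prior technique.

```python
from collections import Counter


def bag_of_unigrams(text, vocabulary):
    """Map a document of any length to a count vector indexed by `vocabulary`."""
    # Assumed tokenization: lowercase, then split on whitespace.
    counts = Counter(text.lower().split())
    # One dimension per vocabulary word; words outside the vocabulary are ignored.
    return [counts[word] for word in vocabulary]


# Hypothetical vocabulary, chosen only for this example.
vocabulary = ["good", "bad", "movie", "generally"]

vec_a = bag_of_unigrams("generally good movie", vocabulary)
vec_b = bag_of_unigrams("movie good generally", vocabulary)
```

Both documents map to the same vector because the representation keeps only word counts and discards word order.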
Some classification attempts have employed short phrases, which can be more effective than single words (unigrams) for the task. Such attempts extend the “bag-of-unigrams” model by incorporating n-grams (contiguous sequences of n words) in the vector space representation of the text. However, the complexity of modeling n-grams grows exponentially with n for a given dictionary size. Considering the n-gram cases for an English word dictionary $D$ of size $|D|$, bi-gram and trigram representations of text involve $|D|^2$ and $|D|^3$ free parameters, respectively.
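By way of illustration only, n-gram extraction and the resulting growth in the number of possible features may be sketched as follows. The function names, the whitespace tokenization, and the dictionary size of 10,000 words are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens, in document order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_parameter_count(dictionary_size, n):
    """Number of distinct n-grams possible over a dictionary of the given size."""
    return dictionary_size ** n


tokens = "the movie was generally good".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

# Assumed dictionary size, for illustration only.
assumed_dictionary_size = 10_000
bigram_params = ngram_parameter_count(assumed_dictionary_size, 2)   # 10**8
trigram_params = ngram_parameter_count(assumed_dictionary_size, 3)  # 10**12
```

Under the assumed dictionary size, moving from bi-grams to trigrams multiplies the number of possible features by the dictionary size itself, which is why higher-order n-gram models quickly become impractical to parameterize directly.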
Despite the simplicity and relative success of document classification using n-gram features, previous models disregard all spatial and ordering information of the n-grams, even though such information is important for text data. In the example of sentiment analysis, phrases with strong polarity may be more important in deciding the polarity of the whole document. For example, a document with the phrase “generally good” at its beginning is very likely to express a positive sentiment. When the same phrase appears in the middle of another text, the document is less likely to be a positive comment. Similarly, the start and the end of sentences or paragraphs in an on-line news article might contain more critical and subjective information than its other parts. Completely capturing such relationships would require full semantic understanding, which is beyond the current state of technology.