The present application is directed to automated classification and more particularly to automated sentiment classification, where sentiment classification is understood to be a specific type of text categorization that works to classify the opinion or sentiment of information, such as in the form of text, as it relates to a particular topic or subject.
Two typical approaches to sentiment analysis are lexicon look up and machine learning. A lexicon look up approach normally starts with a lexicon of positive and negative words. For instance, ‘beautiful’ is identified as a positive word and ‘ugly’ is identified as a negative word. The overall sentiment of a text is determined by the sentiments of a group of words and expressions appearing in the text.
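By way of illustration only, the lexicon look up approach described above might be sketched as follows. The word lists, the function name `lexicon_sentiment`, and the simple count-difference scoring rule are hypothetical and chosen for brevity; an actual lexicon would contain far more entries and might weight or combine them differently.

```python
import re

# Hypothetical toy lexicon; a real sentiment lexicon would be far larger.
POSITIVE = {"beautiful", "great", "excellent"}
NEGATIVE = {"ugly", "poor", "terrible"}


def lexicon_sentiment(text):
    """Classify text by counting positive and negative lexicon hits."""
    # Lowercase the text and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Assumed scoring rule: positive hits minus negative hits.
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("The screen is beautiful"))
print(lexicon_sentiment("An ugly, poor design"))
```

A phrase such as ‘long battery life’ contains no word from this toy lexicon, so the sketch returns a neutral result for it.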
A comprehensive sentiment lexicon can provide a simple yet effective solution to sentiment analysis, because it is general and does not require prior training. Considerable attention and effort have therefore been devoted to the construction of such lexicons. However, a significant challenge to this approach is that the polarity of many words is domain and context dependent. For example, ‘long’ is positive in ‘long battery life’ and negative in ‘long shutter lag.’ Current sentiment lexicons do not capture such domain and context sensitivities of sentiment expressions. They either exclude such domain and context dependent sentiment expressions or tag them with an overall polarity tendency based on statistics gathered from a particular corpus, such as the World Wide Web. While excluding such expressions leads to poor coverage, simply tagging them with a polarity tendency leads to poor precision.
Because of these limitations, machine learning approaches have been gaining popularity in the area of sentiment analysis. A machine learning approach, such as one using a Support Vector Machine (SVM), does not rely on a sentiment lexicon to determine the polarity of words and expressions, and can automatically learn some of the context dependencies present in the training data. For example, if ‘long battery life’ and ‘long shutter lag’ are labeled as positive and negative respectively in the training data, a learning algorithm can learn that ‘long’ is positive when it is associated with the phrase ‘battery life’ whereas it is negative when associated with the phrase ‘shutter lag’.
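The following is a minimal sketch of how a learning algorithm can pick up such a context dependency from labeled examples. For self-containment it substitutes a simple perceptron for the SVM mentioned above, and it assumes a feature set of single words plus adjacent word pairs. The function names, the feature design, and the two-example training set are illustrative assumptions, not a description of any particular system.

```python
from collections import defaultdict


def features(text):
    """Return single words plus adjacent word pairs (assumed feature set)."""
    tokens = text.lower().split()
    return tokens + [a + " " + b for a, b in zip(tokens, tokens[1:])]


def train_perceptron(examples, epochs=10):
    """Train a perceptron (swapped in for an SVM); labels are +1 or -1."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            feats = features(text)
            score = sum(weights[f] for f in feats)
            # Update only when the example is misclassified or undecided.
            if label * score <= 0:
                for f in feats:
                    weights[f] += label
    return weights


def predict(weights, text):
    """Classify text; a zero score defaults to 'negative' in this sketch."""
    score = sum(weights.get(f, 0.0) for f in features(text))
    return "positive" if score > 0 else "negative"


# Hypothetical training data taken from the example phrases in the text.
TRAIN = [("long battery life", 1), ("long shutter lag", -1)]
weights = train_perceptron(TRAIN)

print(predict(weights, "long battery life"))
print(predict(weights, "long shutter lag"))
```

In this sketch the learned weight for ‘long’ by itself ends at zero, because the word appears once in each class. The polarity is carried instead by the surrounding words and word pairs such as ‘battery life’ and ‘shutter lag’.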
However, the success of such an approach relies heavily on the training data. For the task of sentiment analysis, data scarcity is an inherent issue that cannot easily be solved due to the richness of natural language. In particular, people tend to use different expressions to express the same sentiment, and also tend not to repeat their sentiments in the same sentence or document. Consequently, it is very difficult to collect training data that adequately represents how people express sentiments towards various subject matters. This data scarcity issue has resulted in relatively low accuracy for sentiment classification compared to some other text classification tasks.
Therefore, although recent studies have shown that machine learning approaches in general outperform the lexicon look up approaches for the task of sentiment analysis, ignoring the advantages and knowledge provided by sentiment lexicons may not be optimal.
However, few studies have been devoted to combining these two approaches to improve sentiment classification. Some have explored using a general purpose sentiment dictionary to improve the identification of the contextual polarity of phrases. A few other recent studies have shown that incorporating a general purpose sentiment lexicon into machine learning algorithms can improve the accuracy of sentiment classification at the document level. In all of these works, the general purpose sentiment lexicon contains only words with context and domain independent polarities. The present sentiment classifier system and method differ from these previous approaches.