Sentiment analysis and emotion classification are natural language processing applications for determining, from text, the attitude or emotion of the author of the text. Sentiment analysis determines whether the author feels positive or negative towards the subject of the message. Emotion classification determines a specific emotion displayed by the author (such as “happy”, “sad”, “anger”, etc.). These methods are useful for analysing large amounts of data (such as messages posted on social networks) to determine public sentiment towards a given product or idea.
Each method classifies input text based on the confidence that it falls within a given class (e.g. “happy”, “sad”, etc.) of a predefined number of classes. The text is assigned to the class within which it is most likely to fall, according to a pre-trained machine learning classifier or a rule-based classifier. The machine learning classifier is pre-trained on training data in order to learn the patterns for classifying text. The training data is often manually labelled so that, during training, it is known which class each part of the training data falls within. Alternatively, distant supervision may be used to automatically assign training data to various classes based on conventional markers.
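By way of illustration only, the selection of the most likely class from a set of per-class confidences might be sketched as follows. The function name `classify` and the example confidence values are hypothetical and are not taken from any of the cited works; the sketch assumes the classifier has already produced one confidence score per predefined class.

```python
from typing import Dict


def classify(confidences: Dict[str, float]) -> str:
    """Return the class label with the highest confidence score."""
    if not confidences:
        raise ValueError("at least one class is required")
    # Pick the key whose associated confidence is largest.
    return max(confidences, key=confidences.get)


# Hypothetical confidences output by a pre-trained classifier.
scores = {"happy": 0.62, "sad": 0.25, "anger": 0.13}
predicted = classify(scores)
```

In this sketch the decision rule is fixed: the text is always assigned to the single highest-scoring class, with no per-class threshold that a user could adjust.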
Before text can be classified, it must first be tokenised. Tokenisation breaks up the input text into a number of tokens to be processed by the classifier. Tokens may be assigned on the word level or the character level, that is, text may be split up into individual words or characters. Generally, tokens may be phonemes, syllables, letters, words or base pairs depending on the data being classified.
Tokens can be combined into a sequence of tokens to form n-grams. The n-grams are then input into the classifier for sorting into a target class.
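As a minimal sketch of the tokenisation and n-gram steps described above, the following assumes simple whitespace splitting for word-level tokens and one token per non-space character for character-level tokens. The helper names `tokenise` and `ngrams` are hypothetical, and practical tokenisers typically handle punctuation and other cases that are ignored here.

```python
from typing import List, Tuple


def tokenise(text: str, level: str = "word") -> List[str]:
    """Split text into word-level or character-level tokens."""
    if level == "word":
        # Assumption: words are separated by whitespace.
        return text.split()
    if level == "char":
        # Assumption: whitespace characters are not treated as tokens.
        return [ch for ch in text if not ch.isspace()]
    raise ValueError("level must be 'word' or 'char'")


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Combine consecutive tokens into n-grams (sequences of n tokens)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = tokenise("not very good")
unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)
```

Here `bigrams` pairs each word with its successor, so a phrase such as "not very good" yields the pairs ("not", "very") and ("very", "good"), which preserve some local word order that unigrams alone would lose.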
Go, A.; Bhayani, R. & Huang, L. (2009), ‘Twitter Sentiment Classification using Distant Supervision’, Processing, 1-6, the entire disclosure of which is incorporated herein by reference, describes the use of various machine learning algorithms (naïve Bayes, maximum entropy and support vector machines (SVMs)) to classify Twitter™ messages via sentiment analysis. Emoticons are used as noisy labels indicating positive or negative sentiment. Input text is tokenised using a combination of unigrams and bigrams on the word level.
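The use of emoticons as noisy labels might be sketched as below. This is an illustrative assumption only and not the implementation of the cited work: the emoticon lists, the function name `noisy_label`, and the decision to discard messages containing no marker or conflicting markers are all hypothetical choices made for this sketch.

```python
from typing import Optional, Tuple

# Hypothetical marker lists; a practical system would likely use larger sets.
POSITIVE = (":)", ":-)", ":D")
NEGATIVE = (":(", ":-(")


def noisy_label(message: str) -> Optional[Tuple[str, str]]:
    """Return (cleaned_text, label) for a message, or None if it cannot be labelled."""
    has_pos = any(e in message for e in POSITIVE)
    has_neg = any(e in message for e in NEGATIVE)
    if has_pos == has_neg:
        # Either no marker is present or the markers conflict: skip the message.
        return None
    label = "positive" if has_pos else "negative"
    cleaned = message
    for emoticon in POSITIVE + NEGATIVE:
        # Remove the marker so the label is not trivially present in the text.
        cleaned = cleaned.replace(emoticon, "")
    # Collapse any whitespace left behind by the removed markers.
    return " ".join(cleaned.split()), label


example = noisy_label("great day :)")
```

The labels obtained in this way are described as noisy because an emoticon does not always reflect the sentiment of the accompanying text.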
Purver, M. & Battersby, S. (2012), ‘Experimenting with Distant Supervision for Emotion Classification’, EACL '12 Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 482-491, the entire disclosure of which is incorporated herein by reference, describes using conventional markers of emotional content within the text being classified as a surrogate for explicit labels to avoid the need to label training data manually. Twitter™ messages are tokenised using unigrams at the word level and classified using a support vector machine (SVM) into one of six emotions (happy, sad, anger, fear, surprise and disgust).
Yuan, Z. and Purver, M. (2012), ‘Predicting Emotion Labels for Chinese Microblog Texts’, Proceedings of the ECML-PKDD 2012 Workshop on Sentiment Discovery from Affective Data (SDAD 2012), 40-47, the entire disclosure of which is incorporated herein by reference, describes detecting emotion in Chinese microblog posts by assigning text to one of seven emotions via support vector machine classification. Tokenisation is performed at the character level.
It can be useful to be able to adjust the filtering based on an end user's needs, for instance, to let more data be assigned to a given category. Whilst the above methods are effective at classifying data, they operate under a predefined set of rules and thresholds. Should the user wish to adapt the classifier (for instance, to be more sensitive to a given class), then the classifier would need to be completely retrained. Accordingly, there is a need for a more customisable means of data filtering.