Over recent years there has been rapid growth of on-line discussion groups and review sites on the World Wide Web (WWW). Postings to consumer-oriented forums consist largely of opinion. Opinions authored by individuals, groups or organizations about various topics are a valuable resource for companies investigating market reaction to their own or a rival company's products.
In this context, two types of information can be valuable as market information: statistics on how much of the “talk” on the Web contains positive, negative or neutral sentiments towards a particular product, and the exact phrases used to express such sentiments. Consider a hypothetical example for “Car—Model DE”. The relevant statistical information may consist of statements such as: 40% of opinions on this car are positive, 20% are negative, while the remaining 40% are neutral. As an example, positive expressions may be “an economical car” and “smooth drive”, negative expressions may be “poor performer on freeways” and “glitchy gear box”, and neutral expressions may be “German car” and “compact car”.
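By way of illustration only, the statistical information described above can be sketched as a simple tally over opinions that have already been labelled. The function name `sentiment_shares` and the sample list are hypothetical and are assumed here purely for the example; how the labels are obtained is outside the scope of this sketch.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral")

def sentiment_shares(labelled_opinions):
    """Return the percentage of opinions carrying each sentiment label.

    `labelled_opinions` is a list of (expression, label) pairs; this input
    format is an assumption made for the sketch.
    """
    counts = Counter(label for _, label in labelled_opinions)
    total = sum(counts[label] for label in LABELS)
    if total == 0:
        # No labelled opinions: report zero for every label.
        return {label: 0.0 for label in LABELS}
    return {label: 100.0 * counts[label] / total for label in LABELS}

# Hypothetical sample for "Car Model DE", chosen to give a 40/20/40 split.
sample = [
    ("an economical car", "positive"),
    ("smooth drive", "positive"),
    ("glitchy gear box", "negative"),
    ("German car", "neutral"),
    ("compact car", "neutral"),
]
shares = sentiment_shares(sample)
```

The labelled expressions themselves are retained alongside the percentages, since both the statistics and the exact phrases are of interest.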
The task of manually tracking opinions about a particular topic from all Web documents is laborious. If one seeks opinions concerning a particular product, identifying the relevant documents in which they might occur can be difficult. The task becomes still more labor-intensive if one is to extract opinions from the identified documents. Opinions may be scattered through a document, and may be expressed in subtle ways.
References Pang et al and Turney et al each describe methods to determine the overall sentiment of a given document towards a given topic of interest using classification methods, supervised in the case of Pang et al and unsupervised in the case of Turney et al. Relevant publication details are as follows. Pang, B., Lee, L. and Vaithyanathan, S. “Thumbs up? Sentiment Classification using Machine Learning Techniques”, Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79-86. Turney, P. D. “Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews”, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL '02), pages 417-424, Philadelphia, Pa.
References Pang et al and Turney et al both describe use of unigrams and bigrams extracted from documents as features for classification. Analysis of the sentiment at the document level for identifying opinions can, however, lead to loss of information. As an example, consider a review of the movie Lancelot of the Lake from an online review site. The review contains favourable comments about the movie such as “a fascinating cinematic experience boldly made by a master filmmaker”, “something rare in the modern cinema,” and “a truly personal film”. One may thus conclude that the review rates the movie as a “good” movie. Analyzing the document in this way may, however, be misleading, and may not reflect the diversity of views actually expressed therein.
To illustrate this point, the same document contains some very critical remarks about the actors, such as “non-professional actors who recite the dialogue in emotionless flat voices”, and some unfavourable remarks about the opening sequence, which “is a series of clumsy, disjointed fights amongst anonymous knights”.
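As a minimal sketch of the kind of unigram and bigram features referred to above, the following counts single words and adjacent word pairs in a passage of text. The function name `ngram_features` and the simple tokenization rule are assumptions made for illustration; they are not taken from the cited references, which may tokenize and select features differently.

```python
import re
from collections import Counter

def ngram_features(text, n_values=(1, 2)):
    """Count unigrams and bigrams (by default) occurring in `text`.

    Tokenization is a simple lowercase split on letters, digits and
    apostrophes; this is an illustrative assumption only.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    features = Counter()
    for n in n_values:
        # Slide a window of n tokens across the token sequence.
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features

# Features for one favourable phrase from the review discussed above.
features = ngram_features("a truly personal film")
```

When such features are pooled over a whole document and reduced to a single label, the favourable and critical phrases quoted above contribute to one overall result, which is how the diversity of views within the document can be lost.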
A need clearly exists for an improved, automated manner of assessing sentiment expressed in textual matter.