One of the most important and most difficult tasks in marketing is to ascertain, as accurately as possible, how consumers view various products. A simple example illustrates the problem to be solved. As the new marketing manager for BrightScreen, a supplier of LCD screens for personal digital assistants (PDAs), you would like to understand what positive and negative impressions the public holds about your product. Your predecessor left you 300,000 customer service emails sent to BrightScreen last year that address not only screens for PDAs, but the entire BrightScreen product line. Instead of trying to manually sift through these emails to understand the public sentiment, can text analysis techniques help you quickly determine what aspects of your product line are viewed favorably or unfavorably?
One way to address BrightScreen's business need would be a text mining toolkit that automatically identifies just those email fragments that are topical to LCD screens and also express positive or negative sentiment. These fragments will contain the most salient representation of the consumers' likes and dislikes specifically with regard to the product at hand. The goal of the present invention is to reliably extract polar sentences about a specific topic from a corpus of data containing both relevant and irrelevant text.
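By way of illustration only, the task described above can be pictured as a filter that keeps a sentence only when it is both topical and polar. The following minimal sketch assumes hypothetical keyword lists (`TOPIC_TERMS`, `POSITIVE_TERMS`, `NEGATIVE_TERMS`) and a naive punctuation-based sentence splitter. It is not the method of the invention and is shown only to make the inputs and outputs of the problem concrete.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
TOPIC_TERMS = {"lcd", "screen", "resolution", "display"}
POSITIVE_TERMS = {"excellent", "love", "great"}
NEGATIVE_TERMS = {"hate", "terrible", "poor"}


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Return the set of lowercase alphanumeric tokens in a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def polar_topical_sentences(text):
    """Keep sentences that mention the topic AND contain a polar term.

    Returns a list of (sentence, polarity) pairs. If a sentence contains
    both positive and negative terms, this sketch labels it positive.
    """
    results = []
    for sentence in split_sentences(text):
        words = tokens(sentence)
        if not words & TOPIC_TERMS:
            continue  # off-topic: discard regardless of sentiment
        if words & POSITIVE_TERMS:
            polarity = "positive"
        elif words & NEGATIVE_TERMS:
            polarity = "negative"
        else:
            continue  # topical but factual: discard
        results.append((sentence, polarity))
    return results


email = (
    "I hate its resolution. The BrightScreen LCD is excellent. "
    "The BrightScreen LCD has a resolution of 320x200. "
    "Your keyboard cover is excellent."
)
for sentence, polarity in polar_topical_sentences(email):
    print(polarity, "|", sentence)
```

In this sketch the third sentence is dropped because it is topical but not polar, and the fourth because it is polar but not topical. Fixed keyword matching of this kind is brittle, which is part of why reliably extracting polar sentences about a specific topic is difficult.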
Recent advances in the fields of text mining, information extraction, and information retrieval have been motivated by a similar goal: to exploit the hidden value locked in huge volumes of unstructured data. Much of this work has focused on categorizing documents into a predefined topic hierarchy, finding named entities (entity extraction), clustering similar documents, and inferring relationships between extracted entities and metadata.
An emerging field of research with much perceived benefit, particularly to certain corporate functions such as brand management and marketing, is that of sentiment or polarity detection. For example, sentences such as "I hate its resolution" or "The BrightScreen LCD is excellent" indicate authorial opinions about the BrightScreen LCD. Sentences such as "The BrightScreen LCD has a resolution of 320×200" indicate factual objectivity. To effectively evaluate the public's impression of a product, it is much more efficient to focus on the small minority of sentences containing subjective language.
Recently, several researchers have addressed techniques for analyzing a document and discovering the presence or location of sentiment or polarity within the document. J. Wiebe, T. Wilson, and M. Bell, “Identifying collocations for recognizing opinions,” in Proceedings of ACL/EACL '01 Workshop on Collocation, (Toulouse, France), July 2001, discover subjective language by performing a fine-grained NLP-based textual analysis. B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification using machine learning techniques,” in Proceedings of EMNLP 2002, 2002, use a machine learning classification-based approach to determine whether a movie review as a whole is generally positive or negative about the movie.
This prior art makes significant advances into this novel area. However, it does not consider the relationship between polar language and topicality. In taking a whole-document approach, Pang et al. sidestep any issues of topicality by assuming that each document addresses a single topic (a movie), and that the preponderance of the expressed sentiment is about the topic. In the domain of movie reviews this may be a good assumption (though it is not tested), but this assumption does not generalize to less constrained domains. (It is noted that the data used in that paper contained a number of reviews about more than one movie. In addition, the domain of movie reviews is one of the more challenging for sentiment detection as the topic matter is often of an emotional character; e.g., there are bad characters that make a movie enjoyable.) Wiebe et al.'s approach does a good job of capturing the local context of a single expression, but with such a small context, the subject of the polar expression is typically captured by just a few base noun words, which are often too vague to identify the topic in question.