There are numerous applications where the ability to understand opinions expressed in documents is critical. For example, political campaigns may wish to understand public sentiment about a romantic affair by a candidate running for office, or entities may wish to assess how harshly the news media report on them. Likewise, government agencies may wish to gauge the strength of public sentiment about topics of national importance. The ability to gauge opinion on a given topic is therefore of vital interest.
Numerous techniques have been developed to analyze opinions. For example, polling agencies canvass people for their opinions. Although opinion polling provides an adequate assessment of opinions, it is an extremely expensive approach.
An alternative to polling is the study of opinions expressed in various kinds of document collections, such as the movie reviews described in Franco Salvetti, et al., “Automatic Opinion Polarity Classification of Movie Reviews”, Colorado Research in Linguistics, June 2004, Volume 17, Issue 1, Boulder: University of Colorado, pp. 1-15. Unfortunately, the opinion analysis described in Salvetti, et al. produces only a binary score, e.g., on a “recommend”/“don't recommend” scale. For the movie review genre, this scoring approach may be sufficient.
However, in many other cases such binary scoring is not adequate for opinion analysis. The strength of an opinion, rather than a “yes/no” binary score, is a more desirable outcome of an opinion analysis technique. For example, the CEO of a major corporation may wish to track the intensity of opinion expressed about the corporation in news wires during critical periods, which requires a measure of opinion strength. In many such cases, a mere “yes/no” binary score is insufficient.
Unfortunately, most of the currently available opinion analysis systems are based on a binary score approach. In addition to Salvetti, et al., referred to in the previous paragraphs, Peter D. Turney, “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews”, Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 417-424, describes an algorithm developed for classifying a document as “positive” or “negative”.
Turney's algorithm is built for reviews, such as movie reviews and car reviews, where the rankings “recommended” or “not recommended” appear to suffice. The system defines adjectives in terms of each other, finds opposites and synonyms depending on the conjunctions used when the adjectives are applied to the same noun, and uses subjective human ratings to test the accuracy of the method.
Another system, described in Bo Pang, et al., “Thumbs Up? Sentiment Classification Using Machine Learning Techniques”, Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2002, pp. 79-86, performs an opinion analysis based on classifying reviews as “positive” or “negative”, experimenting with algorithms based on Naive Bayes, maximum entropy, and support vector machines. Again, the technique proposed in Pang, et al. is directed to finding a polar relationship rather than to ranking an adjective's (or a document's) intensity across a continuum of values.
Another opinion analysis system, presented in V. Hatzivassiloglou, et al., “Predicting the Semantic Orientation of Adjectives”, Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, 1997, pp. 174-181, is based on the semantic orientation (“positive” or “negative”) of adjectives. As in Salvetti, et al., Pang, et al., and Turney, the words are scored on a two-pole scale, and the system thus lacks an intensity-based ranking of opinion-related words.
Binary scoring is thus not always sufficient, and a more comprehensive and accurate opinion analysis approach capable of ranking opinions by strength is desirable in many situations.
An attempt to assign a degree of intensity to an opinion expressed in a document on a topic of interest is presented in T. Wilson, et al., “Just How Mad Are You? Finding Strong and Weak Opinion Clauses”, Proceedings of AAAI-04. This is a learning-based method for classifying opinion that identifies and leverages a set of clues and some novel syntactic features specifically developed for this purpose. Although this system is more advanced than the binary ranking techniques, Wilson's approach lacks the ability to compute opinion intensity on a continuous numerical scale, since only four or five qualitative levels of opinion strength are assumed, e.g., “neutral”, “low”, “medium”, “high”, and “extremely high”.
Further, Wilson's approach depends heavily on the involvement of human annotators, who analyze and rank sentences manually and in detail. This annotation phase is very time-consuming and expensive.
Additionally, Wilson's system, although scoring sentences found in documents, does not explicitly describe computations of the opinion score for the entire document based on the scored sentences.
As is readily known to those skilled in the art, each of the currently available opinion analysis approaches is limited to a specific scoring technique and is not adaptable to alternative ones. These approaches thus lack the flexibility and generality desirable in the analysis of documents extracted from data sources across a wide spectrum of the media.
Therefore, a comprehensive opinion analysis system is needed which is capable of automatically computing, on a continuous numerical scale and in a time- and labor-efficient manner, the intensity of opinion expressed in a document on a particular topic, and which is flexible enough to be easily adapted to any opinion scoring technique or to “plug in” a plurality of such techniques.
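By way of illustration only, the need stated above can be sketched as a common scorer interface into which several scoring techniques are plugged and then combined into one continuous score. Every name in this sketch (`Scorer`, `TOY_LEXICON`, `lexicon_scorer`, `binary_scorer`, `combined_intensity`), the word weights, the 0.5 threshold, and the choice of a plain average are assumptions made for the example and are not taken from any of the cited systems.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a scorer maps (document, topic) to an intensity in [0.0, 1.0].
Scorer = Callable[[str, str], float]

# Hypothetical toy lexicon; the words and weights are invented for illustration.
TOY_LEXICON: Dict[str, float] = {
    "fine": 0.2,
    "good": 0.4,
    "terrible": 0.9,
    "outrageous": 1.0,
}


def lexicon_scorer(document: str, topic: str) -> float:
    """Average lexicon weight of opinion words in sentences mentioning the topic."""
    weights: List[float] = []
    for sentence in document.lower().split("."):
        if topic.lower() in sentence:
            weights.extend(TOY_LEXICON[w] for w in sentence.split() if w in TOY_LEXICON)
    return sum(weights) / len(weights) if weights else 0.0


def binary_scorer(document: str, topic: str) -> float:
    """A binary technique behind the same interface: returns only 0.0 or 1.0."""
    return 1.0 if lexicon_scorer(document, topic) >= 0.5 else 0.0


def combined_intensity(document: str, topic: str, scorers: List[Scorer]) -> float:
    """Combine any number of plugged-in scorers; a plain average is assumed here."""
    if not scorers:
        raise ValueError("at least one scorer is required")
    return sum(scorer(document, topic) for scorer in scorers) / len(scorers)


doc = "The merger is terrible. The merger is good. The weather is fine."
score = combined_intensity(doc, "merger", [lexicon_scorer, binary_scorer])
```

In this sketch, adding or replacing a scoring technique means passing a different function in the list; the combining step does not change.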
Each of the prior opinion analysis techniques is capable of analyzing opinion in only one language at a time. Such an approach is deficient because it lacks the flexibility to operate efficiently in a multi-lingual domain. It is important for an opinion analysis system to be able to deal simultaneously with opinions expressed in multiple languages. For example, multinational corporations sell products in numerous countries around the world, each of which has newspapers, blogs, and other sources published in its national language(s). Such a multinational corporation would be interested in obtaining a competent and sufficient opinion analysis of its products. If an opinion analysis system has to be reengineered for each and every language, the cost and time involved in developing a new opinion analysis system for each new language may be unacceptably high.
It is therefore extremely desirable to have an opinion analysis system capable of simultaneous operation in multi-lingual domains without reengineering the system for each language.