With the proliferation of content on computer networks, it is increasingly useful to have a variety of ways of understanding and organizing content. Content is commonly understood and organized by topic, author, relevance, popularity, date, and so on. There is also increasing interest in automated tools that attempt to discern the attitude or sentiment of the author toward the subject of a document, such as whether the attitude is positive, negative, or neutral, and how strong it is. For example, one might want to locate strongly positive reviews of a movie or travel destination.
There are several techniques for processing documents to determine whether the sentiments expressed in a document are positive or negative. In general, the techniques use documents with associated sentiment judgments as training data, and learn from those documents to associate words and phrases with a sentiment magnitude and polarity. Phrases are then identified in a new document, and the document is scored based on the sentiment magnitudes and polarities of the phrases found in it. There are a variety of computational techniques to achieve these results. For example, see Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan, “Thumbs up? Sentiment Classification using Machine Learning Techniques,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79-86, 2002, and subsequent work. These techniques are commonly used for scoring an entire document, although they can be extended to scoring sentences within a document by treating each sentence as if it were a distinct document.
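The general flow described above (learn word-level sentiment from labeled documents, then score a document or its individual sentences) can be sketched as follows. This is a minimal illustration only, not the method of the cited work: it assumes single words rather than phrases, a smoothed log-odds ratio as the signed score, and a tiny made-up training set. The names `learn_word_scores`, `score_document`, `score_sentences`, and `TRAIN` are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def learn_word_scores(labeled_docs, smoothing=1.0):
    """Learn a signed score per word from (text, label) pairs.

    label is +1 (positive) or -1 (negative). The score is a smoothed
    log-odds ratio (an assumption made for this sketch): its sign is the
    polarity and its absolute value is the magnitude.
    """
    pos, neg = Counter(), Counter()
    for text, label in labeled_docs:
        (pos if label > 0 else neg).update(tokenize(text))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + smoothing * len(vocab)
    neg_total = sum(neg.values()) + smoothing * len(vocab)
    return {
        w: math.log((pos[w] + smoothing) / pos_total)
        - math.log((neg[w] + smoothing) / neg_total)
        for w in vocab
    }

def score_document(text, word_scores):
    """Sum the learned scores of the known words in the text."""
    return sum(word_scores.get(tok, 0.0) for tok in tokenize(text))

def score_sentences(text, word_scores):
    """Score each sentence as if it were a distinct document."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(s, score_document(s, word_scores)) for s in sentences]

# Hypothetical training data, invented for illustration only.
TRAIN = [
    ("a wonderful, moving film", +1),
    ("great acting and a wonderful script", +1),
    ("a dull, boring plot", -1),
    ("terrible acting and a boring script", -1),
]

word_scores = learn_word_scores(TRAIN)
per_sentence = score_sentences(
    "The acting was wonderful. The plot was boring.", word_scores
)
```

With this toy data, the first sentence receives a positive score and the second a negative one, even though both belong to the same document.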
There are also several techniques for processing documents to find names of different kinds of individual entities (most commonly personal names, geographical names, and organization names) in a document. In general, the techniques involve either looking for occurrences of names from a list within a document, or using a statistical model, trained on an annotated corpus, to find a set of contexts and features that predict where the names of entities are located in the document. Each entity found in the document can be associated with a label from the set of labels used in the annotated training corpus. There are a variety of computational techniques for identifying entities in documents. For example, see McCallum, Andrew and Wei Li, “Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons,” in Proc. Conference on Computational Natural Language Learning, 2003, and subsequent work for further information about the statistical approach to learning to identify named entities.
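The simpler of the two approaches, looking up names from a list, might be sketched as below. The gazetteer contents, the label names, and the function `find_entities` are hypothetical and chosen only for illustration. The statistical approach is not shown, since it requires a trained model.

```python
import re

# Hypothetical name list (gazetteer): surface form -> entity label.
GAZETTEER = {
    "Jane Doe": "PERSON",
    "New York": "LOCATION",
    "New York Times": "ORGANIZATION",
}

def find_entities(text, gazetteer):
    """Return (start, end, name, label) for each listed name found in text.

    Longer names are tried first, so "New York Times" is not reported as
    the shorter "New York". Matches do not overlap.
    """
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
    )
    return [
        (m.start(), m.end(), m.group(0), gazetteer[m.group(0)])
        for m in pattern.finditer(text)
    ]

entities = find_entities(
    "Jane Doe moved to New York to write for the New York Times.", GAZETTEER
)
```

Exact string matching of this kind cannot find names missing from the list or resolve ambiguous names, which is one motivation for the statistical approach.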
A newer problem in document analysis involves assigning sentiment values (polarity and magnitude) to entities identified in a document. The problem with most techniques is that sentiment polarity is assigned to an entire document or sentence, whereas not all entities in a document or sentence necessarily share the sentiment polarity of the document or sentence as a whole.
One attempt to address this problem is a graph-based approach that determines a sentiment for an entity from the sentiment polarities and magnitudes of the phrases in the document that are related to that entity. See Wilson, Theresa, “Fine-Grained Subjectivity Analysis,” PhD Dissertation, Intelligent Systems Program, University of Pittsburgh, 2008.