Technical Field
The present invention relates generally to information processing and, in particular, to dictionary-based social media stream filtering.
Description of the Related Art
Dictionary-based linguistic analysis has become commercially important. The best-known application of such analysis is usually described as “sentiment analysis”, in which a large body of social media is compared against two dictionaries: (1) a dictionary of positive words; and (2) a dictionary of negative words. Each posting in the social media receives a positive score and a negative score based on word matches to these two dictionaries. A posting whose positive score exceeds its negative score is considered to indicate positive sentiment, and a posting whose negative score exceeds its positive score is considered to indicate negative sentiment. When the social media stream is carefully targeted (e.g., restricted to postings that mention a particular brand), the aggregate positive or negative score can be valuable as commercial intelligence.
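By way of illustration only, the two-dictionary scoring described above can be sketched in Python as follows. The word lists, the function names, the raw word-count scoring, and the “neutral” label for ties are assumptions introduced for this sketch and are not taken from any particular prior art system.

```python
import re

# Hypothetical toy dictionaries; real sentiment lexicons are far larger.
POSITIVE_WORDS = {"good", "great", "love", "excellent", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "awful", "broken"}


def tokenize(posting):
    """Lowercase a posting and split it into word tokens."""
    return re.findall(r"[a-z']+", posting.lower())


def score_posting(posting):
    """Return (positive_score, negative_score) as raw word-match counts."""
    tokens = tokenize(posting)
    positive = sum(1 for t in tokens if t in POSITIVE_WORDS)
    negative = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    return positive, negative


def classify_posting(posting):
    """Label a posting by whichever dictionary matched more words."""
    positive, negative = score_posting(posting)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"  # assumption: ties are reported as neutral


def aggregate_sentiment(postings, brand):
    """Sum (positive - negative) over postings that mention the brand."""
    total = 0
    for posting in postings:
        if brand.lower() in posting.lower():
            positive, negative = score_posting(posting)
            total += positive - negative
    return total
```

In this sketch, restricting the aggregate to postings that contain the brand name stands in for the targeting of the social media stream; a practical system would likely use more careful matching than a substring test.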
In prior art approaches to using such dictionaries, researchers typically assemble multiple dictionaries and score each social media posting using all of them. The next step is to collect some form of “ground truth” for a particular concept of interest. As used herein, “ground truth” refers to an estimate, taken as accurate, of the true classification or attribute, which can then be predicted on the basis of other attributes. A sophisticated analysis is then conducted, using multiple regression methods or machine learning, to reduce the wide diversity of dictionaries to a small number of crucial concepts that are the best predictors of the “ground truth” data. Thus, existing approaches to applying dictionaries to social media are highly quantitative and are the domain of experts.
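By way of illustration only, the reduction of many dictionaries to a few strong predictors can be sketched in Python as follows. This sketch substitutes a much simpler technique, ranking each dictionary by the absolute Pearson correlation between its per-posting scores and the ground truth, for the multiple regression or machine-learning analysis described above. The function names, the `keep` parameter, and the data layout are assumptions introduced for this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no predictive signal
    return cov / math.sqrt(var_x * var_y)


def rank_dictionaries(scores_by_dictionary, ground_truth, keep=2):
    """Return the names of the `keep` dictionaries whose per-posting
    scores correlate most strongly (in absolute value) with ground truth.

    scores_by_dictionary maps a dictionary name to a list of scores,
    one per posting, aligned with the ground_truth list.
    """
    ranked = sorted(
        scores_by_dictionary,
        key=lambda name: abs(pearson(scores_by_dictionary[name], ground_truth)),
        reverse=True,
    )
    return ranked[:keep]
```

Unlike multiple regression, this per-dictionary ranking ignores redundancy between dictionaries, so it is only a rough stand-in for the expert analysis that the prior art approaches require.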