The rapid growth of the social web has created rich data sources for predictive analytics. The enormous volume and diversity of information propagating among large user communities on micro-blogging platforms, Twitter in particular, together with the emergence of social media data aggregation service providers such as GNIP, Topsy, and StockTwits, creates new opportunities to leverage the information content embedded in social media. In addition to providing huge volumes of minable data for diverse commercial applications, social media sources enable real-time detection, surveillance, and estimation of social media signatures for events and entities, thus extending the use of these data beyond estimating the social media “sentiment” expressed for an entity.
The current practice of social media analytics has focused strongly on techniques to estimate the sentiment component of an entity's signature. These techniques employ methods from the discipline of Natural Language Processing (NLP), in varying degrees of complexity, to produce coarse-grained sentiment estimates expressed as “Positive”, “Negative”, or “Neutral” for an entity. One disadvantage of coarse-grained estimates is that they are highly sensitive to the thresholds selected to determine which of the three states is assigned to the entity. Further, such discrete estimates are not well suited to time series normalization techniques, which allow the detection of changes from an entity's normal sentiment level or the comparison of one entity's sentiment level to another entity's level on a common measurement basis.
The level of social media activity is also an important component of an entity's social media signature. Current analytic techniques estimate an entity's level or intensity of social media activity by counting the total number of “mentions” of the entity present in micro-blogging data streams observed during a time interval, or by converting that total to an average value over the interval. However, an activity metric driven solely by the total count does not readily show significant deviations from the entity's normal level of social media activity, nor does such a representation allow for comparison of one entity's activity level to another entity's level on a common basis. An example from the domain of stock trading is Apple Inc., stock symbol AAPL, which is consistently the most frequently mentioned stock on Twitter. A comparison of the total number of tweets observed during a day for AAPL to the total number observed for a stock with less activity, such as Caterpillar Inc. (CAT), is not useful because tweet volumes for AAPL will always dominate those for CAT over any observation interval.
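The two limitations described above can be illustrated with a minimal Python sketch. It is not a method taken from this document: the threshold values, the sentiment score, and the daily tweet counts are all hypothetical numbers invented for illustration, and the standard score (z-score) is assumed here only as one familiar example of a time series normalization technique.

```python
from statistics import mean, stdev


def label_sentiment(score, threshold):
    """Map a continuous score in [-1, 1] to a coarse three-state label.

    The symmetric threshold is an assumption made for this sketch.
    """
    if score > threshold:
        return "Positive"
    if score < -threshold:
        return "Negative"
    return "Neutral"


def z_score(history, today):
    """Standard score of today's count against the entity's own history."""
    return (today - mean(history)) / stdev(history)


# Limitation 1: the same score receives different labels under
# two equally plausible (hypothetical) thresholds.
score = 0.12
label_loose = label_sentiment(score, threshold=0.10)
label_strict = label_sentiment(score, threshold=0.15)

# Limitation 2: raw counts cannot be compared across entities.
# Hypothetical daily tweet counts, invented for illustration only.
aapl_history = [9800, 10100, 10050, 9900, 10150]
cat_history = [180, 210, 200, 190, 220]
aapl_today = 10300
cat_today = 400

z_aapl = z_score(aapl_history, aapl_today)
z_cat = z_score(cat_history, cat_today)

print(f"labels: {label_loose} vs {label_strict}")
print(f"raw counts: AAPL={aapl_today}, CAT={cat_today}")
print(f"z-scores:   AAPL={z_aapl:.2f}, CAT={z_cat:.2f}")
```

In this invented data, AAPL dominates on raw counts, yet CAT's count is the larger departure from its own typical level once each series is scaled by its own mean and standard deviation. That is the sense in which a common measurement basis makes two entities comparable.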