Social streams have become ubiquitous in recent years because of a wide variety of applications in social networks. This has resulted in an almost continuous creation of massive streams of data. For example, in some social networks, users communicate with one another with the use of text messages. This results in massive volumes of text streams. The text messages may be reflective of user interests. Thus, the particular social network may be able to leverage these text streams for a variety of mining and search purposes.
In chat and email networks, users send messages to one another. This too creates large streams of data. Some social networks have a very large number of users who may communicate with one another. As a result, the volume of text streams across a social network can be extremely large.
Many media-sharing sites contain the ability for users to make comments about the media content. Such data can also be considered social streams.
There are well-known ways to collect the entire social stream (or a sample thereof) traveling across a given social network.
Typically, the social stream may experience concept drift, in which the key patterns in the underlying stream may change over time. This means that the training models may become outdated over time.
Because training models trained on earlier portions of the stream may no longer reflect its current patterns, they may need to be constantly updated (or updated at a relatively high frequency) in order to ensure accurate results for classification.
In some instances, additional information about the social context of the underlying social stream is available. Social information, also known as linkage information, may provide an understanding of how different social actors in a social network are related to one another. Thus, the presence of linkage information may provide a considerable amount of feedback to the classifier based on an understanding of how the different social actors are related to different classes within the classifier.
Since the nature of the social actors may be closely related to a label used for classification, this information may help in the classification process. At the same time, it also creates an additional challenge for the classification process, because the linkage information needs to be used effectively for classification.
Another challenge with social streams is that they are typically very noisy, and often contain many incorrectly labeled instances, thereby making a classification based on such a label inaccurate. Hash tags, for example, are generally used to label groups and topics in a social network. For instance, hash tags may be used to mark individual messages as relevant to a particular user or group of users, and to mark individual messages/documents as belonging to a particular topic. However, as is well known, hash tags can contain many incorrectly labeled instances.
For example, in Twitter™, hash tags may be used to label some of the documents for a particular topic, but this information is often quite noisy. A user may incorporate a hash tag for a message (e.g., a tweet in Twitter™), for which the content may not necessarily belong to the particular topic. At the same time, a tweet which does not contain a specific hash tag may also belong to a relevant topic.
As a result of the aforementioned noisy classification problems, it is typically very challenging to relate a class behavior of a test instance (e.g., an unlabeled or unclassified instance) of a social stream to the content of the social stream. When combined with the fact that social streams need to be classified very quickly with incremental and online methods, this creates a very challenging scenario for the classification process.
The problem of text stream classification arises in the context of a number of different information retrieval (IR) tasks, such as news filtering and email spam filtering. Text streams have been widely studied in the context of both clustering and classification. The problem of classification of data streams has also been widely studied by the data mining community in the context of different kinds of data.
One conventional method for classifying text streams in which the classification model may evolve over time uses a temporal weighting factor. For example, a temporal weighting factor may be introduced in order to modify the classification algorithms.
Specifically, this approach has been applied to the Naive Bayes, Rocchio, and K-nearest neighbor classification algorithms. It has been shown that, if the underlying data is evolving over time, then the incorporation of temporal weighting factors is useful in improving the classification accuracy. However, these classification algorithms are ineffective for the classification of social streams, because they are not designed for social stream classification with the use of contextual information.
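By way of illustration only, the temporal weighting idea described above may be sketched as a Naive Bayes classifier whose counts fade over time. The exponential decay form, the `half_life` parameter, the class name `DecayedNaiveBayes`, and the toy stream are all assumptions made for this sketch; the conventional methods referred to above may use a different weighting function.

```python
import math
from collections import defaultdict


class DecayedNaiveBayes:
    """Multinomial Naive Bayes with exponentially faded counts (illustrative)."""

    def __init__(self, half_life=100.0, alpha=1.0):
        # Assumption: a count loses half of its weight every `half_life` time units.
        self.decay = math.log(2.0) / half_life
        self.alpha = alpha  # Laplace smoothing constant
        self.class_counts = defaultdict(float)
        self.word_counts = defaultdict(lambda: defaultdict(float))
        self.vocab = set()
        self.last_time = 0.0

    def _fade(self, now):
        # Apply the temporal weighting factor to every stored count.
        factor = math.exp(-self.decay * (now - self.last_time))
        for label in self.class_counts:
            self.class_counts[label] *= factor
            for word in self.word_counts[label]:
                self.word_counts[label][word] *= factor
        self.last_time = now

    def learn(self, tokens, label, now):
        self._fade(now)
        self.class_counts[label] += 1.0
        for word in tokens:
            self.word_counts[label][word] += 1.0
            self.vocab.add(word)

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            score = math.log(count / total)
            for word in tokens:
                seen = self.word_counts[label].get(word, 0.0)
                score += math.log((seen + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def run_stream(half_life):
    # Toy drifting stream: "apple" is a tech term early on and a food term later.
    model = DecayedNaiveBayes(half_life=half_life)
    for t in range(20):
        model.learn(["apple", "launch"], "tech", now=float(t))
    for t in range(20, 30):
        model.learn(["apple", "pie"], "food", now=float(t))
    return model.predict(["apple"])
```

In this toy stream, a short half-life lets the more recent "food" usage dominate, whereas an effectively unweighted model (a very large half-life) continues to favor the older and more numerous "tech" examples.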
Another conventional method includes one-class classification of text streams, in which only training data for the positive class is available, but there is no training data available for the negative class. This is quite common in many real applications in which it is easy to find representative documents for a particular topic, but it is hard to find representative documents with which to model the background collection. This conventional method works by designing an ensemble of classifiers in which some of the classifiers correspond to a recent model, whereas others correspond to a long-term model.
A number of neural network methods have also been adapted to the stream scenario. In these methods, the classifier starts off by setting all the weights in the neural network to the same value. Each incoming training example is classified with the neural network. In the event that the result of the classification process is correct, the weights are not modified. On the other hand, if the classification is incorrect, then the weights for the terms are either increased or decreased depending upon which class the training example belongs to.
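By way of illustration only, the mistake-driven update described above may be sketched as a single-layer, two-class classifier over terms. The additive update, the `step` size, the zero `threshold`, and the class name `MistakeDrivenClassifier` are assumptions made for this sketch; the conventional methods referred to above may instead use, for example, multiplicative updates.

```python
from collections import defaultdict


class MistakeDrivenClassifier:
    """Online two-class text classifier that updates only on errors (illustrative)."""

    def __init__(self, init_weight=0.0, step=1.0, threshold=0.0):
        # Every term starts with the same weight, as in the description above.
        self.weights = defaultdict(lambda: init_weight)
        self.step = step
        self.threshold = threshold

    def score(self, tokens):
        # Sum the weights of the distinct terms present in the document.
        return sum(self.weights[term] for term in set(tokens))

    def predict(self, tokens):
        return self.score(tokens) > self.threshold

    def learn(self, tokens, is_positive):
        """Process one training example; return True if an update was made."""
        if self.predict(tokens) == is_positive:
            return False  # correct classification: weights are left unchanged
        # Incorrect classification: raise term weights for a positive example,
        # lower them for a negative example.
        delta = self.step if is_positive else -self.step
        for term in set(tokens):
            self.weights[term] += delta
        return True


def train_on_stream(stream):
    # Feed (tokens, is_positive) pairs one at a time and record which caused updates.
    clf = MistakeDrivenClassifier()
    updates = [clf.learn(tokens, label) for tokens, label in stream]
    return clf, updates
```

Because the weights change only when an example is misclassified, the cost of processing a stream that the model already classifies correctly is limited to the scoring step.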
A Bayesian method for classification of text streams constructs a Bayesian model of the text which can be used for online classification. The key components of this approach are the design of a Bayesian online perceptron and a Bayesian online Gaussian process, which can be used effectively for online learning. However, none of these methods are designed for social stream classification with the use of contextual information.
Accordingly, the present inventor has recognized that there is a need for a system, method and computer program product for the classification of social streams with the use of contextual information.