In recent years, web-based micro social behavior application platforms, such as microblogs, have emerged as a completely new interactive messaging environment. A blog is a regularly updated website, web page, or other messaging system in which posted messages may be viewed by subscribers and/or the public at large. A microblog differs from a traditional blog in that its content is typically much smaller, in both actual size and aggregate file size. A typical microblog entry may be a short sentence fragment about what a person is doing at the moment, or a short comment on a specific topic. A microblog service enables users to send and receive short messages, which may include text, audio, images and/or video, to and from a recipient or a group of recipients. For instance, in the case of Twitter, the short messages are text-based messages of up to 140 characters, commonly known as “tweets.” A user may act as a microblogger to issue messages related to any topic on his or her microblog, and may also act as a fan to remark on messages issued by other users on those users' microblogs.
However, effective communication of shared content depends heavily on its effective organization. This is particularly important for microblogging websites because of the diverse nature of the content shared. One possible way to organize real-time content is to classify it into topics of interest. However, such classification of microblogs poses several challenges. Posts are short (usually 140 characters or less) and often contain abbreviated terms or grammatical shortcuts, which differ from the language structure on which many supervised machine learning models and natural language processors are trained and evaluated. Effectively modeling content on microblogs therefore requires techniques that can readily adapt to the data at hand and that require little supervision.
Existing topic detection methodologies are generally based on probabilistic language models, such as Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). For example, latent variable topic models have been applied widely to problems in text modeling and require no manually constructed training data.
However, the existing models and classification techniques rely on either external linked content or explicit user profile information to generate the training models, and hence are not reliable. This is because, in the case of Twitter for example, only about 20-25% of tweets contain external links, and user profile information is often unavailable due to privacy settings. Furthermore, existing topic models typically need to know the number of topics in a document at the beginning of the training procedure used to train them. This requirement may make the topic model inflexible and make the topics difficult to determine.
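By way of illustration only, the dependence of a conventional topic model on a preset number of topics may be sketched as follows. The sketch is a toy collapsed Gibbs sampler for a Latent Dirichlet Allocation style model, written for this discussion and not taken from any existing system or from the present disclosure. The function name `lda_gibbs`, the parameter names, the hyperparameter values, and the sample posts are all assumptions introduced for the example. The point of interest is that every count table is sized by `num_topics` before any data is examined.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; num_topics must be chosen before training."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Every table below is sized by num_topics: the topic count is fixed here.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Assign each token a random initial topic and record the counts.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]

                # Resample the topic and restore the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions; each row sums to 1.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + alpha * num_topics
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return theta, topic_word


# Hypothetical short posts with the abbreviations typical of microblogs.
posts = [
    "gr8 game 2nite go team".split(),
    "team won the game".split(),
    "new phone battery dies fast".split(),
    "phone update broke battery".split(),
]
theta, topic_word = lda_gibbs(posts, num_topics=2)
```

In this sketch, changing the number of topics means calling `lda_gibbs` again with a different `num_topics` and retraining from scratch, since nothing in the procedure infers that number from the posts themselves.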
The present disclosure describes a system and method for building supervised trained models for microblog classification that address the above limitations, and the use of such models for automated microblog classification.