Communication through short text updates has greatly increased with the rise in popularity of social networking, which allows users to interact through online communities. Short text updates used in social networking, such as “posts” or “tweets,” differ from standard documents, including papers, publications, and reports. For example, short text updates are generally limited to a particular size measured in characters or words. Additionally, short text updates are usually unstructured text that includes shortcuts, such as abbreviations and acronyms, to comply with the size restrictions.
Generally, each user has a social networking profile that includes a live stream of short text updates posted by and to the user. The short text updates received in a stream can rapidly accumulate such that identifying interesting and important updates becomes difficult. Currently, users have little control over the short text updates that they are able to view. Many users resort to temporal sampling, in which the user views only the short text updates that are displayed during the time the user is logged in. However, the sampling process is unreliable, and important updates are often missed.
Filtering of the short text updates by topic can assist in reducing the number of updates a user must review. However, due to the differences between short text updates and larger documents, conventional methods for identifying topics for a short text update are inadequate. For example, traditional techniques for identifying topics include word repetition detection and co-occurrence matrix techniques, such as Latent Semantic Analysis. Word repetition detection techniques, such as term frequency-inverse document frequency (“tf-idf”), generally assume that the frequency or popularity of a term models the importance of that term. For example, the importance of a term increases the more times the term is identified in a document. However, in short text updates, terms are usually not repeated, in order to conserve space, and the topic of the short text update may not be included in the text at all. Further, the traditional techniques require large numbers of documents to find statistical patterns, which makes identifying topics for a single short text update or document impractical.
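The limitation of frequency-based weighting can be illustrated with a minimal sketch. The corpus, the `tf_idf` helper, and the whitespace tokenization below are hypothetical and chosen only for illustration; the weighting shown is one common tf-idf variant (raw term count multiplied by the logarithm of the inverse document frequency), not a method taken from any particular system.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical corpus: one longer document and one short text update.
docs = [
    "the election results show the election turnout rose as election day ended",
    "omg cant believe who won lol",
]
weights = tf_idf(docs)

# In the longer document, repetition singles out a dominant term.
top_term = max(weights[0], key=weights[0].get)

# In the short update, no term repeats, so all weights are identical.
distinct_short_weights = {round(w, 9) for w in weights[1].values()}
```

In this sketch the repeated term stands out in the longer document, while every term of the short update receives the same weight, and the apparent topic of the update never appears in its text.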
Analyses of short text updates have been performed, such as determining a similarity of short text snippets by Sahami and Heilman, “A Web-Based Kernel Function for Measuring the Similarity of Short Text Snippets,” In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23-26, 2006). The text of each short text snippet forms a query that is provided to a search engine for identifying documents. A context vector is generated for the short text snippet using terms from the identified documents. The similarity of two or more short text snippets is determined by comparing the context vectors for each of the short text snippets. However, identifying topics for the short text snippets via a majority voting process is not provided.
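The context-vector comparison described above can be sketched roughly as follows. This is a simplified illustration, not the published kernel: the search engine is replaced by a hypothetical in-memory lookup (`FAKE_INDEX` and `search`), the context vector uses raw term counts rather than the weighting of the cited work, and the comparison is assumed to be a cosine similarity.

```python
import math
from collections import Counter

# Hypothetical stand-in for a search engine: query -> retrieved document text.
FAKE_INDEX = {
    "ai": ["artificial intelligence research machine learning"],
    "machine learning": ["machine learning is a branch of artificial intelligence"],
    "pizza": ["pizza dough recipe with tomato and cheese"],
}

def search(query):
    """Return the documents retrieved for a snippet (assumed lookup, not a real API)."""
    return FAKE_INDEX.get(query.lower(), [])

def context_vector(snippet):
    """Build a unit-length term vector from the documents retrieved for a snippet."""
    counts = Counter(
        term for doc in search(snippet) for term in doc.lower().split()
    )
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def similarity(snippet_a, snippet_b):
    """Cosine similarity of the two context vectors (dot product of unit vectors)."""
    va, vb = context_vector(snippet_a), context_vector(snippet_b)
    return sum(w * vb.get(t, 0.0) for t, w in va.items())

related = similarity("AI", "machine learning")
unrelated = similarity("AI", "pizza")
```

In this sketch, “AI” and “machine learning” share no literal terms, yet their context vectors overlap, so `related` exceeds `unrelated`. The result is a similarity score only; no topic label is assigned to either snippet.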
Thus, a system and method for accurately identifying topics for one or more short text communications are needed.