Event summarization using social media data streams is a challenging task that has not been fully studied. Existing work on automatic text summarization often focuses on news articles, driven by the annual evaluations of DUC (Document Understanding Conference) and TAC (Text Analysis Conference). However, news articles represent a text genre that is drastically different from social media text. News articles are typically produced by professional writers, with well-polished sentences and grammatical structures. When sentences are extracted from such documents and concatenated to form a summary, the resulting text is often of good quality since the sentences are mostly self-explanatory. Social networking services such as Twitter, by contrast, enable users to post short messages. Observers of an event often use these services to post short messages about the event. An event with a large number of observers can generate a large number of messages that include a great deal of useful information about the event. However, the messages are produced by a wide range of observers with different backgrounds. The messages are typically short and notoriously noisy, containing a wide variety of non-standard spellings, abbreviations, acronyms, spelling errors, and the like. When individual messages are taken out of their conversational threads to form an event summary, interpreting the meaning of each message from different observers in the greater context of the event is difficult.
Compared to a static collection of news articles, the messages that describe an event also exhibit temporal fluctuations. The messages form a dynamic text stream and pulse along the timeline. The messages also cluster around important moments (also known as sub-events), which represent surges of interest from the observers. In generating event summaries, it is crucial to identify these sub-events and include the corresponding information in the summary. Existing solutions address the problem by monitoring changes in the volume of messages and applying a peak detection algorithm to identify the sub-events. However, this approach may not work well since (1) the volume changes are often not easily identifiable, and (2) the identified peak times can correspond to one or two key event participants who dominate the discussion of the entire event. For example, a key player in a basketball game, such as Kobe Bryant, can lead to high tweet volumes from the observers of the game. The general discussion of well-known players and game highlights can overshadow other players in the game and other key sub-events, which do not always garner the same volume of messages from the observers. Consequently, improvements to message analysis and summarization systems that provide summaries of events from social media messages while observers of the event produce the messages would be beneficial.
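To make the volume-based approach described above concrete, the following is a minimal sketch of one way such a baseline could work; it is not taken from any particular existing system. The function names `count_per_bin` and `detect_peaks`, the 60-second bin width, and the threshold rule (mean plus `k` standard deviations) are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def count_per_bin(timestamps, bin_seconds=60):
    """Count messages in fixed-width time bins (hypothetical helper).

    `timestamps` are message times in seconds. Returns a dense list of
    counts, one per bin, starting at the earliest message.
    """
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter(int((t - start) // bin_seconds) for t in timestamps)
    n_bins = max(counts) + 1
    return [counts.get(i, 0) for i in range(n_bins)]


def detect_peaks(volumes, k=1.5):
    """Return indices of bins that are local maxima above mean + k * std.

    The threshold rule is an illustrative assumption, not a method
    described in the text.
    """
    if len(volumes) < 3:
        return []
    threshold = mean(volumes) + k * pstdev(volumes)
    peaks = []
    for i in range(1, len(volumes) - 1):
        is_local_max = volumes[i] >= volumes[i - 1] and volumes[i] > volumes[i + 1]
        if is_local_max and volumes[i] > threshold:
            peaks.append(i)
    return peaks


# One dominant burst (bin 3) and one smaller bump (bin 6).
volumes = [2, 3, 2, 20, 3, 2, 6, 2, 2, 3]
print(detect_peaks(volumes))
```

With these made-up counts, only the dominant burst clears the threshold, while the smaller bump at bin 6 does not. This mirrors the limitation noted above: a sub-event that draws a modest volume of messages can be masked when a single high-volume moment inflates the threshold.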