In the current networking environment, Twitter has emerged as a rich source of local news from around the world, with over 340 million tweets reported each day across 200 countries. Many events of local importance are first reported on Twitter, including many that never reach news channels. Further, there are often only a few tweets reporting each such event, in contrast with the larger volumes that follow events of wider significance. Even though such events may be primarily of local importance, they can also be of critical interest to some specific but possibly far-flung entities.
For example, any business enterprise can potentially be affected by events that impact any entity in its ecosystem, such as customers, partners, governments, or competitors. For instance, a fire in the factory of a remotely located supplier halfway around the world can disrupt an enterprise's supply chain and cause significant delays and losses. Especially in today's globalized world, it is becoming increasingly important for an organization to continuously sense the external world for events of potential interest, as well as to extract sufficient information about such events so as to assess their possible impact on its affairs.
With the increasing use of social media, Twitter in particular has become a rich source of breaking news, including news that is local and possibly of limited interest to a wider global audience. Such events may in fact never make it to any news channel, certainly not a global one. At the same time, as the example of a fire in a supplier's factory shows, many such events may indeed be of interest to specific but faraway entities. Detecting such events amongst the nearly 340 million tweets per day is equivalent to ‘finding a needle in a haystack’. Such events need to be sensed from a stream of unstructured short-text messages (tweets) arriving at a rate of tens of messages per second. Nevertheless, the detection of local news events may be of tremendous operational value when correlated with the internal operations and transactions of even a far-flung enterprise.
Next, since the number of messages per event is small, detecting such events by observing trends on keywords, as is prevalent in most techniques for event detection from Twitter, is not feasible. Most of the approaches used in the art assume that many tweets arrive on the same event, so that it can be detected by a rising trend on a keyword. An architectural proposition by Shroff, Agarwal and Dey in 2011, called ‘Enterprise Information Fusion’, aims to achieve information fusion in the enterprise context. However, one of the prerequisites for such a system, as also mentioned therein, is the ability to detect structured event-objects containing precise information about each event, which has remained a challenging task.
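The keyword-trend assumption described above can be illustrated with a minimal sketch. It is not taken from any of the cited works; the function name `trending_keywords`, the thresholds `min_count` and `min_ratio`, and the sample messages are all hypothetical and chosen only to show why a sparsely reported event never crosses a trend threshold.

```python
from collections import Counter

def trending_keywords(previous_window, current_window, min_count=20, min_ratio=3.0):
    """Return words whose per-message frequency rose sharply between two windows.

    Both thresholds are illustrative assumptions, not values from the source.
    """
    # Count each word at most once per message.
    prev = Counter(w for msg in previous_window for w in set(msg.lower().split()))
    curr = Counter(w for msg in current_window for w in set(msg.lower().split()))
    trending = set()
    for word, count in curr.items():
        # A word trends only if it is frequent AND clearly more frequent than before.
        if count >= min_count and count >= min_ratio * max(prev[word], 1):
            trending.add(word)
    return trending

# Hypothetical windows: one widely reported event and one sparsely reported event.
previous = ["good morning everyone"] * 50
current = ["earthquake felt downtown"] * 40 + ["fire at supplier factory"] * 2

trending = trending_keywords(previous, current)
print(sorted(trending))
```

In this sketch the widely reported event produces trending keywords, while the two messages about the factory fire stay far below `min_count` and are never flagged, which is the limitation the paragraph above describes.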
Recent publications on Twitter event detection can be categorized based on (a) the nature of the events or (b) the detection techniques used. Events can be very specific, such as natural calamities, accidents, sports-related events, marketing events or a cultural extravaganza. For example, Sakaki et al. show how to detect earthquakes (Sakaki, Okazaki, and Matsuo 2010). On the other hand, events can be generic, often referred to as ‘breaking news’, as discussed by Phuvipadawat et al. (Phuvipadawat and Murata 2010) and by Weng et al. (Weng and Lee 2011). Amongst the techniques used, one approach is to detect events from clustered tweets, as discussed in (Becker and Gravano 2011), whereas (Weng and Lee 2011) proposes clustering followed by feature extraction. However, since each tweet is very short, it usually covers only one aspect of an event (e.g., its location or severity), and therefore clustering tweets based on word similarity groups together only those tweets that describe the same aspect of the event.
In the former approach, which requires clustering the tweets first, one has to wait for more tweets to arrive before an event can be detected. The techniques used therein would therefore fail to detect sparsely reported events, which can be of critical importance. There is thus an acute need for an improved classification technique that can detect such sparsely reported events, effectively discard irrelevant tweets and consequently improve processing efficiency. None of the related work discussed above focuses on capturing and processing sparsely reported events for the extraction of information that can be vital to an enterprise.
Another challenge usually faced in the analysis of these tweets is that of correlation across multiple messages. If different words are used to describe the same event, correlating messages through word-to-word similarity does not work, no matter what technique is used. Correlation requires the extraction of properties or concepts (e.g., time of occurrence, location, etc.) to create a stream of structured event-objects. Well-known information-extraction techniques fail to give good results in this scenario because of the informal and abbreviated language typically used in tweets.
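The difference between word-level and property-level correlation can be sketched as follows. This is only an illustration under stated assumptions: the two messages, the event-object fields (`type`, `location`, `date`) and the helper names `jaccard` and `same_event` are hypothetical, and Jaccard overlap stands in for word-to-word similarity in general.

```python
import re

def tokens(text):
    """Lowercase word tokens of a message (a deliberately naive tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Word-overlap (Jaccard) similarity between two messages."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def same_event(e1, e2, keys=("type", "location", "date")):
    """Compare two structured event-objects on their extracted properties."""
    return all(e1.get(k) == e2.get(k) for k in keys)

# Two hypothetical messages about the same incident, worded differently.
tweet_1 = "Huge blaze at the chemical plant in Pune this morning"
tweet_2 = "Pune: factory on fire since 9am, workers evacuated"

# Hypothetical event-objects that an extraction step might produce from them.
event_1 = {"type": "fire", "location": "pune", "date": "day-1"}
event_2 = {"type": "fire", "location": "pune", "date": "day-1"}

word_similarity = jaccard(tweet_1, tweet_2)
structured_match = same_event(event_1, event_2)
print(round(word_similarity, 3), structured_match)
```

The two messages share only one word, so their word-level similarity is very low, yet the extracted properties match exactly. The sketch assumes the extraction step has already succeeded, which is the difficult part for informal tweet language.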
Furthermore, natural language processing techniques for information extraction from well-formed English prose have been fairly successful (Finkel, Grenager, and Manning 2005), (Kristina et al. 2003). However, information extraction from the informal language used on Twitter has not yet been possible with a similar level of accuracy.
Finally, even after potential event-objects are identified, many of these still represent the same real-world event, so it is important to further correlate such potential event-objects and merge them.
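One simple way to picture this merging step is sketched below. It is an assumption-laden illustration rather than a described method: the function `merge_events`, the choice of `type`, `location` and `date` as the matching key, the added `sources` field, and the sample event-objects are all hypothetical.

```python
from collections import defaultdict

def merge_events(events, keys=("type", "location", "date")):
    """Group candidate event-objects that agree on `keys` and merge each group.

    Within a group, the first non-missing value seen for a field is kept.
    """
    groups = defaultdict(list)
    for event in events:
        groups[tuple(event.get(k) for k in keys)].append(event)

    merged = []
    for group in groups.values():
        combined = {}
        for event in group:
            for field, value in event.items():
                if value is not None:
                    combined.setdefault(field, value)
        combined["sources"] = len(group)  # number of candidates merged
        merged.append(combined)
    return merged

# Hypothetical candidates: the first two describe the same real-world event,
# each carrying a different aspect of it.
candidates = [
    {"type": "fire", "location": "pune", "date": "day-1", "severity": None},
    {"type": "fire", "location": "pune", "date": "day-1", "severity": "major"},
    {"type": "strike", "location": "lagos", "date": "day-1", "severity": None},
]

merged = merge_events(candidates)
for event in merged:
    print(event)
```

Exact matching on a fixed key is the simplest possible choice; a practical system would likely need tolerant matching on location and time, which this sketch does not attempt.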
In light of the foregoing, there exists a need for a system that can consume a stream of sparsely reported tweets of different types and convert them into event-objects for specific event types that can represent the events of interest to an enterprise.