Currently, semantic capabilities are used in Web applications, such as tools for searching and browsing, data summarization, data reorganization, and automatic inference of logical relations in the data. These capabilities carry heavy requirements and costs for providing “exhaustive” amounts of metadata. Yet, part of the responsibility for providing the metadata can be placed on the users. For instance, some Web-based services allow users to tag Web documents of interest, for sharing or later recall, by assigning one or more keywords to the documents. Data obtained from the tagging can be used to describe the documents and enhance document searches.
However, many social Web repositories, such as del.icio.us and Flickr, make available only sparse amounts of data and metadata, and authors are not encouraged to provide semantically rich content via tagging due to a lack of return value. Therefore, without the appropriate metadata, the benefits of the semantic capabilities that augment various Web applications cannot be offered. A vicious cycle is created in which authors are not motivated to provide semantically rich content because they do not see enough return value in the current applications, and the semantic capabilities cannot offer their potential benefits until enough metadata is made available. This vicious cycle can be broken by enabling automatic extraction and reuse of metadata from the new and growing volume of data made available by social streams in social networking or micro-blogging tools, such as Twitter, Yammer, Facebook, and MySpace. For example, in the Twitter system, during 2012, about 500 million Twitter users generated between 300 and 400 million tweets per day. Further, a study of Twitter in 2011 found that about one out of every five Twitter messages includes a uniform resource locator (URL) and that the text in the tweet is generally a comment about the URL. Thus, the URL and text can include useful metadata, as described in Lichan Hong, Gregorio Convertino, and Ed H. Chi, "Language Matters in Twitter: A Large Scale Study," in Proceedings of ICWSM 2011.
Thus, there is a need for a system and method to automatically extract and reuse existing metadata to provide semantic capabilities for characterizing and clustering message content.