The present invention relates to textual classification systems, and in particular, to social network message categorization and identification systems and methods.
The Internet is a tremendous source of information, but finding a desired piece of information has been like finding the proverbial “needle in the haystack”. For example, services like blogs present data miners with the daunting task of perusing extensive amounts of text in order to find data that can be applied to other uses. Hence, text data mining and information retrieval systems designed for large collections of lengthy documents have arisen out of the practical need to find a piece of information in massive collections of varied documents (such as the World Wide Web) or in databases of professional documents (such as medical or legal documents). Likewise, with the popularity of social networking increasing every day, the amount of user-generated content from social networking sites continues to grow. Thus, finding information that is relevant and usable is quickly becoming more difficult.
Messages on these popular social networking services, such as Twitter messages or Facebook statuses, are typically much shorter than full web pages. This brevity, however, makes it difficult to apply current filtering techniques, which were specifically designed to sort through large amounts of data. For example, popular techniques such as term frequency-inverse document frequency (TF-IDF) weighting depend on both the collection of documents and the average document size being large.
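The dependence of TF-IDF weighting on document length can be illustrated with a minimal sketch. The code below assumes one common formulation (term frequency normalized by message length, multiplied by the logarithm of the collection size over the document frequency); the function name `tf_idf` and the three sample messages are hypothetical and are not taken from any particular system.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed formulation: (count / doc_length) * log(N / document_frequency).
    Other TF-IDF variants exist; this one is chosen only for illustration.
    """
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        weights.append({
            term: (count / len(toks)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical short messages: every term occurs exactly once per message.
messages = [
    "amazon order arrived",
    "amazon jungle hike",
    "lunch was great",
]

weights = tf_idf(messages)
for message, w in zip(messages, weights):
    print(message, {term: round(value, 3) for term, value in w.items()})
```

Because each term appears only once in a message, the term-frequency factor is identical for every term in that message, so the weights are determined by the inverse document frequency alone. With so small a collection, terms such as "order" and "arrived" receive exactly the same weight and cannot be ranked against each other.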
Additionally, in recent years there has been an increase in the number of very short documents, usually in the form of user-generated messages or comments. Typical user-generated messages come from a number of sources, for example, instant messaging programs, such as AOL Instant Messenger; online chat rooms; text messages from mobile phones; message publication services, such as Twitter; and “Status” messages, such as those on Facebook pages. Thus, with the rising popularity of these messaging services, a need has arisen to search the messages for their content. Some techniques for searching short messages consist simply of regular expression matching. However, these techniques typically fail when a term being searched is ambiguous and/or used in unrelated topics. For example, searching for “Amazon” could result in finding messages about both the Amazon river and the online retailer, Amazon. Also, if additional terms are provided, many relevant messages may be omitted. For example, searching for “Amazon river” would not match the message “Hiked to the Amazon today—what a beautiful jungle this is”. A webpage or a large document about the Amazon river would likely contain both the words “Amazon” and “river”, whereas a short message may not.
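Both failure modes of regular expression matching described above can be reproduced with a short sketch. The helper `search` and the sample messages are hypothetical; the first message paraphrases the example in the text, with its punctuation simplified.

```python
import re

# Hypothetical sample messages for illustration only.
messages = [
    "Hiked to the Amazon today, what a beautiful jungle this is",
    "My Amazon order arrived two days early",
    "The Amazon river is the largest river by discharge",
]

def search(pattern, texts):
    """Return the texts that contain a case-insensitive match for pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [t for t in texts if rx.search(t)]

# Broad query: matches every message, including the one about the retailer.
broad = search(r"\bamazon\b", messages)

# Narrow query: drops the retailer message, but also the hiking message,
# which is about the river region without using the word "river".
narrow = search(r"\bamazon\s+river\b", messages)

print(len(broad), "messages match the broad query")
print(len(narrow), "message matches the narrow query")
```

The broad pattern cannot separate the two senses of the term, while the narrow pattern excludes a relevant message simply because the message is too short to contain both words.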
Accordingly, there is a need to provide messaging categorization systems and methods to identify relevant social network messages while overcoming the obstacles and shortcomings previously noted and recognized in the art.