The invention generally relates to messaging in social networks, and more particularly relates to searching and retrieving messages using messaging context and keyword frequency.
The Internet is a tremendous source of information, but finding a desired piece of information has been the proverbial “needle in the haystack”. For example, services like blogs present data miners with the daunting task of perusing extensive amounts of text in order to find data that can be applied to other uses. Hence, text data mining and information retrieval systems designed for large collections of lengthy documents have arisen out of the practical need to find a piece of information in massive collections of varied documents (such as the World Wide Web) or in databases of professional documents (such as medical or legal documents). Likewise, with the popularity of social networking increasing every day, the amount of user-generated content from social networking sites continues to grow. Thus, finding information that is relevant and usable is quickly becoming more difficult.
Messages on these popular social networking services, such as Twitter messages or Facebook statuses, are typically much shorter than full web pages. This brevity, however, makes it difficult to apply current filtering techniques, which were designed to sort through large amounts of data. For example, popular techniques such as term frequency-inverse document frequency (TF-IDF) weighting depend on both the collection of information and the average document size being large.
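The dependence on document size can be illustrated with a minimal sketch. It assumes one common textbook variant of TF-IDF (term count divided by document length, multiplied by the natural log of collection size over document frequency); the function name `tf_idf` and the sample messages are hypothetical and are not taken from any particular system.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights per document (textbook variant, assumed here).

    tf  = occurrences of the term / number of tokens in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical short messages: every term occurs at most once per message.
messages = [
    "amazon river trip",
    "amazon package arrived",
    "river rafting today",
]
weights = tf_idf(messages)

# The term-frequency factor is identical for every term in a message,
# so any difference in weight comes from the IDF factor alone.
for message, w in zip(messages, weights):
    print(message, {term: round(value, 3) for term, value in w.items()})
```

In a message of only a few words, nearly every term occurs exactly once, so the term-frequency factor no longer separates the terms of that message and the weighting rests on collection-wide statistics alone.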
Additionally, in recent years there has been an increase in the number of very short documents, usually in the form of user-generated messages or comments. Typical user-generated messages come from a number of sources, for example, instant messaging programs, such as AOL Instant Messenger; online chat rooms; text messages from mobile phones; message publication services, such as Twitter; and “Status” messages, such as those on Facebook pages. Thus, with the rising popularity of these messaging services, a need has arisen to search the messages for their content. Some techniques for searching short messages consist simply of regular expression matching. However, these techniques typically fail when a search term is ambiguous and/or used in unrelated topics. For example, searching for “Amazon” could find messages about both the Amazon river and the online retailer, Amazon. Conversely, if additional terms are provided, many relevant messages may be omitted. For example, a search for “Amazon river” would not match the message “Hiked to the Amazon today—what a beautiful jungle this is”. A web page or large document about the Amazon River would likely contain both the words “Amazon” and “river”, but a short message may not.
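Both failure modes of simple pattern matching can be reproduced with a short sketch. The helper `match_all_terms` and the sample messages are hypothetical, written only to illustrate the behavior described above; the first message is adapted from the example in the text.

```python
import re

# Hypothetical sample messages.
messages = [
    "Hiked to the Amazon today, what a beautiful jungle this is",
    "Just ordered a new book from Amazon",
    "The Amazon river floods every year",
]

def match_all_terms(query, texts):
    """Return the texts that contain every query term as a whole word,
    ignoring case. This is plain regular expression matching with no
    notion of context or topic."""
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in query.split()
    ]
    return [text for text in texts if all(p.search(text) for p in patterns)]

# Ambiguous single term: matches the river messages and the retailer message.
single = match_all_terms("Amazon", messages)

# Added term: the hiking message is about the river region but is dropped
# because it never contains the word "river".
double = match_all_terms("Amazon river", messages)

print(len(single))  # 3
print(double)       # ['The Amazon river floods every year']
```

The single-term query returns all three messages, including the unrelated retailer message, while the two-term query returns only the message that happens to contain both words.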
Additionally, due to the tremendous volume of messages flowing through a social media network, the number of messages stored over a period of time can be quite substantial. A search for a particular word or words can therefore identify a similarly large number of messages within a relatively short time period. For example, the more common the term, the shorter the time period that a result set of a given size covers and/or the larger the number of recent messages that match. Also, as previously noted, ambiguous terms such as “Amazon” or “tool” can cause further problems, such as false positives.
Accordingly, there is a need to provide a message searching and retrieval system to identify relevant short messages while overcoming the obstacles and shortcomings previously noted and recognized in the art.