Text-based social data, such as blogs, is increasingly prevalent, and useful information may be buried deep within it. Indeed, social data may include valuable information about how a particular product is perceived across the social data network, i.e. what comments users have made with respect to a particular product or service. In addition, social data such as blogs or tweets may relate to a company, or to a company's sector of activity.
The persons who generate text-based social data on web logs (blogs), social networking sites, or any online forum, are referred to as “bloggers”. Bloggers produce a variety of different types of information, such as personal diaries, experiences (such as food, travels), opinions (on products, services, people, politics and politicians), to name but a few.
One aspect of blogging is the unregulated, spontaneous and collective expression of ideas. The collective information produced by bloggers is significant, in that an analysis of this data can provide insight into public opinion on products, political views, companies, entertainers, public figures, etc. In a sense, the blogosphere can become a source of competitive intelligence for analyzing the usefulness of a given marketing campaign, public relations strategies, public response to a given product or service, and the like.
In contrast to web pages or wikis, social data is linked with a date stamp and publisher information, which provides a temporal reference point and an associated actor. This date stamp can be used to track and analyze over time the information generated in social media. The temporal aspect of social media is also interesting in that a post, or a number of posts, can trigger additional posts by the same or other bloggers.
Another aspect of social data entries is that over time, one or more persons may become “influencers” within the greater community. For example, a blogger who regularly writes about a particular issue, and gathers a large following, will generally be more influential than an ad hoc blogger, or one who may write regularly on unrelated topics.
However, there are some considerable issues in trying to sort through the information contained in social data, in order to produce insights. First of all, social data is often not neatly categorized by topic. Secondly, social data often contains spam, or other undesirable entries, such as porn, which makes searching through blogs irritating at best, and misleading at worst.
Many different search engines exist on the Internet, blog search engines being one example. Blog search engines enable searching through blogs that have been previously indexed. However, raw searching rarely produces useful results, or produces so many results that it is far too time consuming to sort through them manually. Even a well-designed search filter often yields far too much information. Also, traditional search engines are based on crawlers, which index vast amounts of information, but these crawlers are not adapted to index social entries that are additionally defined by their temporal aspect.
Social data providers offer APIs (application programming interfaces) so that third parties may access the information they have catalogued and indexed. Of course, this access is “raw”, i.e. the information is unformatted and unorganized.
Another term known in the art is “fire hosing”. In fire hosing, one would go directly to a source of information such as Google or Twitter, and “get” all the information related to a given query, or queries. Then the data is cleaned up. This technique has the disadvantage of retrieving considerable amounts of information, and is non-discriminatory, in that the results would include, inter alia, spam and porn.
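The fetch-everything-then-clean pattern described above can be sketched as follows. This is a minimal illustration only: the function names (firehose_then_clean, fake_fetch, looks_like_spam), the sample entries and the keyword blocklist are all hypothetical, and no real provider API is being called.

```python
# Hypothetical in-memory "source"; a real fire hose would return far more data.
SAMPLE_ENTRIES = [
    "Acme phone review: solid battery",
    "WIN casino bonus with Acme phone",
    "Acme phone vs competitor",
    "Unrelated gardening post",
]

# Hypothetical blocklist used for the clean-up step.
BLOCKED_WORDS = {"casino"}


def fake_fetch(query):
    """Stand-in for a provider call: return every entry mentioning the query."""
    return [e for e in SAMPLE_ENTRIES if query.lower() in e.lower()]


def looks_like_spam(entry):
    """Very crude filter: flag an entry containing any blocked word."""
    lowered = entry.lower()
    return any(word in lowered for word in BLOCKED_WORDS)


def firehose_then_clean(fetch_all, query, is_unwanted):
    """Retrieve everything for the query first, then discard unwanted entries."""
    raw = list(fetch_all(query))
    kept = [entry for entry in raw if not is_unwanted(entry)]
    return kept, len(raw) - len(kept)


kept, discarded = firehose_then_clean(fake_fetch, "acme phone", looks_like_spam)
print(kept, discarded)
```

Note that every entry is retrieved before any filtering takes place, which is precisely the cost the paragraph above identifies.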
It is also known in the art that the volume of information on social media networks currently roughly doubles every 6-8 months.
Some social data providers have developed “crawlers”, software applications that “crawl” the web, and perform indexing functions. In some cases, information contained on social media websites can be indexed by these crawlers, but the indexing that is performed is rudimentary. Another drawback of crawling is that given the vast amount of information that is contained in social media, the index is often outdated, sometimes by as much as 6 months. This means that if the conventional wisdom that social media content doubles every six months or so holds true, then an index that is six months out of date will miss about half the relevant information.
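The arithmetic behind the "about half" figure can be made explicit. Assuming, for illustration only, that content volume grows exponentially with a fixed doubling period, an index that lags by a given number of months covers a fraction 2^(-lag/period) of what exists today. The function name missed_fraction and its parameters are hypothetical.

```python
def missed_fraction(index_lag_months, doubling_period_months):
    """Fraction of currently existing content that is newer than the index.

    Assumes steady exponential growth: volume doubles once per doubling period.
    """
    if doubling_period_months <= 0:
        raise ValueError("doubling period must be positive")
    if index_lag_months < 0:
        raise ValueError("index lag cannot be negative")
    return 1.0 - 2.0 ** (-index_lag_months / doubling_period_months)


# A 6-month-old index under 6-month and 8-month doubling periods.
for period in (6, 8):
    print(period, round(missed_fraction(6, period), 3))
```

Under these assumptions, a 6-month lag with a 6-month doubling period gives exactly one half, while an 8-month doubling period gives roughly 0.41.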
There have been attempts to address these and other issues, for example, as described in US patent application no. US 2009/0319518 A1 to Koudas et al. Koudas et al. teach a method for searching text sources which include temporally ordered data objects. In the method, access is provided to the text sources including the temporally-ordered data objects. A search query based on terms and time intervals is obtained or generated, in addition to obtaining or generating time data associated with the data objects, which are then identified based on the query. Koudas et al. then generate a popularity curve based on the frequency of data objects corresponding to one or more of the search terms in the one or more time intervals.
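The general idea of a popularity curve, i.e. counting the data objects that match one or more search terms in each time interval, can be sketched as follows. This is not the implementation of Koudas et al.; the function name popularity_curve, the whole-word matching rule, the fixed-width intervals and the sample posts are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def popularity_curve(posts, terms, start, end, interval_days=7):
    """Count posts matching any term in each fixed-width interval of [start, end).

    posts is an iterable of (publication_date, text) pairs.
    Returns a list of (interval_start_date, matching_post_count) pairs.
    """
    wanted = {t.lower() for t in terms}
    counts = Counter()
    for published, text in posts:
        if not (start <= published < end):
            continue  # outside the queried time range
        words = set(text.lower().split())
        if wanted & words:  # at least one search term appears as a whole word
            counts[(published - start).days // interval_days] += 1
    n_buckets = -(-(end - start).days // interval_days)  # ceiling division
    return [
        (start + timedelta(days=i * interval_days), counts.get(i, 0))
        for i in range(n_buckets)
    ]


# Hypothetical sample data.
posts = [
    (date(2024, 1, 1), "New phone released today"),
    (date(2024, 1, 3), "The phone battery is great"),
    (date(2024, 1, 9), "Holiday photos"),
    (date(2024, 1, 10), "Phone review roundup"),
]
curve = popularity_curve(posts, ["phone"], date(2024, 1, 1), date(2024, 1, 15))
print(curve)
```

Here the first weekly interval contains two matching posts and the second contains one, which is the kind of frequency-over-time series a popularity curve plots.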