The popularity of blogs has increased at a significant rate over the last few years. It is estimated that the size of the “Blogosphere” in August 2006 was one hundred times larger than it was three years earlier. According to the same estimates, blogging activity is doubling in size every two hundred days, or about once every six and a half months. The weblog tracking company TECHNORATI™ reports that as of August 2006 it has been tracking over 50 million blogs. Without a doubt, blogging is a social phenomenon that is gaining popularity across several age groups, with the bulk of blogging activity generated by people in the age group of 13-29.
At the same time the popularity of social networking sites has also been increasing steadily. It is estimated that currently social networking site MySpace has 130 million registered users and that FACEBOOK™ has approximately 70 million. Reports project that 250 million people will be on social networks by 2009.
The activity in the sphere of blogs has led to the coining of new words. The term Blogger refers to an individual contributing content in digital form to web logs, social networking sites or any online forum. The term Blogosphere references the collection of web logs, social networking sites and any forum and medium of online content contributed to by individuals.
Bloggers produce diverse types of information. General topics include personal diaries, experiences (such as those collected through traveling or concerts), opinions (for example, those prompted by products, events, people, music groups, businesses, etc.), information technology, and politics, to name but a few of the vast topics canvassed by blogs. This information is highly significant because the Blogosphere is an unregulated collective that evolves through the contributions of individuals. Collecting, monitoring and analyzing information on blogs can provide key insights into public opinion on a variety of topics, for example products, political views, entertainment, etc. Analysis of blogs can also identify events of interest, based on their popularity in the Blogosphere. Moreover, it can be a source of competitive intelligence information. Analysis can also provide insights on the usefulness and effect of marketing campaigns in the case of products, public relations strategies, public figures, etc. As such, blog analysis offers opportunities for tracking the dynamics of public opinion. As a result, techniques that aid the collection, analysis, mining and efficient querying of blogs are significant. This is especially true due to the growing popularity of blogs and the fact that this trend is expected to persist.
Traditional web search technology can be readily applied to the Blogosphere. Indeed, numerous search sites exist that specialize in the Blogosphere. The flaw in applying traditional web search technologies to the Blogosphere is that they fail to take into consideration the differences between crawling the World Wide Web and crawling the Blogosphere. Information in blogs has a well-defined temporal dimension that is not present in more traditional web content (i.e. HTML pages). Blog posts have a time-stamp and may trigger additional posts by the same or other bloggers. The temporal dimension, in particular, imposes an ordering on the Blogosphere that can be utilized for effective querying of blogs.
For example, consider a search for information related to the actor “Phillip Seymour Hoffman” on the Blogosphere. The functionality that a traditional search engine offers is a list of all blog posts containing the search string, ranked in some order, as described in U.S. Pat. No. 6,772,150 and U.S. Pat. No. 7,315,861. Although this is informative, greater functionality for information discovery can be achieved in the case of blogs (or any other temporally ordered streaming text sources, for that matter).
The result of the growing popularity of blogs and the proliferation in the number of people maintaining blogs is an increased interest in search and analysis engines for the Blogosphere. These engines use a variety of techniques for information discovery and text analysis. For example, a popularity curve is a graphical visualization of the popularity of a searched query within a temporal window. Popularity curves can be used for analysis, as fluctuations in popularity can provide insight into topics related to a query.
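As an illustration only, a popularity curve of the kind described above can be sketched as a per-day count of posts that mention a query within a temporal window. The function name `popularity_curve`, the `(date, text)` post representation and the sample posts are assumptions made for this sketch; they are not taken from any of the search sites named in this section.

```python
from collections import Counter
from datetime import date, timedelta


def popularity_curve(posts, query, start, end):
    """Return (day, count) pairs for every day in [start, end].

    `posts` is an iterable of (date, text) pairs (a hypothetical format).
    A post counts toward a day if its text contains `query`, ignoring case.
    """
    needle = query.lower()
    counts = Counter(
        day
        for day, text in posts
        if start <= day <= end and needle in text.lower()
    )
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    # Counter returns 0 for days with no matching posts.
    return [(day, counts[day]) for day in days]


# Hypothetical sample data.
posts = [
    (date(2006, 8, 1), "Saw Capote last night, Hoffman was great"),
    (date(2006, 8, 1), "Hoffman interview roundup"),
    (date(2006, 8, 2), "Nothing to do with films"),
    (date(2006, 8, 3), "More on Hoffman's new role"),
    (date(2006, 8, 9), "Hoffman again, outside the window"),
]
curve = popularity_curve(posts, "hoffman", date(2006, 8, 1), date(2006, 8, 3))
```

Narrowing or widening the `start` and `end` arguments corresponds to restricting or expanding the temporal window over which popularity is examined.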
Specifically, TECHNORATI™, BLOGPULSE™ and ICEROCKET™ are online search resources that have the ability to display popularity curves for user queries. Popularity curves can be used to provide a drill down or roll-up style interface thereby allowing the user to easily restrict the search to a specific time interval. The system and method of curves applied by BLOGPULSE™ provides such an interface, while those of TECHNORATI™ and ICEROCKET™ do not.
However, none of the existing blog analysis tools provide any feedback about time-specific events of interest on their popularity curves. Moreover, other inventions that do recognize time-specific events of interest do not do so in a manner that is linked to a popularity curve, as exemplified by U.S. Pat. No. 7,188,078. This makes the task of information discovery tedious. A system that can identify time-specific events of interest would therefore be of assistance to a user.
The system and method of GOOGLE TRENDS™ provides information about the popularity of different keywords in GOOGLE™ search volume. However, since these popularity curves are based on search volume, and not on text content, the functionality to expand or collapse a temporal window is not available. GOOGLE TRENDS™ can also label parts of the popularity curve based on spikes in volume of news stories for a particular keyword. However, these labels, while informative, are difficult to use due to the lack of a navigational interface to facilitate selection of time intervals for analysis. Moreover, these labels are not based on data displayed on the popularity curve, but on a separate data source.
As well, known blog analysis systems and methods are limited with respect to the use of correlated keywords. Many search sites, including GOOGLE™ and TECHNORATI™, use their search volume to identify related queries. However, search volume is available solely for popular search sites and is inaccessible for most others. Other inventions establish correlations between keywords through reliance upon past queries, as is the method of U.S. Pat. No. 7,287,025, instead of focusing upon the content of a present query. These methods distort the range of related query suggestions.
The systems and methods of TECHNORATI™, ICEROCKET™ and U.S. Pat. No. 6,360,215 utilize a list of “tags” related to the searched query for navigation. However, the drawback to this approach is that, because tagging requires manual effort by bloggers, most of the content in the Blogosphere is not tagged. Also, the number of tags for a document is usually fewer than 10, while the actual content itself may contain thousands of words. Therefore, tags generally cannot accurately represent the contents of a document. An additional problem occurs because tags may be subjective or prone to spam.
Known methods and systems base their analysis on tags and search volume and not on actual text content. A more accurate means of examining blogs to determine search relevance is to consider the whole content of the document.
Moreover, known systems and methods fail to account for restrictions on time range (defined as a temporal window). In addition, search parameters such as geographical region or demographic information are engaged through an inefficient method reliant upon data associated with a text source which is not consistently available, as is exemplified by U.S. Pat. No. 7,231,405, wherein the invention is reliant upon geocodes.
The systems and methods of GOOGLE ALERTS™ and YAHOO ALERTS™ provide an alerts service whereby users can register a query with the system. Whenever the system (specifically the crawler) encounters a new document containing the specified query, it raises an alert and sends an email to the user. An alert function is also included in U.S. Pat. No. 7,143,118. This service is useful for monitoring specific items on the web, but it suffers from two main problems: (i) an alert is raised whenever any document (e.g., blog post) containing the query is encountered and not when an event of interest occurs; and (ii) if the number of documents containing the specified query is large then this technique will fail, because the number of alerts will be too many to handle.
The system and method of GOOGLE™ utilizes the number of inlinks to a page as a measure of authority. For example, GOOGLE™'s page rank algorithm makes use of such information. This measure has proven its effectiveness over time for web documents. However, this simple definition of authority ignores contextual and time-specific information and hence is generally inadequate for the Blogosphere, or any other temporally-ordered information source. A more informed authoritative ranking would be achieved by taking into account time, context, authority, and geographic information.
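The limitation described above can be made concrete with a minimal sketch of the simplest inlink-based authority measure: counting the distinct pages that link to each page. This is a plain inlink count and not the page rank algorithm itself; the function name `inlink_counts` and the sample link data are assumptions made for illustration.

```python
from collections import defaultdict


def inlink_counts(links):
    """Count distinct pages linking to each page.

    `links` is an iterable of (source, destination) pairs (a hypothetical
    format). Self-links and repeated links from the same source are ignored.
    """
    sources = defaultdict(set)
    for src, dst in links:
        if src != dst:
            sources[dst].add(src)
    return {page: len(srcs) for page, srcs in sources.items()}


# Hypothetical link graph.
links = [("a", "c"), ("b", "c"), ("a", "c"), ("c", "a"), ("c", "c")]
counts = inlink_counts(links)
# Rank by descending inlink count, breaking ties by page name.
ranked = sorted(counts, key=lambda page: (-counts[page], page))
```

Note that neither the input nor the resulting score carries a time-stamp or any context about why a link was made, which is the information the paragraph above identifies as missing for temporally ordered sources.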
The system and method of TECHNORATI™, as well as other search sites, displays a list of “what is popular now” through an application of tags and search volume. There are two limitations to this approach: (i) it is based on search volume and tags and not on the actual content of posts and is therefore undesirable because tagging requires manual effort, the search volume is not always available, and tags are not always an accurate representation of actual content; and (ii) the list of popular keywords cannot be generated for arbitrary time periods (e.g., 1 Apr. 2006 to 18 May 2006).
The system and method of GOOGLE TRENDS™ lists the top few cities and regions where the user-specified query was most popular (in search volume). This is useful as keywords may have varying popularity across different regions in the world. It would further be useful if a search tool could display a map with regions marked according to the popularity of a search query. However, no tool provides such a service based on the popularity of the query in the actual Blogosphere or in the actual content of temporally ordered information sources.
The systems and methods of GOOGLE ANALYTICS™ and CLUSTRMAP™ provide web analytic tools that use map-based visualization to display the number of visitors to a site from different parts of the world. However, no tool provides such visualization for search results in the Blogosphere.
Known systems and methods apply inverted indexes for the purpose of providing search functionality within text documents. Such indexes suit the traditional web, which consists of a collection of HTML documents, but not the newly emerging social media. Special techniques are required to conduct efficient searching on attributes such as age, gender, and time of creation that are commonly found in documents in social media. Thus, efficient querying on attributes of a user in conjunction with keyword queries is a persistent problem. For example, conducting a search for all blog posts containing “global warming”, posted in April 2007 by males aged 30-45 and located within 50 miles of downtown Toronto, is beyond the capability of known technologies. Traditional indexing schemes create posting lists for each of the keywords in the corpus at indexing time and compute the intersection of posting lists at query time. These schemes work well when constraints on the metadata are absent.
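The traditional indexing scheme described above can be sketched as follows. The sketch builds one posting list per keyword at indexing time and intersects the posting lists at query time. The optional metadata `predicate` is applied only after the intersection; that post-filtering step is an assumption about how such a scheme might be extended and is included to show why metadata constraints are costly, since every keyword match must be examined individually. All names and sample documents are hypothetical.

```python
from collections import defaultdict
from datetime import date


def build_index(docs):
    """Build posting lists: keyword -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for word in doc["text"].lower().split():
            index[word].add(doc_id)
    return index


def search(index, docs, keywords, predicate=lambda doc: True):
    """Intersect the posting lists of all keywords, then post-filter."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return []
    candidates = set.intersection(*postings)
    # Metadata is not indexed here, so each candidate is checked one by one.
    return sorted(i for i in candidates if predicate(docs[i]))


# Hypothetical corpus with author metadata attached to each post.
docs = {
    1: {"text": "global warming is accelerating", "posted": date(2007, 4, 3), "gender": "M", "age": 34},
    2: {"text": "global markets rally", "posted": date(2007, 4, 10), "gender": "M", "age": 40},
    3: {"text": "warming trend global concern", "posted": date(2007, 5, 1), "gender": "F", "age": 29},
    4: {"text": "global warming debate", "posted": date(2007, 4, 20), "gender": "F", "age": 52},
}
index = build_index(docs)
keyword_hits = search(index, docs, ["global", "warming"])
filtered_hits = search(
    index,
    docs,
    ["global", "warming"],
    predicate=lambda d: d["gender"] == "M"
    and 30 <= d["age"] <= 45
    and (d["posted"].year, d["posted"].month) == (2007, 4),
)
```

The sketch treats the keywords as independent terms rather than as a phrase, and it omits the location constraint from the example query for brevity.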
Moreover, known systems and methods, such as those included in U.S. Pat. No. 5,819,260 and U.S. Pat. No. 5,146,405, contemplate means of formulating an additional query based upon the text of a specific document and the implementation of part-of-speech segmentation functionality. However, they achieve the additional query through a method that lacks sophistication and therefore fails to produce a meaningful query.
Finally, known systems and methods routinely apply primitive search interfaces. They lack features such as: one-click zoomable popularity and demographic curves; asynchronously loaded cached copies of search results in tooltips; automatic text summarization; and collaborative dashboards.
In view of the foregoing, what are needed are methods and systems for information discovery and text analysis of the Blogosphere, or other forms of social media and various temporally ordered information sources, that are not necessarily query driven, and that overcome the drawbacks and limitations of the prior art. For example, a user should be able to monitor posts, and keywords of interest that merit further exploration should be automatically suggested.
Further, what is needed is a system and method that does more than solely monitor queries posed by users or blog post tags and rank them based on relative popularity. There is a wealth of related information one can extract from blogs in order to aid information discovery. For example, blog analysis can be a useful tool for marketers and public relations executives as well as others. It can be used, for example, to measure product penetration by comparing the popularity of a product with that of a competitor in the Blogosphere. Moreover, popularity can also be used to assess decisions, like marketing strategy changes, by monitoring fluctuations in popularity.
Additional functionalities, such as one-click zoomable interfaces, tooltips and intelligent alerts through the use of bursts, can further enhance Blogosphere analysis. Further such functionalities include adding a spatial component to queries, identifying temporal dynamics in the list of keywords correlated to a specific keyword, and mapping correlated keywords to topics. These functionalities and features have the potential to improve information discovery and text analysis of the Blogosphere or any other online temporally-ordered text sources.