This invention relates, in general, to a method and system for collecting and analyzing data mined from multiple sites on the internet, and/or stored in self-contained or pre-loaded databases and, in particular, to a method and system for capturing, extracting, analyzing, categorizing, synthesizing, summarizing and displaying, in a customizable format, both the substance and sentiment embodied within such data, particularly, but not exclusively, when such data comprises user-generated content, such as commentary or reviews or public feedback.
In today's internet-savvy world, people frequently conduct online research before making many of their traditional consumer purchasing decisions. For example, before buying a particular product or service, or when deciding among different kinds of similar products or services, individuals will frequently research the experiences that others have had in buying and using the same or a similar product, taking into consideration both the performance of the product as well as the level and quality of customer service provided by a particular merchant or manufacturer. By way of illustration, before purchasing a digital camera, a user may narrow down their selection to two or three models through a simple filtering process based on price, features and availability. Then, before making a final decision on which product to purchase, the user may go to one of the professional product review websites such as CNET.com or Popular Photography to research the pros and cons of each product in the eyes of a professional photographic equipment reviewer. The user may also go to an enthusiast site such as dpreview.com, Steve's Digicams or Imaging Resource to gather additional information before making a final decision. These enthusiast sites, which are often run in a very professional manner and may contain detailed and methodical observations, provide a forum for reviews written by actual users who can discuss the real-world experiences of someone who has spent their own money on a product. In addition, because user reviews are a valuable source of information, both pre- and post-sale, companies with an online presence such as Amazon, Best Buy and Yahoo have made it easier for consumers to have access to such user reviews on the products they carry by soliciting recent purchasers for their opinions and then making it easy for new buyers to quickly browse through the comments of past purchasers.
At the same time that the internet continues to expand as a ubiquitous resource, online tools such as “blogs”, “wikis”, and a number of social media applications, such as MySpace, Facebook and LiveJournal, are making it easier for consumers to create online content without the need to understand the arcane process of coding web pages in HTML. These sites have become recognized as a legitimate source of news, opinion and information and, as a result of these tools and the proliferation of these sites, user-generated content of all kinds is expanding at a rapid rate across the internet.
However, while the growth of user-generated content, in theory, ought to be very useful to anyone doing product or service research, since more data should mean a greater likelihood of finding a relevant discussion or review about a particular product or service, this is frequently not the case since, as noted above, the information being sought is commonly dispersed across multiple web sites having different interfaces and employing different online search tools. This diffusion of source and lack of uniformity in interface ultimately makes it difficult to find all or even most of the relevant content and frequently makes it virtually impossible to skim through and understand relevant reviews quickly, assuming they can even be located, resulting in a degradation and frustration of the entire decision making process. In addition, the foregoing process can be complicated even further when trying to search for relevant user-generated comments or reviews narrowed or filtered on the basis of a specific personal or lifestyle preference.
The challenge of searching user-generated content can be better understood by examining the services of Google®, one of the best known generalized search engines. Google employs a key word search paradigm that is familiar, and therefore easily used by most casual users. While Google is effective when searching for websites on the basis of a few key words, for example a search for a particular topic, such as “the history of hats”, or a recipe for “mom's apple pie”, Google is less useful when searching for user-generated content, such as non-professional reviews, which may be embedded deep within a larger website. This is due to a number of factors that cannot be easily controlled. First, the “signal to noise” ratio for user-generated content tends to be very low due to the unstructured way in which many non-professional writers write, and as such, simple key word or phrase searching without extensive Boolean manipulation frequently results in a large number of “hits” that don't contain truly relevant information. Second, it is well understood that in order to solve the problem of how to best rank and present relevant search results, search engines such as Google create an index of websites and their associated relevant pages, and then rank order the results returned based, in part, on the number of sites ‘linking back’ to a particular result, with a greater number of links “back” indicating a more ‘relevant’ result, and therefore, a result which should be ranked more highly.
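By way of a simplified illustration only, ranking by inbound links can be sketched as follows. The function name, page names and link data are hypothetical, and the sketch counts raw inbound links only; it is not a description of how Google or any other search engine actually computes its rankings, which rely on considerably more elaborate signals.

```python
from collections import Counter


def rank_by_inbound_links(links):
    """Order pages by how many distinct other pages link to them.

    `links` is an iterable of (source, target) pairs. This is a toy
    model of link-based ranking, not any real search engine's method.
    """
    pages = set()
    inbound = Counter()
    for source, target in set(links):  # ignore duplicate links
        pages.update((source, target))
        if source != target:  # a page linking to itself does not count
            inbound[target] += 1
    # Most-linked pages first; ties are broken alphabetically.
    return sorted(pages, key=lambda page: (-inbound[page], page))


# Hypothetical link graph: several sites link to a professional review
# and to a retailer page, but nothing links to the two user reviews.
links = [
    ("blog-a", "pro-review"),
    ("blog-b", "pro-review"),
    ("forum", "pro-review"),
    ("blog-a", "retailer"),
    ("user-review-1", "retailer"),
    ("user-review-2", "pro-review"),
]

ranking = rank_by_inbound_links(links)
for position, page in enumerate(ranking, start=1):
    print(position, page)
```

In this hypothetical data set, the pages that attract links from other pages are ordered first, while pages that nothing links to receive a count of zero and fall to the bottom of the ordering regardless of how relevant their content may be.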
One problem when such a system is applied to user-generated content is that, for the most part, user-generated reviews will not point to each other, but will instead generally stand on their own, resulting in relevant search results being “buried” many pages down from the “top rated” results. Another problem is that the linking process is time consuming, resulting in a delay, perhaps as much as a month or more, before search indices are refreshed, rendering search results out-of-date before they are even available. Additionally, user-generated content is frequently less well focused than professionally generated content, so that a single entry in a blog, for example, may cover many different and disparate subjects, none very deeply, with the result being many ‘search hits’ that yield little useful information.
Google has attempted to address these problems by creating a second search engine currently known as blogsearch.google.com, but this is not always a convenient solution, since it obligates a user to visit multiple search engines in order to find desired results, and, regardless, does not address the “signal-to-noise” problem described above. As may be understood, as the universe of user-generated content grows, the “signal to noise” problem again becomes a significant factor, and extracting relevant user-generated reviews on a narrow topic in a specific domain of interest becomes unreliable at best.
Accordingly, the need exists for a specialized method and system that specifically addresses the problem of how to easily analyze user-generated content, in various forms, relevant to a particular topic, or related group of topics, and then provide the ability to search within this defined group, presenting a searcher with the most relevant information. In the context of the current invention, relevant information can be thought of as: a) general sentiment indicators on a specific attribute for the product or service from all the reviews, b) a summary, or “gist”, of the most relevant aspects from all user comments, condensed into an abstract that can help the searcher understand the condensed conclusion of the relevant reviews, enabling them to make a decision without the need to read through all the reviews, c) location indexed information to enable a user to easily narrow down their choices based on geography, and d) personalization of content, user interface and access technologies and portals (internet, phone, iPod, etc.) that enables a user to extract information based on their own customized “profile”.