This invention relates, in general, to a method and system for associating, extracting or mapping content found on multiple websites related to a specific business and, in particular, to a method and system for crawling multiple websites containing one or more web pages having information relevant to a particular domain of interest, such as details about local restaurants, extracting content from such websites, such as hours, location, phone number as well as reviews, reviewers, review dates and user tags and associating the extracted content with a specific business entity.
Since the earliest days of the internet, major search engines have used special software robots, known as spiders, to locate information and build lists of words, found across multiple websites, in order to provide search services to end-users. Google, Yahoo, AltaVista and Excite are each examples of well-known generalized search engines. In order to populate their individual databases, each search engine uses a crawling mechanism, which operates transparently to the end-user, to gather information about individual websites and web pages.
In general, web crawling is the automated and methodical process of collecting content from multiple sites on the Internet. This collected data is then used by the initiating search engine to create an index that can then be searched by an end-user through a web-based interface. However, as the Internet expands to hundreds of millions of websites, and billions of individual web pages, the amount of content and number of pages that need to be discovered and analyzed, as well as regularly refreshed and reanalyzed, places a significant burden and high overhead, in terms of financial costs and data processing resources, on the search engines that process this information. In contrast, the instant invention describes a system that only extracts information relevant to the domain of interest, and offers efficiencies in cost and scalability for processing and presenting information.
While every generalized search engine works in a slightly different and often proprietary manner, statistical-relevance based web crawlers, such as those operated by Google and Yahoo as well as open source crawlers, typically index individual web pages as follows. First, when the search engine arrives at a website, it looks in the root (main) folder of the site for a file called robots.txt. In the robots.txt file, it looks for which directories and files it is allowed to look at and index. Once the crawler finds a web page, it looks at the head section (the content between the <head> and </head> tags) of the web page for:
1. The title of the page
2. The keyword and description meta tags
3. The robots meta tag
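By way of illustration only, the generalized crawler steps described above (consulting robots.txt, then reading the title and the keyword, description and robots meta tags from the head section) can be sketched in Python using the standard library. The robots.txt content, the sample page, the user agent name "ExampleBot", the example.com URLs and the names HeadParser, allowed and read_head are all hypothetical and are not taken from any particular search engine.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical web page used only for illustration.
PAGE = """\
<html><head>
<title>Luigi's Trattoria</title>
<meta name="keywords" content="italian, pasta, restaurant">
<meta name="description" content="Family-run Italian restaurant.">
<meta name="robots" content="index, follow">
</head><body><p>Welcome.</p></body></html>
"""


class HeadParser(HTMLParser):
    """Collects the title and selected meta tags found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.in_title = False
        self.title = ""
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif self.in_head and tag == "title":
            self.in_title = True
        elif self.in_head and tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            # Keep only the three meta tags named in the text above.
            if name in ("keywords", "description", "robots"):
                self.meta[name] = attr.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def read_head(html):
    """Return (title, meta) where meta maps tag name to its content."""
    parser = HeadParser()
    parser.feed(html)
    return parser.title.strip(), parser.meta
```

In this sketch, allowed(ROBOTS_TXT, "ExampleBot", "http://example.com/private/menu.html") is refused by the Disallow rule, while read_head(PAGE) returns the page title together with the three meta tag values.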
Web page content is defined as all content located between the <body> and </body> tags. This is where the crawler looks for the keywords and phrases defined in the page's keywords and description meta tags. The crawler also finds and follows links embedded within the web page content. Typically, a crawler reads the content in the order it is found on a page, from top to bottom. Generalized search engine spiders look at all the words in a web page and record where they were found, but each search engine has a different way of looking at the web page. For example, most crawlers look at the <head> section of the HTML document and retrieve the <title> and <meta> sections. These sections are used to provide sufficient context through keywords and other instructions to permit the crawlers to process and organize the information retrieved. Most crawlers will also analyze all the words found on a page as well as click patterns generated by users accessing that page to determine which content on a web page is most relevant. Generally, page subtitles, image alt attributes, text links and text link title attributes are also used to infer the core substance of a web page. Finally, spiders from Google, Yahoo, MSN and other companies apply their own proprietary rules to organize and index the retrieved content.
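By way of illustration only, the body processing described above (reading the content from top to bottom, recording where each word was found, and collecting embedded links to follow) might be sketched as follows. The names BodyParser and index_body, the word-splitting rule and the sample page are assumptions made for this sketch; actual search engines apply their own proprietary rules.

```python
import re
from html.parser import HTMLParser

# Hypothetical page used only for illustration.
SAMPLE = (
    "<html><head><title>Ignored</title></head>"
    "<body><h1>Pasta menu</h1>"
    '<p>Fresh pasta daily. <a href="/reviews">Reviews</a></p>'
    "</body></html>"
)


class BodyParser(HTMLParser):
    """Collects words and link targets found between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.words = []  # words in the order they appear on the page
        self.links = []  # href values in the order they appear on the page

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body and tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            # Assumed tokenization: runs of letters, digits and apostrophes.
            found = re.findall(r"[A-Za-z0-9']+", data)
            self.words.extend(word.lower() for word in found)


def index_body(html):
    """Return (positions, links); positions maps each word to its offsets."""
    parser = BodyParser()
    parser.feed(html)
    positions = {}
    for offset, word in enumerate(parser.words):
        positions.setdefault(word, []).append(offset)
    return positions, parser.links
```

Calling index_body(SAMPLE) yields a mapping from each body word to the positions at which it occurs, together with the list of links a crawler would follow next. Text in the head section, such as the title, is deliberately excluded from this index.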
While the concept of general search engines, and their value in searching and displaying information for consumer use, based on meta tags and keywords, has been refined by companies like Google, there is a new class of vertical search engines that focus on indexing and searching information in a targeted category of interest, such as local health care providers or travel information and planning resources. The concept behind these vertical search engines is to aggregate and organize information in such a way that a consumer will find it easier to obtain relevant information in one place and, more importantly, to provide a search experience where the results are limited to only that information relevant to a domain or category of interest.
The travel and real estate fields are each good examples of established industries where consumers rely on vertical search websites such as kayak.com, orbitz.com, and tripadvisor.com for travel searches, and zillow.com, and trulia.com for searches related to real estate. These sites organize information in a way that is more intuitive and specifically targeted for the type of transactions or time sensitive information a searcher may be looking for. Their value is also recognized by search engines such as Google, which index and rank these vertical sites highly due to the value they provide to consumers. As a result, a search conducted through a vertical search engine will generally provide a more focused, useful and richer consumer experience when compared to the same search conducted through a generalized search engine. For example, when conducting a simple search for a local business, such as a restaurant, in Google or Yahoo, the top ranked responses will typically be from those sites that have some relevant content, but will also likely contain much irrelevant information, forcing the user to construct a long and complicated search expression. In addition, because of the limitations inherent in keyword indices, there may be many local information sources containing timely and relevant content about the sought-after restaurant that should be considered by the searcher, but generally these sites will not be returned, regardless of the search query, since they are not easily indexed or integrated into the generalized database constructed by a statistical-relevance based search engine.
In addition, while browser searching has traditionally taken place at a desk in an office or at home, the proliferation of mobile computing platforms including cell phones and smartphones, such as the iPhone and BlackBerry, many of which include GPS or other location-determining technologies, has resulted in a class of mobile users who can benefit from a crawling and analysis method and system as taught by the invention, wherein a broad range of search results associated with a specific local business can be obtained. For example, as noted above, the best results for a local search query may not necessarily be keyword oriented, so traditional search engines, such as Google, which rely on meta tags and keywords and are not very capable of deep crawling such content automatically, will not do a good job of extracting relevant information and displaying it to the consumer. In addition, since websites are all constructed differently, statistical search engine strategies do not readily provide a scalable solution around this limitation.
By way of example, if we consider the case of a traveler accessing Google through their mobile phone to search for information about the best Italian restaurant in a particular city, their search may retrieve multiple pages of possibly relevant links, but there is no easy way for the consumer to analyze the results to arrive at an answer of what is the ‘best Italian restaurant’ with a high degree of confidence. In addition, an authority on what is ‘best’ in one area may not necessarily be an authority in another, which is especially true in the case of local businesses, so a search strategy that works well for Italian restaurants in San Francisco may not work well for Italian restaurants in Chicago or London or Milan.
Some search engines, such as Google, attempt to provide a way for ratings and other relevant local business information to be accessed, but only if a user conducts their search through a dedicated local business search portal. However, the results returned by these local business searches are generally limited to data “claimed” by the business owner, which may well be biased, or provided by a third-party content provider through some form of structured API integration with the search engine, such as the GoogleBase API. As a result, each of these search scenarios limits the information available to the searcher to only that which has been provided by an interested third party, and therefore also limits its utility.
Other search engines have attempted to permit the association of a website with a known business, but these systems do not provide a mechanism to capture information from multiple sources, both structured and unstructured, about a known local business and then extract the captured information and associate, or link, the extracted information with the local business. One example of such a system is that illustrated in US Patent Application 2005/0149507 to Nye. The Nye application is directed to identifying a URL address of a business entity by matching attributes about that entity, such as a phone number, to possible URLs and then selecting the URL that is most likely associated with that phone number, in a kind of reverse-webpage directory. While the Nye application talks about permitting a user to look for local restaurants, for example, and discover their official web pages, this is a capability that already exists in search engines such as Google. However, Nye does not provide a consumer or business with the ability to search for, capture and extract information about a specific local restaurant from across multiple sites, official, commercial and enthusiast, both structured and unstructured, and to confidently associate only the relevant results with that restaurant.
Returning to the discussion of searching for relevant review information in a mobile application while traveling, while it is possible for a consumer to manually search each of the local review sites they are aware of to try to assemble a ‘global’ opinion about a particular restaurant, or cuisine, in a particular city, this process can be time consuming and, if a searcher is unaware of a particular site or blog, due to their unfamiliarity with the area, they may miss searching it completely. In addition, when a consumer is traveling and conducting a local business search through their mobile device, having to visit multiple sites to gather information is unacceptable. This problem is understood by the travel industry, where websites such as kayak.com or orbitz.com offer significant consumer value by aggregating airfares or hotel rates from multiple carriers or lodging providers. However, for consumer searches where opinion and review data is often as, or more, important than raw pricing information, the brute force method of aggregating information employed by kayak.com, for example, will not work.
As a result, it is recognized that when searching for local business information, such as local restaurant information, it will be very helpful to an end user, enhancing their search experience, to aggregate content from review sites such as citysearch.com, local.yahoo.com etc., together with professional reviews found on newspaper and media sites such as sfgate.com, nymag.com, sanfran.com, as well as with reviews from restaurant-related blogs such as Becks and Posh (becksposhnosh.blogspot.com) and discussion boards and forums such as chowhound.chow.com. Such a search engine is provided through the website BooRah.com, which employs natural language processing technology to generate quantitative scores for domain specific attributes derived from plain English text, and further provides automatic summaries from the most relevant user sentiments to enable an end user to perform highly customizable searches based on personal preferences. A key element of the BooRah search engine is a comprehensive database containing all of the kinds of content and reviews noted above, and the search engine therefore benefits greatly from the method and system for identifying, collecting, analyzing, mapping and extracting relevant information, including reviews and opinions, as well as corresponding attributes, such as specific reviewer identification, review date and review rating associated with a specific local business, taught by the instant invention.
Accordingly, the need exists for an improved method and system for crawling, mapping and extracting information from web pages where the extracted information can be mapped to a specific business. The invention teaches a method and system that collects and extracts relevant information associated with a specific local business from multiple and diversified online data sources such as dedicated online review sites, blogs, newspaper and professional review sites and other types of discussion boards. The invention comprises a semantic crawling mechanism designed to identify and, if deemed relevant, to extract reviews, pictures, meta content and other domain specific attributes from those sites identified as being pertinent to a particular field, or domain, of interest. The invention further comprises an entity mapping mechanism that can associate the extracted content with an actual business, the results of such association enabling the population of a domain specific database that can then be used with a user-friendly search mechanism to allow an end user to search for relevant information about a particular search domain, such as local Italian restaurants, with a high degree of precision and ease.
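By way of illustration only, one simple way in which extracted content might be associated with a known business is to compare a normalized phone number found in the extracted content against a list of known business records. This passage does not specify how the entity mapping mechanism operates, so the matching rule, the record layout, the sample data and the names normalize_phone and map_to_business below are assumptions made purely for this sketch.

```python
import re

# Hypothetical records of known businesses (illustrative data only).
BUSINESSES = [
    {"id": 1, "name": "Luigi's Trattoria", "phone": "(415) 555-0100"},
    {"id": 2, "name": "Casa Verde", "phone": "415-555-0199"},
]


def normalize_phone(raw):
    """Keep digits only, so differently formatted numbers compare equal."""
    return re.sub(r"\D", "", raw)


def map_to_business(extracted, businesses):
    """Return the id of the business whose phone number matches, or None.

    Assumption: a phone number match alone is treated as sufficient here;
    a practical system would likely weigh several attributes together.
    """
    phone = normalize_phone(extracted.get("phone", ""))
    if not phone:
        return None
    for business in businesses:
        if normalize_phone(business["phone"]) == phone:
            return business["id"]
    return None
```

For example, a review extracted together with the phone number "415.555.0100" would be associated with the first record above, while content with no phone number, or with an unknown number, would remain unmapped.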