The present invention relates in general to search and indexing systems and methods and in particular to search and/or indexing systems wherein geographic locations are associated with documents and processing of those documents depends on their associated locations.
The World Wide Web (Web), as its name suggests, is a decentralized global collection of interlinked information—generally in the form of “pages” that may contain text, images, and/or media content—related to virtually every topic imaginable. A user who knows or finds a uniform resource locator (URL) for a page can provide that URL to a Web client (generally referred to as a browser) and view the page almost instantly. Since Web pages typically include links (also referred to as “hyperlinks”) to other pages, finding URLs is generally not difficult.
What is difficult for most users is finding URLs for pages that are of interest to them. The sheer volume of content available on the Web has turned the task of finding a page relevant to a particular interest into what may be the ultimate needle-in-a-haystack problem. To address this problem, an industry of search providers (e.g., Yahoo!, MSN, Google) has evolved. A search provider typically maintains a database of Web pages in which the URL of each page is associated with information (e.g., keywords, category data, etc.) reflecting its content. The database might also index every significant word (i.e., all words except for stop words, such as “the”, “a”, “in”, etc.) in each document. The search provider also maintains a search server that hosts a search page (or site) on the Web. The search page provides a form into which a user can enter a query that usually includes one or more terms indicative of the user's interest. Once a query is entered, the search server accesses the database and generates a list of “hits,” typically URLs for pages whose content matches keywords derived from the user's query. This list is provided to the user as “search results”.
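By way of illustration only, the keyword database and query matching described above might be sketched as a small inverted index. The stop-word list, the function names, and the sample pages below are hypothetical and are not taken from any particular search provider; an actual system would also rank the hits rather than simply return them.

```python
from collections import defaultdict

# Hypothetical stop-word list; a real provider would use a much larger one.
STOP_WORDS = {"the", "a", "in", "of", "and"}

def tokenize(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_index(pages):
    """Map each significant word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every significant term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return hits

# Illustrative pages; the URLs and text are made up for this sketch.
pages = {
    "http://example.com/menu": "pot stickers and noodles",
    "http://example.com/contact": "visit us in sunnyvale",
}
index = build_index(pages)
```

In this sketch a query is satisfied only by a page that itself contains every significant query term, so a query whose terms are split across two pages of the same site returns no hits.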
In the more general case, a corpus of documents is searched and a search engine provides documents deemed responsive to a query. It should be understood that documents in a document corpus could be considered pages and that other divisions of digitized items of information could also be designated as documents or pages for the purpose of search.
In some cases, because of the large number of search results, it is desirable to filter out some of the results and provide only the more relevant results. One filtering that is useful in many situations is to filter pages based on the searcher's location, so that the search results are skewed towards pages relevant to the searcher's actual location or specified location. This can be done by having page authors include metadata indicating the relevant locations of their pages, but mostly this is not done, and it is often difficult to otherwise assess a relevant “location” for a document.
It is sometimes desirable to be able to search for information that is associated with a physical location. A typical example might be to find all of the pizza restaurants within a five-mile radius of one's current location. One approach is to find products and services within some local area using a location-specific directory (e.g., “electronic yellow pages”). The information contained in these directories is generally compiled manually and often does not make any connection between the product or service and its web pages (if any exist). The information in yellow pages products is prone to be incomplete and out-of-date; it is also generally expensive to maintain.
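As a non-limiting sketch of the radius search mentioned above, listings that already carry latitude and longitude coordinates might be filtered by great-circle distance using the haversine formula. The listing format (name, latitude, longitude) and the function names are assumptions made for illustration; the text above does not specify how a directory stores or compares locations.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, in statute miles

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def within_radius(origin, listings, radius_miles):
    """Return names of listings no farther than radius_miles from origin.

    origin is (lat, lon); each listing is a hypothetical (name, lat, lon) tuple.
    """
    lat0, lon0 = origin
    return [name for name, lat, lon in listings
            if haversine_miles(lat0, lon0, lat, lon) <= radius_miles]

# Illustrative data only: approximate coordinates, made-up business names.
searcher = (37.3688, -122.0363)
listings = [
    ("Nearby Pizza", 37.3700, -122.0400),
    ("Distant Pizza", 37.7749, -122.4194),
]
nearby = within_radius(searcher, listings, 5.0)
```

The filter is only as good as the coordinates attached to each listing, which is the part that manually compiled directories make expensive to keep current.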
As another example, a user might be searching for a restaurant in Sunnyvale that serves pot stickers. The information might be available, such as where a restaurant puts up a web site having a contact page showing its address in Sunnyvale and a restaurant menu page listing items offered, such as pot stickers, but if the restaurant does not include “Sunnyvale” on its restaurant menu page, a search engine might not provide the menu in response to the query “Pot stickers in Sunnyvale”.
An alternative to manually compiled yellow page data is to automatically extract similar information directly from pages on the web, thus eliminating expensive manual procedures. This can also result in more complete and timely data. One approach to this entails associating or tagging a geographical location to individual pages. Most pages do not contain any textual or other indication regarding the location of the item or items described on that page (or other such physical locus to be associated with the page). Due to the large number of pages in a typical corpus (the Web as a corpus includes billions of pages) as well as their ephemeral nature, manually labelling individual pages with locations would require a very large and continuous effort.
A typical technique for automatically associating a geographic location with a web page is to locate and parse any addresses contained within a page. However, an address mentioned in the text of a page may not actually represent the physical location of the items described on that page and many pages might have no explicit mention of an address.
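The address-parsing technique described above might be sketched, under simplifying assumptions, with a regular expression that recognizes one common form of United States street address. The pattern, the short list of street suffixes, and the function name are hypothetical; practical address parsers handle far more formats than this one.

```python
import re

# Hypothetical pattern for addresses such as "123 Main St., Sunnyvale, CA 94086".
# It deliberately covers only a few street suffixes and one punctuation style.
ADDRESS_RE = re.compile(
    r"(?P<number>\d+)\s+"
    r"(?P<street>(?:[A-Z][a-z]+\s+)+(?:St|Ave|Blvd|Rd|Dr)\.?),\s*"
    r"(?P<city>[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*),\s*"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5})"
)

def extract_addresses(text):
    """Return (city, state, zip) for every address-like string found in text."""
    return [(m.group("city"), m.group("state"), m.group("zip"))
            for m in ADDRESS_RE.finditer(text)]

# Illustrative page text; the address is made up for this sketch.
contact_page = "Visit us at 123 Main St., Sunnyvale, CA 94086 for lunch."
menu_page = "Pot stickers, noodles, and soup served daily."
```

Here `extract_addresses(contact_page)` finds one candidate location while `extract_addresses(menu_page)` finds none, and nothing in the match indicates whether a found address is actually the location of the items the page describes.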
Another common approach for associating a geographic location with a web page is to analyze whois or DNS information and attempt to relate an IP address or hostname to a location. This method suffers due to inaccuracies in the registration data and because it is common for an Internet Service Provider (“ISP”) to host web sites in locations that are nowhere near the physical locations of the sites they host. The inaccuracy makes this technique generally infeasible.
There have also been attempts to utilize metatag conventions for the specification of physical locations for pages, but inaccuracies and lack of usage make this technique generally infeasible.
It is known to impute search terms from one document to another so that a query of two or more terms is satisfied by a document that does not contain all of the terms of the query if its “parent” document(s) do contain the missing terms. See, for example, U.S. Pat. No. 5,991,756 assigned to Yahoo! Inc. and entitled “Information Retrieval from Hierarchical Compound Documents”. While documents with a clear hierarchical organization can be easily searched using those techniques, searching can be more difficult where it is not always clear which documents go with which other documents, how authoritative the contents of one document are, or how other inevitable ambiguities should be resolved.
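The general idea of imputing terms from a parent document might be sketched as follows. This is a simplified illustration under assumed data structures (each document records its own terms and at most one parent) and is not a description of the method of the cited patent.

```python
def effective_terms(doc_id, docs):
    """Collect a document's own terms plus the terms of all its ancestors."""
    terms = set()
    seen = set()  # guards against accidental cycles in the parent links
    current = doc_id
    while current is not None and current not in seen:
        seen.add(current)
        terms |= docs[current]["terms"]
        current = docs[current]["parent"]
    return terms

def matches(doc_id, query_terms, docs):
    """True if every query term is in the document or one of its ancestors."""
    return set(query_terms) <= effective_terms(doc_id, docs)

# Hypothetical two-page site: a home page and a menu page beneath it.
docs = {
    "home": {"terms": {"restaurant", "sunnyvale"}, "parent": None},
    "menu": {"terms": {"pot", "stickers"}, "parent": "home"},
}
query = ["pot", "stickers", "sunnyvale"]
```

With this data, `matches("menu", query, docs)` is true because the missing term is imputed from the home page, whereas `matches("home", query, docs)` is false. The sketch presumes that the parent links are already known, which is the assumption that fails when the document hierarchy is unclear.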
It would be desirable to provide a method and apparatus for “geo-locating” pages automatically whether or not the page authors provided suitable indications of location.