Many search engine services, such as Google and Overture, provide for searching for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request (i.e., a query) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service may generate a relevance score to indicate how relevant the information of the web page may be to the search request based on various metrics such as the term frequency and inverse document frequency metric (“tf*idf”). The search engine service may also generate an importance score to indicate the importance of the web page based on various metrics such as Google's PageRank metric. The search engine service then displays to the user links to those web pages in an order that is based on a ranking determined by their relevance and importance.
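As a rough illustration of the relevance scoring mentioned above, the following sketch computes one common variant of the tf*idf metric. It is not the formula of any particular search engine service. The function name `tf_idf_scores`, the whitespace tokenization, and the specific weighting (term frequency normalized by document length, multiplied by the logarithm of the inverse document frequency) are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, documents):
    """Score each document against the query with a simple tf*idf variant.

    Assumptions (not from the source text): documents are plain strings,
    tokens are lowercased whitespace-separated words, and
    idf = log(N / document_frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            term = term.lower()
            if not tokens or doc_freq[term] == 0:
                continue  # empty document or term absent from the collection
            tf = term_counts[term] / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            score += tf * idf
        scores.append(score)
    return scores


if __name__ == "__main__":
    pages = ["patent office news", "weather news today", "patent patent law"]
    for page, score in zip(pages, tf_idf_scores(["patent"], pages)):
        print(f"{score:.4f}  {page}")
```

In this sketch a page that repeats a query term more often scores higher, and a term that appears in every page contributes nothing because its idf is zero.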
Some techniques for determining the relevance of a web page to a query factor in whether a query term matches a URL term of the URL of the web page. For example, if a query is “USPTO news,” then these techniques may indicate that the web page with the URL of “www.uspto.gov” and the web page with the URL of “www.uspto.gov/news” are more relevant to the query than a web page with the same content but with a URL that does not match a query term. One such technique, the URL depth priors technique, assigns different relevance probabilities to web pages based on the type of their URLs. The URL types are ROOT, SUBROOT, PATH, and FILE. A ROOT URL contains only a domain name that is optionally followed by “index.html” (e.g., “www.uspto.gov/index.html”). A SUBROOT URL contains only a domain name followed by a single directory that is optionally followed by “index.html” (e.g., “www.uspto.gov/news/index.html”). A PATH URL contains a domain name followed by an arbitrarily deep path that is optionally followed by a file name that can only be “index.html” (e.g., “www.uspto.gov/news/2005” or “www.uspto.gov/news/2005/index.html” but not “www.uspto.gov/news/2005/archive.html”). A FILE URL is any URL ending with a file name other than “index.html” (e.g., “www.uspto.gov/news/2005/archive.html”).
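The four URL types described above can be sketched as a small classifier. This is a minimal illustration, not a definitive implementation. The function name `classify_url` is hypothetical. The rule that a final path segment containing a dot is treated as a file name is an assumption, because the description does not say how a file name is distinguished from a directory. The sketch also assumes that ROOT and SUBROOT take precedence, so PATH is returned only for paths of two or more directories.

```python
from urllib.parse import urlsplit


def classify_url(url: str) -> str:
    """Return "ROOT", "SUBROOT", "PATH", or "FILE" for a URL."""
    # urlsplit only recognizes the host when a scheme or "//" is present,
    # so scheme-less URLs such as "www.uspto.gov/news" get a "//" prefix.
    if "//" not in url:
        url = "//" + url
    segments = [s for s in urlsplit(url).path.split("/") if s]

    if segments and segments[-1] == "index.html":
        # A trailing "index.html" is optional and does not change the type.
        segments.pop()
    elif segments and "." in segments[-1]:
        # Assumption: a last segment with a dot is a file name.
        return "FILE"

    depth = len(segments)  # number of directories after the domain name
    if depth == 0:
        return "ROOT"
    if depth == 1:
        return "SUBROOT"
    return "PATH"


if __name__ == "__main__":
    for example in (
        "www.uspto.gov/index.html",
        "www.uspto.gov/news/index.html",
        "www.uspto.gov/news/2005",
        "www.uspto.gov/news/2005/archive.html",
    ):
        print(classify_url(example), example)
```

Run against the example URLs given in the text, the sketch yields ROOT, SUBROOT, PATH, and FILE respectively.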
The URL depth priors technique has achieved acceptable performance for home page and named page searching when the URL prior probability based on URL type is combined with content relevance (TREC-2004 Web Track Guidelines, Jul. 16, 2004). Home page searching refers to a query submitted by a user when the user wants to find a home page. For example, a user may submit the query “US patent office” when searching for the home page “www.uspto.gov.” Named page searching refers to a query, submitted by a user who wants to find a non-home page, that identifies the name of the desired page rather than words describing its topic. For example, a user may submit the query “patent office news” when searching for the named page “www.uspto.gov/news.”
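The description above does not state how the URL prior probability is combined with content relevance. One plausible reading, sketched below, treats the content relevance as a log-likelihood and adds the logarithm of the prior for the URL type, which is equivalent to multiplying the two probabilities. The names `URL_TYPE_PRIOR`, `combined_score`, and `rank_pages` are hypothetical. The prior values are made-up placeholders for illustration, not measured or published figures.

```python
import math

# Hypothetical priors per URL type. These numbers are illustrative
# placeholders only; real priors would be estimated from training data.
URL_TYPE_PRIOR = {"ROOT": 0.60, "SUBROOT": 0.25, "PATH": 0.10, "FILE": 0.05}


def combined_score(content_log_likelihood, url_type, priors=URL_TYPE_PRIOR):
    """Add the log of the URL-type prior to a content log-likelihood.

    Assumption: the content relevance score is a log-probability, so
    adding log(prior) corresponds to multiplying the probabilities.
    """
    return content_log_likelihood + math.log(priors[url_type])


def rank_pages(candidates):
    """Sort (url, content_log_likelihood, url_type) tuples, best first."""
    return sorted(
        candidates,
        key=lambda c: combined_score(c[1], c[2]),
        reverse=True,
    )


if __name__ == "__main__":
    candidates = [
        ("www.uspto.gov/news/2005/archive.html", -4.0, "FILE"),
        ("www.uspto.gov", -4.0, "ROOT"),
        ("www.uspto.gov/news", -4.0, "SUBROOT"),
    ]
    for url, _, url_type in rank_pages(candidates):
        print(url_type, url)
```

With equal content scores, the placeholder priors push shallower URLs toward the top of the ranking. That ordering suits home page searching but, as the following paragraphs note, may not suit other kinds of queries.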
The URL depth priors technique, however, may not achieve acceptable performance for topic distillation searching. Topic distillation searching refers to a query submitted to find pages directed to a specific topic. For example, a user may submit the query “patent office 37 CFR revisions” when searching for web pages relating to recent changes to the Code of Federal Regulations affecting the U.S. Patent and Trademark Office.
Since it is difficult to identify whether a query is intended to be a home page, named page, or topic distillation query, the URL depth priors technique may not achieve acceptable performance when used by a general search engine.