This specification relates to search systems, and more particularly to processing the resource addresses of sites to facilitate information retrieval.
The Internet provides access to a wide variety of resources, examples of which include video or audio files, web pages for particular subjects, book articles, or news articles. A search engine can identify resources in response to a user query that includes one or more search terms or phrases. The search engine ranks the resources based on their importance and their relevance to the query, and provides search results that link to the identified resources. One example search engine is the Google™ search engine provided by Google Inc. of Mountain View, Calif., U.S.A.
A web site is one or more resources associated with a domain name, and one or more servers host each web site. Web sites are maintained by publishers that manage and/or own the web sites. Often web sites include substantively duplicative or similar resources targeted to different groups of users. Examples of substantively duplicative or similar resources are resources in different languages, e.g., resources in a web site that includes corresponding sets of web pages in English, French, German, Japanese, etc.; resources for different countries but in the same language, e.g., English-language pages for users in the United States, Australia, Germany, France, etc.; and user-agent specific pages for different types of user agents.
Often, however, the publisher does not explicitly identify the targeting of the resource, and the targeting cannot be reliably inferred from the resource locator alone. For example, a web site may have sets of resources with similar resource locators, such as:
au.example.com/ . . . /index.html
cn.example.com/ . . . /index.html
de.example.com/ . . . /index.html
or
www.example.com/a/ . . . /index.html
www.example.com/b/ . . . /index.html
www.example.com/c/ . . . /index.html
The resource locators in the first set of resource locators are similar in that they are identical except for the host name labels au, cn and de, which correspond to the country codes of Australia, China and Germany. The resource locators in the second set of resource locators are similar in that they are identical except for the top level path directories a, b and c. For the first set of resource locators, the publisher may provide resources in the same language (e.g., English) and targeted to different countries. Alternatively, the publisher may provide language specific resources targeted to specific languages (e.g., English, Chinese, and German).
With respect to the second set of resources, the publisher may have created its own resource locator structure, the targeting purpose of which is not readily apparent. The top level path directories a, b and c may indicate a language targeting, a country targeting, a user agent targeting, or some other targeting or partitioning of resources based on one or more resource attributes.
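The structural similarity described for the two sets of resource locators can be illustrated with a minimal Python sketch. It is not taken from this specification. The function names locator_signature and group_similar, the vary parameter, and the example path products/index.html are hypothetical and chosen only for illustration. The sketch replaces either the leftmost host name label or the top level path directory with a wildcard, so that locators differing only in that component share a signature.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def locator_signature(locator, vary):
    """Return (signature, varying_component) for a resource locator.

    vary="host" masks the leftmost host name label; vary="dir" masks the
    top level path directory. Both names are hypothetical.
    """
    # urlsplit only recognizes the host when a scheme or "//" is present.
    parts = urlsplit(locator if "//" in locator else "//" + locator)
    labels = parts.netloc.split(".")
    segments = parts.path.lstrip("/").split("/")
    if vary == "host":
        varying, labels[0] = labels[0], "*"
    elif vary == "dir":
        varying, segments[0] = segments[0], "*"
    else:
        raise ValueError("vary must be 'host' or 'dir'")
    return ".".join(labels) + "/" + "/".join(segments), varying


def group_similar(locators, vary):
    """Map each signature to the list of components that varied."""
    groups = defaultdict(list)
    for locator in locators:
        signature, varying = locator_signature(locator, vary)
        groups[signature].append(varying)
    return dict(groups)


# Hypothetical locators; the path "products/index.html" is an assumption.
by_host = group_similar(
    [
        "au.example.com/products/index.html",
        "cn.example.com/products/index.html",
        "de.example.com/products/index.html",
    ],
    vary="host",
)
by_dir = group_similar(
    [
        "www.example.com/a/products/index.html",
        "www.example.com/b/products/index.html",
        "www.example.com/c/products/index.html",
    ],
    vary="dir",
)
```

Such grouping exposes only which component varies (au, cn, de or a, b, c). It does not reveal whether that component indicates a language targeting, a country targeting, a user agent targeting, or some other partitioning of the resources.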
Because the resources of the web site may be substantively duplicative or similar, the search results for a query can include multiple search results that are for the same domain and that reference similar or duplicative resources. A typical user may consider such search results to be redundant. The presence of such search results can obscure other, unique resources identified within the domain, and thus degrade the user experience.