This specification relates to identifying affiliated domains.
A Uniform Resource Locator (URL) is a string of characters that identifies a resource (e.g., an addressable web document or file) on a computer network. A URL provides a means for locating a resource by describing the resource's location on the network. Each URL includes a hostname. A hostname is a unique name by which a network naming system identifies a particular device or group of devices attached to the network. Naming systems for various networks (e.g., the Internet or local area networks) use hostnames in this way.
The network naming system used by the Internet, the Domain Name System (DNS), associates each hostname with an Internet Protocol (IP) address. A single IP address can be associated with one or more distinct hostnames. For example, the DNS can map the different hostnames www.domain1.com and www.domain2.com to the same IP address. In this case, a user who inputs either hostname (e.g., by entering it into a web browser) will be routed to the same network location, namely the location identified by the single IP address.
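The many-to-one relationship between hostnames and IP addresses can be illustrated with a minimal sketch. The table below is a hypothetical stand-in for DNS answers (the addresses come from a range reserved for documentation), and the function name hostnames_by_address is an assumption introduced only for this example.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical hostname-to-address table standing in for real DNS answers.
# The addresses are from 192.0.2.0/24, a block reserved for documentation.
DNS_TABLE: Dict[str, str] = {
    "www.domain1.com": "192.0.2.10",
    "www.domain2.com": "192.0.2.10",
    "www.random.com": "192.0.2.20",
}


def hostnames_by_address(table: Dict[str, str]) -> Dict[str, List[str]]:
    """Invert a hostname -> address table into address -> sorted hostnames."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for hostname, address in table.items():
        groups[address].append(hostname)
    return {address: sorted(names) for address, names in groups.items()}


shared = hostnames_by_address(DNS_TABLE)
```

In this sketch, www.domain1.com and www.domain2.com end up in the same group because both map to the single address 192.0.2.10.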
In this specification, the term “domain” will be used to refer to those Internet resources that are addressable through URLs sharing the same hostname. A domain may include a very large number of resources and IP addresses, or it may include only a few resources and a single IP address. Under this definition, a domain will always be identified using its hostname: the hostname www.random.com, for example, will be used to indicate the collection of resources addressable through that hostname.
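Under this definition, grouping resources into domains amounts to grouping their URLs by hostname. The following is a minimal sketch using Python's standard urllib.parse module; the function name group_by_domain and the sample URLs are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def group_by_domain(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group URLs by hostname, so each key names one domain as defined above."""
    domains: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        # urlsplit(...).hostname is lowercased and excludes any port number.
        host = urlsplit(url).hostname
        if host is not None:  # skip strings that carry no hostname
            domains[host].append(url)
    return dict(domains)


# Hypothetical sample URLs.
sample = [
    "http://www.random.com/index.html",
    "http://www.random.com/products/list.html",
    "http://www.random.ca/index.html",
]
domains = group_by_domain(sample)
```

Here the two URLs sharing the hostname www.random.com fall into one domain, while the URL under www.random.ca forms a separate domain.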
Each hostname ends in a top-level domain name. The top-level domain name can be, for example, a generic top-level domain name, e.g., .com or .gov. Alternatively, the top-level domain name can be a country code top-level domain (“ccTLD”) name, e.g., .fr or .ca, which identifies the country associated with the registration. Hostnames also include a second-level domain name immediately to the left of the top-level domain name. The second-level domain name can indicate a particular organization that is associated with the content on the domain. For example, the hostname www.random.com may indicate that the content is associated with an organization named Random, Inc. Hostnames having the same second-level domain name but different top-level domain names may be unrelated: for example, www.random.be and www.random.com may well be associated with distinct organizations.
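The label structure described above can be sketched by splitting a hostname on its dots. This is a deliberately naive illustration: the function name split_hostname and the returned field names are assumptions, and the sketch applies the positional definition literally. For a hostname such as www.random.co.uk, the label immediately to the left of the top-level domain name is "co", so the organization's label appears one level lower.

```python
from typing import Dict, List, Union


def split_hostname(hostname: str) -> Dict[str, Union[str, List[str]]]:
    """Split a hostname into its top-level, second-level, and remaining labels."""
    # Normalize case and drop an optional trailing dot before splitting.
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 2 or "" in labels:
        raise ValueError(f"expected at least two non-empty labels: {hostname!r}")
    return {
        "top_level": labels[-1],       # e.g. "com" or "ca"
        "second_level": labels[-2],    # label immediately left of the top level
        "lower_levels": labels[:-2],   # everything further left, e.g. ["www"]
    }


parts = split_hostname("www.random.com")
```

Comparing only the second_level field would treat www.random.be and www.random.com as matching even though they may belong to distinct organizations, so a match of this kind is at most a hint.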
A number of hostnames can belong to the same organization. For example, an organization can register hostnames in different countries (i.e., with a ccTLD name) in addition to registering a hostname under a generic top-level domain name. The organization Random, Inc. might decide to register the hostnames www.random.ca and www.random.co.uk (registrations in Canada and the United Kingdom, respectively) in addition to the hostname www.random.com. The organization's websites may include substantially similar content, despite being found under different hostnames.
Search results presented to a user (e.g., in response to a search query) can include results corresponding to different resources, found on different domains, that can be considered substitutes for each other. A typical user may consider the resources on different domains belonging to the same parent organization—for example, on www.random.co.uk and on www.random.ca—to be similar enough that search results from both are redundant. The presence of such repetitive results can obscure other, unique resources identified within the search results, detracting from the effectiveness of the search algorithm.
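The redundancy problem can be made concrete with a small sketch that keeps only the first result from each group of affiliated domains. The affiliation table here is hypothetical and supplied by hand; how such a table might be derived is outside the scope of this example. The function name collapse_affiliated and the sample URLs are likewise assumptions.

```python
from typing import Dict, List, Set
from urllib.parse import urlsplit

# Hypothetical, hand-supplied mapping from hostname to an affiliation group.
AFFILIATION: Dict[str, str] = {
    "www.random.com": "random-inc",
    "www.random.ca": "random-inc",
    "www.random.co.uk": "random-inc",
}


def collapse_affiliated(result_urls: List[str], affiliation: Dict[str, str]) -> List[str]:
    """Keep the first result per affiliation group; keep all ungrouped results."""
    seen_groups: Set[str] = set()
    kept: List[str] = []
    for url in result_urls:
        host = urlsplit(url).hostname
        group = affiliation.get(host) if host is not None else None
        if group is None:
            kept.append(url)        # no known affiliation: always keep
        elif group not in seen_groups:
            seen_groups.add(group)  # first result seen for this group
            kept.append(url)
    return kept


# Hypothetical ranked search results.
results = [
    "http://www.random.co.uk/widgets.html",
    "http://www.other.org/widgets.html",
    "http://www.random.ca/widgets.html",
]
collapsed = collapse_affiliated(results, AFFILIATION)
```

In this sketch the result from www.random.ca is dropped because a result from the affiliated domain www.random.co.uk already appears earlier in the list, leaving the result from www.other.org more visible.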