1. Field of the Invention
This invention relates generally to the Internet and the World Wide Web (web) and, more particularly, to methods for determining the relatedness of web sites and web pages.
2. Description of the Related Art
The Internet has become a convenient and popular medium for the exchange of information and the transacting of commerce. An Internet user has access to a vast number of web sites on the World Wide Web and this number is increasing at a rapid pace. For any one web site (hereinafter "subject web site"), there frequently exist numerous other related web sites that offer the same or similar information, services, or products.
A number of systems and methods are known for identifying links to web sites, or pages thereof, that are related to a web site or other item of interest (hereinafter "related links"). One method for identifying related links involves examining link structures of web pages to identify relationships between particular pages or sites. For example, a relationship between web sites A and B may be deemed to exist if a significant portion of the web pages having links to A also have links to B. Another method involves performing a text-based analysis of web pages to identify pages or sites with similar content.
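By way of illustration only, the link-structure test described above might be sketched as follows. The function name `cocitation_fraction`, the `outlinks` mapping, and the sample data are hypothetical and are not taken from any particular known system; what counts as a "significant portion" is left to the caller.

```python
def cocitation_fraction(outlinks, a, b):
    """Return the fraction of pages linking to site `a` that also link to site `b`.

    `outlinks` maps a page identifier to the set of sites that page links to
    (an assumed, simplified representation of crawled link data).
    """
    pages_linking_to_a = [page for page, targets in outlinks.items() if a in targets]
    if not pages_linking_to_a:
        return 0.0  # no page links to `a`, so no evidence of a relationship
    both = sum(1 for page in pages_linking_to_a if b in outlinks[page])
    return both / len(pages_linking_to_a)


# Hypothetical crawl data: four pages and the sites each one links to.
outlinks = {
    "p1": {"A", "B"},
    "p2": {"A", "B"},
    "p3": {"A"},
    "p4": {"B", "C"},
}

# Two of the three pages that link to A also link to B.
fraction = cocitation_fraction(outlinks, "A", "B")
```

A relationship between A and B could then be deemed to exist when the fraction exceeds a chosen threshold; the threshold value is a design choice not specified here.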
The related links identified through these or other methods may be presented to the user through a special client program, which may be a browser plug-in, that displays metadata for the web site or page currently being viewed. The client program typically retrieves the metadata on a URL-by-URL basis from a metadata server. A client program and associated service that operate in this manner are commercially available from Alexa Internet, the assignee of the present application. The related links may also be displayed in other contexts, such as in conjunction with search results from an Internet search engine.
One limitation associated with using link structures and textual content of web pages to identify related links is that large numbers of web pages generally must be retrieved and parsed in order to obtain satisfactory results. Another limitation is that the breadth of the related links data is typically dependent upon the ability of a web crawling program to locate web pages. Because of these limitations, the resulting related links may be based on only a small percentage of existing web pages. The present invention seeks to overcome these limitations while providing an additional measure of the relatedness of web sites.
The present invention provides a method for generating related links from the web usage trails of a population of users. Each usage trail is preferably in the general form of a sequence of URLs or domain names accessed by a user during a browsing session. The usage trails are preferably collected from users of a special client application of the type described above, but may additionally or alternatively be obtained from another source of usage trail data such as an ISP (Internet Service Provider). The method may be used independently, or may be used in combination with other methods (such as those mentioned above) to improve the breadth and reliability of the related links data.
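As a minimal sketch of the usage trail representation described above, the following example reduces a sequence of URLs from one browsing session to a sequence of domain names and then to one-step transitions. The function names, the sample URLs, and the choice to collapse consecutive page views on the same site into a single entry are assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def trail_to_sites(urls):
    """Reduce a session's URL sequence to the sequence of host names visited.

    Consecutive page views on the same host are collapsed into one entry
    (an illustrative assumption, not a requirement of the method).
    """
    sites = []
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip entries that carry no host name
        if not sites or sites[-1] != host:
            sites.append(host)
    return sites


def one_step_transitions(sites):
    """Return the list of direct (one-step) transitions in a site sequence."""
    return list(zip(sites, sites[1:]))


# Hypothetical browsing session.
session = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/",
    "http://c.example/x",
]
sites = trail_to_sites(session)
transitions = one_step_transitions(sites)
# sites       -> ['a.example', 'b.example', 'c.example']
# transitions -> [('a.example', 'b.example'), ('b.example', 'c.example')]
```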
In the preferred embodiment, the relatedness of two web sites or pages A and B is determined using a sensitivity calculation, which is preferably a minimum sensitivity calculation. The sensitivity calculation takes into consideration the number of transitions between A and B relative to the total number of transitions that involve A and/or B within a set of usage trail data. More specifically, the minimum sensitivity between A and B for a set of usage trail data is preferably determined by dividing the number of transitions between A and B by the greater of (i) the total number of transitions between A and all web sites and (ii) the total number of transitions between B and all web sites. In one embodiment, only one-step (direct) transitions between A and B are incorporated into the calculation. In other embodiments, transitions that involve more than one step may be recognized. Additional information extracted from the usage trails, such as the time spent between transitions and the user actions performed at a particular site, may optionally be taken into consideration within the sensitivity calculations.
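The minimum sensitivity calculation described above can be sketched as follows for the one-step embodiment. This is an illustrative sketch rather than a definitive implementation: the function names and sample trails are hypothetical, and transitions "between" A and B are assumed here to be counted in both directions (A to B and B to A).

```python
from collections import Counter


def count_transitions(trails):
    """Count one-step transitions per unordered site pair and per site.

    `trails` is an iterable of site sequences, one per browsing session.
    Direction is ignored (an assumption made for this sketch).
    """
    pair_counts = Counter()
    site_totals = Counter()
    for trail in trails:
        for src, dst in zip(trail, trail[1:]):
            if src == dst:
                continue  # a repeat visit to the same site is not a transition
            pair_counts[frozenset((src, dst))] += 1
            site_totals[src] += 1
            site_totals[dst] += 1
    return pair_counts, site_totals


def minimum_sensitivity(pair_counts, site_totals, a, b):
    """Transitions between a and b, divided by the greater of the two site totals."""
    denominator = max(site_totals[a], site_totals[b])
    if denominator == 0:
        return 0.0  # neither site appears in any transition
    return pair_counts[frozenset((a, b))] / denominator


# Hypothetical usage trails, already reduced to site names.
trails = [["A", "B", "C"], ["A", "B"], ["B", "A", "D"]]
pair_counts, site_totals = count_transitions(trails)

ms_ab = minimum_sensitivity(pair_counts, site_totals, "A", "B")  # 3 / max(4, 4) = 0.75
ms_ad = minimum_sensitivity(pair_counts, site_totals, "A", "D")  # 1 / max(4, 1) = 0.25
```

Dividing by the greater of the two totals yields the smaller of the two one-sided ratios, so a site that is involved in a very large number of transitions does not score as closely related to every site it touches.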
The sensitivity calculations may, for example, be performed for all pairs of web sites reflected within the usage trail data, or for all pairs of web sites that co-occur within at least one usage trail. For each subject site or page A, the other web sites or pages X for which the minimum sensitivity score, MS(A, X), is highest are deemed to be the most closely related to A. Links to such other pages or sites are thus stored as the related links for A.
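The selection of related links for a subject site might be sketched as follows. The function name `related_links`, the `top_n` parameter, and the sample trails are hypothetical. The sketch scores only sites that share at least one direct transition with the subject site, counts transitions in both directions, and breaks ties alphabetically; each of these is an assumption made for illustration.

```python
from collections import Counter


def related_links(trails, subject, top_n=3):
    """Return up to `top_n` (site, score) pairs ranked by minimum sensitivity.

    Scores use one-step transitions only, counted without regard to direction.
    """
    pair_counts = Counter()
    site_totals = Counter()
    for trail in trails:
        for src, dst in zip(trail, trail[1:]):
            if src != dst:
                pair_counts[frozenset((src, dst))] += 1
                site_totals[src] += 1
                site_totals[dst] += 1

    scores = {}
    for pair, count in pair_counts.items():
        if subject in pair:
            (other,) = pair - {subject}  # the single remaining site in the pair
            scores[other] = count / max(site_totals[subject], site_totals[other])

    # Highest score first; ties are broken by site name for a stable ordering.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical usage trails, already reduced to site names.
trails = [["A", "B", "C"], ["A", "B"], ["B", "A", "D"]]
links_for_a = related_links(trails, "A")
# links_for_a -> [('B', 0.75), ('D', 0.25)]
```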