The present invention relates to a system and method for sampling web page addresses by performing a random walk, so that a near-uniform sample is obtained.
Documents on interconnected computer networks are typically stored on numerous host computers that are connected over the networks. For example, so-called "web pages" may be stored on the global computer network known as the Internet, which includes the world wide web. Web pages can also be stored on Intranets, which are typically private networks maintained by corporations, government entities, and other groups. Each web page, whether on the world wide web or an Intranet, has a distinct address called its uniform resource locator (URL), which at least in part identifies the location or host computer of the web page. Many of the documents on Intranets and the world wide web are written in standard document description languages (e.g., HTML, XML). These languages allow an author of a document to create hypertext links to other documents. Hypertext links allow a reader of a web page to quickly move to another web page by clicking on the links. These links are typically highlighted in the original web page. A web page containing hypertext links to other web pages generally refers to those pages by their URLs. Links in a web page may refer to web pages that are stored in the same or different host computers.
It is often desirable to obtain statistical information about web pages on the world wide web, including information about the characteristics of the URLs of the web pages as well as information about the characteristics of the documents referred to by the URLs. Examples of characteristics of the URLs for which statistical information may be gathered include length of the URLs, number of arcs in the URLs, port numbers, file name extensions, and top level Internet domains. Examples of characteristics of the documents for which statistical information may be gathered include length, character set used, language, number of outbound links, number of embedded images, and percentage that are directed to a particular interest (e.g., political, sports, business) or activity (e.g., e-commerce).
Referring to FIG. 1A, a list containing a randomly selected set of URLs may be obtained by performing a random walk of the web (i.e., the world wide web, or the web on an Intranet). Starting with a seed set of URLs, the random walk engine downloads the page at a URL selected at random from the seed set. Any outgoing links are extracted from the downloaded page. The random walk engine selects an outgoing link, if any, at random to become the current URL. If there is no outgoing link in the downloaded page, a seed is chosen at random from the seed set. Each downloaded page not in the seed set is added to the seed set. The list of randomly selected URLs may be formed by selecting the URLs of the visited pages, or the URLs of visited pages plus all URLs referenced by the visited pages, or by randomly selecting a subset of the pages from either of the aforementioned sets of URLs.
Referring to FIG. 1B, another way to obtain a random sample of URLs is to randomly select one or more search terms from a lexicon, perform a search engine query using the selected search terms, and then randomly select one or more URLs from the search results. The selected URL or URLs are added to the list of randomly selected URLs. This process may then be repeated until a suitably sized list of randomly selected URLs has been formed.
While the random sampling procedures described above result in a list of random URLs, the list is biased toward well-connected URLs. Referring to FIG. 2, there is shown a small portion of a hypothetical set 50 of interlinked pages 51-65. As can be seen, some pages have only one inbound link, while others have much larger numbers of inbound links. The URL for a page that is referred to by many pages is more likely to be visited during a random walk, and also more likely to be indexed by a search engine, than a URL that is referred to by few pages. Therefore the list generated by the aforementioned procedures is not uniformly representative of the URLs (or pages) in the set of reachable pages.
The present invention is a system and method for generating a list of near-uniform samples of data sets (e.g., web pages) from among a plurality of host computers. The system performs a random walk so as to generate a set of visited addresses, sometimes called a set of randomly selected addresses, wherein each address in the set corresponds to a data set. For each address in the set of visited addresses, a reachability measure is computed. Then, samples are selected from the set of visited addresses, such that the probability of selecting a given address is inversely proportional to the reachability measure for the address. The selected samples form the list of near-uniform samples.
In an exemplary embodiment, the set of visited addresses is generated by selecting a current address uniformly at random from a seed set, downloading a data set using the current address, and adding the current address along with the outbound links in the corresponding data set to the set of visited addresses. If the data set contains no outbound links, another address is selected uniformly at random from the seed set to become the new current address. Otherwise, a new current address is selected by computing a uniformly random real value r. When r is less than a predetermined value D, an address is selected uniformly at random from the seed set to become the new current address, and otherwise an address is selected uniformly at random from the outbound links of the downloaded data set to become the new current address.
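The walk described in the preceding paragraph may be illustrated by the following minimal sketch. It is not the claimed implementation. It assumes the web is represented by an in-memory dictionary `graph` mapping each address to its list of outbound links, so that a dictionary lookup stands in for downloading a data set. The names `random_walk`, `graph`, `seeds`, `steps`, and `d` are hypothetical, and the default value of `d` (standing in for the predetermined value D) is chosen only for illustration.

```python
import random
from collections import Counter


def random_walk(graph, seeds, steps, d=0.15, rng=None):
    """Perform `steps` downloads; return (visited addresses, visit counts).

    graph: dict mapping address -> list of outbound links (assumed stand-in
           for downloading a data set and extracting its links).
    seeds: non-empty list of seed addresses.
    d:     illustrative value for the predetermined value D.
    """
    rng = rng or random.Random()
    visited = set()      # current addresses plus their outbound links
    visits = Counter()   # number of times each address was downloaded

    current = rng.choice(seeds)
    for _ in range(steps):
        links = graph.get(current, [])  # "download" the data set
        visits[current] += 1
        visited.add(current)
        visited.update(links)

        # No outbound links, or r < D: restart from a random seed.
        if not links or rng.random() < d:
            current = rng.choice(seeds)
        else:
            current = rng.choice(links)
    return visited, visits
```

For example, `random_walk({"a": ["b", "c"], "b": ["a"], "c": []}, ["a"], 200)` walks a three-page graph for 200 downloads. The visit counts are returned alongside the visited set because a reachability measure based on visit frequency needs them.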
In an exemplary embodiment, the reachability measure for each respective address may be set equal to a visited ratio, comprising a ratio of the number of visits to the respective address to the total number of pages visited during the random walk of the data sets. Alternatively, the reachability measure for each respective address may be set equal to the page rank of the data set at the address, where the page rank is an estimate, computed using a predefined page rank function, of what fraction of visits in an infinitely long random walk of reachable data sets at the plurality of hosts would be to the data set at the respective address.
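The visited ratio may be sketched as follows, assuming per-address visit counts were recorded during the walk. The function name `visited_ratio` and the `visits` mapping are hypothetical. The sketch covers only addresses that were actually visited. How an address that appears in the set of visited addresses only as an outbound link (and so has zero visits) should be treated is not stated above and is left open here.

```python
from collections import Counter


def visited_ratio(visits):
    """Map each visited address to (visits to it) / (total visits).

    visits: mapping of address -> number of visits (assumed input).
    """
    total = sum(visits.values())
    if total == 0:
        raise ValueError("no visits recorded")
    return {addr: count / total for addr, count in visits.items()}


# Illustrative counts only, not measured data.
example = visited_ratio(Counter({"a": 3, "b": 1}))
```

In this illustration address "a" receives three of the four visits, so its visited ratio is 0.75 and that of "b" is 0.25.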
In an exemplary embodiment, the sampling of addresses in the set of visited addresses is accomplished by computing a cumulative probability density function using the reachability measure and then, for each sample to be included in the list of near-uniform samples, selecting a random value of the cumulative probability density function and adding to the list the address (from the set of visited addresses) that corresponds to that random value.
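One way such sampling might be carried out is sketched below. It assumes each address carries a strictly positive reachability measure, that each address is weighted by the reciprocal of its measure, and that samples are drawn with replacement. None of these details is fixed by the description above. The names `sample_near_uniform`, `reachability`, and `k` are hypothetical.

```python
import bisect
import itertools
import random


def sample_near_uniform(reachability, k, rng=None):
    """Draw k addresses with probability inversely proportional to reachability.

    reachability: dict of address -> positive reachability measure (assumed).
    Sampling is with replacement (an assumption of this sketch).
    """
    rng = rng or random.Random()
    addresses = list(reachability)
    if any(reachability[a] <= 0 for a in addresses):
        raise ValueError("reachability measures must be positive")

    # Weight each address by 1 / reachability, then accumulate the weights
    # to form an (unnormalized) cumulative function over the addresses.
    weights = [1.0 / reachability[a] for a in addresses]
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]

    samples = []
    for _ in range(k):
        u = rng.random() * total                   # random value of the function
        idx = bisect.bisect_right(cumulative, u)   # address whose interval holds u
        samples.append(addresses[min(idx, len(addresses) - 1)])
    return samples
```

With reachability measures of 0.75 for one address and 0.25 for another, the reciprocal weights are 4/3 and 4, so the less reachable address is selected about three times as often. This offsets the bias of the walk toward well-connected addresses.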