A web crawler is a software application that can crawl both the Internet and enterprise intranets. A unit of crawling is called a cruise, and comprises a branching tree of paths in which each node is a Web page. For each cruise, a web crawler can take a number of inputs, including certain initial or seed URLs, a maximum depth of nodes to be crawled, and/or a set of one or more regular expressions to which the crawled URLs must adhere.
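By way of illustration only, the inputs to a cruise might be modeled as a small configuration object. The following Python sketch is not taken from any particular web crawler: the names `CruiseInput`, `seed_urls`, `max_depth`, `url_patterns`, and `allows` are hypothetical, the example URLs are placeholders, and the rule that a URL need match only one of the regular expressions is an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CruiseInput:
    """Hypothetical container for the inputs of one cruise."""
    seed_urls: list[str]
    max_depth: int
    url_patterns: list[str] = field(default_factory=list)

    def allows(self, url: str, depth: int) -> bool:
        """Return True if a URL found at the given depth may be crawled."""
        if depth > self.max_depth:
            return False
        # Assumption: with no patterns configured, every URL is acceptable;
        # otherwise the URL must match at least one of the patterns.
        if not self.url_patterns:
            return True
        return any(re.search(p, url) for p in self.url_patterns)


cruise_input = CruiseInput(
    seed_urls=["http://example.com/"],
    max_depth=2,
    url_patterns=[r"\.(html|doc)$"],
)
```

With this input, `cruise_input.allows("http://example.com/a.html", 1)` is true, whereas a URL ending in `.png`, or any URL found at depth 3, would be rejected.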
FIG. 1 shows an example input 10 for a web crawler. During a cruise, the web crawler visits each of the seed URLs and parses the HTML content obtained from each URL for the links it contains. The user can override the general parser to locate or ignore specific types of links, such as advertising links. Next, the web crawler visits the linked URLs and repeats the process, while ensuring that no link is visited twice. The input 10 causes the web crawler to crawl the specified seed websites to a specified depth of 100, retrieving all HTML, PPT, DOC, and JPG files, but to cut short a cruise branch at the CNN link. FIG. 2 shows a result 12 of a cruise according to input 10.
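The visiting process described above can be sketched, under simplifying assumptions, as a breadth-first traversal. In the Python sketch below, the function `run_cruise`, the `get_links` callback, and the in-memory `PAGES` table are all hypothetical; the table stands in for fetching and parsing real HTML, and a single regular expression is used here to exclude the advertising link rather than a customized parser.

```python
import re
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch each page and parse its HTML instead.
PAGES = {
    "http://a.example/": [
        "http://a.example/one.html",
        "http://ads.example/banner.html",
    ],
    "http://a.example/one.html": [
        "http://a.example/two.html",
        "http://a.example/",
    ],
    "http://a.example/two.html": ["http://a.example/three.html"],
    "http://a.example/three.html": [],
    "http://ads.example/banner.html": [],
}


def run_cruise(seed_urls, max_depth, pattern, get_links):
    """Breadth-first crawl returning {url: depth} for every visited page."""
    allowed = re.compile(pattern)
    visited = {}
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if url in visited:
            continue  # never visit the same link twice
        visited[url] = depth
        if depth == max_depth:
            continue  # depth limit reached; do not follow links further
        for link in get_links(url):
            # Follow only unvisited links that match the URL pattern.
            if link not in visited and allowed.search(link):
                queue.append((link, depth + 1))
    return visited


result = run_cruise(
    seed_urls=["http://a.example/"],
    max_depth=2,
    pattern=r"^http://a\.example/",
    get_links=lambda url: PAGES.get(url, []),
)
```

In this example the cruise visits the seed page and the pages `one.html` and `two.html`; the page `three.html` lies beyond the depth limit, and the advertising link is skipped because it does not match the pattern.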
The results of a cruise by the web crawler can be used for various applications or scenarios. A text mining system, for example as implemented in the Text Retrieval and Extraction (TREX) component of the SAP NetWeaver technology platform, can index the HTML contents obtained from the links visited by the web crawler to enable a full text search over these contents and/or documents. The text mining system can also add attributes to the indexed pages and documents to enable a search over these attributes. For example, the attributes may be metadata provided with a document, such as author, title, and so on.
For these and other applications of web crawler results, it may be useful to know which seed URL led to a given content or document. However, for many Web pages, it is difficult to know from the page's URL which seed URL led to the page. For example, a URL may contain an IP address that has no association with the seed URL, or contain no path information to show the path by which the URL was accessed. Further, the number of URLs and documents crawled during a cruise can run into the millions, making a quick determination of the seed URL very difficult.