Crawling and retrieval of web content can include browsing the World Wide Web in a methodical and/or orderly fashion to create a copy of visited pages for later processing by a search engine. However, due to the current size of the Web, search engines cannot index the entire Web.
Prior approaches to crawling and retrieving web content include the use of focused web crawlers. A focused web crawler estimates the probability that a visited page is relevant to a focus topic and retrieves a link corresponding to the page only if a target probability is reached. However, a focused web crawler may not retrieve a sufficient number of links or sufficiently relevant links. For example, a focused web crawler can download only a fraction of the Web pages visited.
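By way of illustration only, the thresholding behavior of a focused web crawler described above can be sketched as follows. The estimator shown (the fraction of a page's words that are topic terms), the function names `relevance_probability` and `focused_crawl`, and the in-memory `pages` mapping are hypothetical and are assumed here for clarity; an actual focused web crawler may use any suitable classifier and would fetch pages over a network.

```python
def relevance_probability(page_text: str, topic_terms: set) -> float:
    """Hypothetical estimator: fraction of the page's words that are topic terms."""
    words = page_text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in topic_terms)
    return hits / len(words)


def focused_crawl(pages: dict, topic_terms: set, target: float) -> list:
    """Return the links whose pages reach the target probability.

    `pages` maps a link to the text of the page it points to (an assumed,
    in-memory stand-in for pages visited on the Web).
    """
    retrieved = []
    for link, text in pages.items():
        # Retrieve the link only if the target probability is reached.
        if relevance_probability(text, topic_terms) >= target:
            retrieved.append(link)
    return retrieved


# Illustrative data only; the links and page texts are made up.
pages = {
    "page-a": "solar panel energy output",
    "page-b": "cooking recipes for pasta",
    "page-c": "solar energy news today",
}
topic_terms = {"solar", "energy", "panel"}
retrieved_links = focused_crawl(pages, topic_terms, target=0.6)
```

In this sketch only one of the three visited pages reaches the target probability, so the remaining links are not retrieved even though one of them is partly on topic. This mirrors the limitation noted above, in which only a fraction of the pages visited is downloaded.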