The following relates to the web crawling arts, database crawling arts, database mining arts, and related arts.
Web crawling is a process by which indexes are constructed for Internet search engines. The technique operates recursively, by indexing a current web page, then following the hyperlinks contained in the current web page to other web pages and indexing those linked web pages, and so forth. Web crawling thus leverages the existing hyperlinked superstructure of the Internet to efficiently index web pages. The indexing can be respective to various aspects of the web page. In topic-focused web crawling, web pages are indexed respective to a particular subject of interest. This can be based, for example, on the occurrence of keywords related to the subject, and the web page can be assigned a score based on the count of occurrences of relevant keywords or on another suitable metric or metrics. As the Internet is global in extent, there is also interest in indexing documents respective to language. For example, a user may want to limit the search results to the user's native language, or knowledge of the language in which a web page is written can be leveraged to perform machine translation of the web page into the user's native language.
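By way of illustration only, the index-then-follow procedure and a keyword-count score might be sketched as follows. The sketch assumes a small in-memory collection of pages in place of the Internet; the names PAGES, keyword_score, and crawl are hypothetical, and the recursion is expressed with an explicit queue, which visits the same pages as a recursive formulation would.

```python
from collections import deque
import re

# Hypothetical in-memory "web": identifier -> (page text, outgoing hyperlinks).
PAGES = {
    "a": ("climate change and global warming", ["b", "c"]),
    "b": ("cooking recipes", ["a"]),
    "c": ("climate policy; climate models", ["d"]),
    "d": ("sports news", []),
}

def keyword_score(text, keywords):
    """Count occurrences of relevant keywords in the page text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in keywords)

def crawl(seed, pages, keywords):
    """Index the seed page, then follow hyperlinks to index linked pages."""
    index = {}
    seen = {seed}            # avoids re-indexing pages that link to each other
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = pages[url]
        index[url] = keyword_score(text, keywords)
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl("a", PAGES, {"climate", "warming"})
```

In this sketch every reachable page receives a score, including a score of zero; a practical crawler would fetch pages over a network and would typically use a more refined relevance metric.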
Web crawling may be untargeted, i.e. every page is to be indexed respective to a wide range of aspects. Alternatively, in focused web crawling, the goal is to index respective to a particular aspect, that is, respective to the focus of the web crawling. As illustrative examples, focused web crawling may be used to locate web pages specifically related to climate change, or may be used to locate pages in a specific language, such as Hindi.
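As a rough illustration of a language focus, the following sketch retains only pages whose letters are predominantly in the Devanagari script. This is an assumed heuristic, not a method taken from the foregoing: script detection is only a crude proxy for language, since Devanagari is also used for languages other than Hindi. The names devanagari_fraction and in_focus, the 0.5 threshold, and the sample pages are hypothetical.

```python
def devanagari_fraction(text):
    """Fraction of alphabetic characters in the Devanagari block (U+0900 to U+097F)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if "\u0900" <= ch <= "\u097f")
    return hits / len(letters)

def in_focus(text, threshold=0.5):
    """Assumed focus test: most letters on the page are Devanagari."""
    return devanagari_fraction(text) >= threshold

# Hypothetical sample pages: identifier -> page text.
sample_pages = {
    "p1": "नमस्ते दुनिया",
    "p2": "hello world",
    "p3": "",
}

focused = [url for url, text in sample_pages.items() if in_focus(text)]
```

An untargeted crawler would index all three pages; the focused variant keeps only those passing the focus test.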
While web crawling is a common application, more generally crawling can be applied to any database containing documents hyperlinked to other documents in the database (and, optionally, to other documents outside the database). As another variant, web crawling can be constrained to a particular domain or portion of the Internet, e.g. to a particular wiki.
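One possible way to constrain a crawl to a particular domain is to resolve each hyperlink against the current page and discard links whose host differs from the permitted host. The sketch below uses Python's standard urllib.parse module; the host wiki.example.org and the function names in_scope and scoped_links are hypothetical.

```python
from urllib.parse import urljoin, urlparse

def in_scope(url, allowed_host):
    """True if the URL's host matches the permitted host exactly."""
    return urlparse(url).hostname == allowed_host

def scoped_links(base_url, hrefs, allowed_host):
    """Resolve relative hyperlinks, then keep only those inside the permitted host."""
    resolved = (urljoin(base_url, href) for href in hrefs)
    return [url for url in resolved if in_scope(url, allowed_host)]

links = scoped_links(
    "https://wiki.example.org/wiki/Main_Page",
    ["/wiki/Crawler", "https://other.example.com/page", "Index"],
    "wiki.example.org",
)
```

Here the link to other.example.com is dropped, while the two relative links are resolved and retained. An exact host match is the simplest rule; a crawler might instead permit subdomains or restrict by path prefix.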