In web indexing, search engines typically collect and organize data from the web to facilitate efficient information retrieval. By using an index, a search engine may avoid scanning every document in a corpus and instead rely on the index to fulfill search queries. Typically, a web crawler begins the process of web indexing by fetching web pages. There are several types of crawlers, including static crawlers, dynamic crawlers, and interactive crawlers, as further described herein.
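By way of illustration only, the following minimal sketch shows how an index may let a query be answered without scanning every document. It assumes a simple inverted index over whitespace-separated terms; the names `build_inverted_index`, `search`, and the three-document `corpus` are hypothetical and are not drawn from any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Only the posting sets of the query terms are consulted;
    # the document texts themselves are never rescanned.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical toy corpus used only for this sketch.
corpus = {
    "doc1": "web crawler fetches web pages",
    "doc2": "search engine builds an index",
    "doc3": "crawler feeds the search index",
}
index = build_inverted_index(corpus)
matches = search(index, "search index")
```

In this sketch, a query touches only the posting sets for its own terms, so lookup cost depends on the number of query terms and the size of their posting sets rather than on the size of the corpus.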
Traditional link-based crawlers, which reach web pages by following outlinks from seed uniform resource locators (URLs) with static content, may not access web pages that exist in the deep or hidden Web. Pages within the hidden Web are accessible only after they are created dynamically in response to some input to a web page, usually a web user filling in and submitting a web form. Few hyperlinks may point to such form-generated pages, and few of those pages have hyperlinks pointing to them from general seed URLs. The hidden Web may also include pages that are accessible only through links produced by scripted content such as JavaScript, Flash, or AJAX.
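The reachability problem described above can be sketched as follows. This is an illustrative model only: the site is a hypothetical in-memory table of static outlinks (`STATIC_LINKS`, with made-up `example.test` URLs) rather than live HTTP fetches, and `crawl` is an assumed name for a plain breadth-first traversal from the seed URLs.

```python
from collections import deque

# Hypothetical site: each URL maps to the static outlinks found on that page.
STATIC_LINKS = {
    "https://example.test/": [
        "https://example.test/about",
        "https://example.test/search",
    ],
    "https://example.test/about": ["https://example.test/"],
    # The search page holds a form but no static links to its results.
    "https://example.test/search": [],
    # Created only after the form is submitted; no static page links here.
    "https://example.test/results?q=shoes": ["https://example.test/item/42"],
    "https://example.test/item/42": [],
}


def crawl(seeds, links):
    """Breadth-first traversal over static outlinks, starting from the seeds."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


reachable = crawl(["https://example.test/"], STATIC_LINKS)
hidden = set(STATIC_LINKS) - reachable
```

Under these assumptions, the crawler reaches the page containing the form but never the results page or the item page behind it, because no static outlink leads to either.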
To index the hidden Web, some search engines introduce algorithms that generate queries for input into forms on a web page. The queries may be constructed by analyzing the static content of the web page and extracting keywords. A common technique is based on term frequency-inverse document frequency (TF-IDF). The queries may be limited to default values if default values for a particular input or control exist. In this manner, only a small number of input combinations on non-scripted forms generate URLs for inclusion into the web index. Moreover, these generated URLs may contain a large number of invalid combinations, while dependent controls on non-scripted forms and dependencies between various controls may be ignored, leading to a large number of invalid web pages. The URLs that are found to be valid may nonetheless be excluded based on a lack of distinctiveness or low informativeness.
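As a rough illustration of keyword extraction for query generation, the sketch below ranks the terms of a page by a basic TF-IDF score. It assumes whitespace tokenization, term frequency as a count divided by page length, and inverse document frequency as the natural logarithm of the number of documents over the document frequency; other weightings are common. The function `tfidf_keywords` and the sample texts are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(page_text, background_docs, k=3):
    """Return the k terms of page_text with the highest TF-IDF score."""
    page_terms = page_text.lower().split()
    # The page itself is counted as one document of the collection.
    docs = [set(d.lower().split()) for d in background_docs]
    docs.append(set(page_terms))
    n_docs = len(docs)
    scores = {}
    for term, count in Counter(page_terms).items():
        tf = count / len(page_terms)
        df = sum(1 for d in docs if term in d)  # at least 1: the page itself
        scores[term] = tf * math.log(n_docs / df)
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical page text and background collection.
page = "used cars for sale find used cars by make and model"
background = [
    "find hotels for your trip",
    "books for sale and rent",
    "find a recipe by ingredient",
]
keywords = tfidf_keywords(page, background, k=2)
```

Here the terms that are frequent on the page but absent from the background texts rank highest, so they would be the candidate values submitted to a form. The sketch does not model default values or dependencies between controls, which is the limitation noted in the paragraph above.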