The present disclosure relates to automatic crawling of web content, and more specifically, to automatic crawling of web content having encoded and dynamic URLs.
A web-crawler searches, or spiders, websites in an automated way to gather and analyze information for different purposes. The automatic navigation is based on identifying visited webpages by their uniform resource locators (URLs) and discovering new ones. Many sites, in particular search engines, use automatic crawling as a means of providing up-to-date data. Web-crawlers can also be used to automate maintenance tasks on websites, such as checking links or validating HTML code.
A web-crawler starts with a list of URLs to visit, generally called seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the pages and adds them to the list of URLs to visit. These URLs are recursively visited according to a set of policies. Web-crawlers are often used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Web-crawlers often need to determine whether a particular webpage, or URL, is unique. This information is used to determine whether a subsequently visited webpage, or URL, is new or a duplicate of one visited before.
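The seed-and-frontier loop described above can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the names `crawl`, `LinkExtractor`, `fetch`, and `max_pages` are hypothetical, the page fetcher is passed in by the caller, and uniqueness is assumed to mean an exact match on the URL string after its fragment is removed. That assumption is a deliberate simplification and would not by itself recognize encoded or dynamic URLs that point to the same page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) is a caller-supplied function (an assumption of this
    sketch) that returns the page HTML as a string, or None on failure.
    Returns a dict mapping each visited URL to its HTML.
    """
    frontier = deque(seeds)   # URLs still to visit
    visited = set()           # URLs already seen, used for the duplicate check
    pages = {}                # copy of every page downloaded so far

    while frontier and len(pages) < max_pages:
        # Assumed uniqueness policy: exact URL string, fragment removed.
        url, _ = urldefrag(frontier.popleft())
        if url in visited:
            continue          # duplicate of a URL visited before
        visited.add(url)

        html = fetch(url)
        if html is None:
            continue
        pages[url] = html

        # Identify the hyperlinks in the page and queue the new ones.
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in visited:
                frontier.append(absolute)

    return pages
```

A caller would supply its own `fetch`, for example a wrapper around an HTTP client, and a list of seed URLs. The `max_pages` limit stands in for the broader set of crawl policies mentioned above.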