A web crawler (also referred to as a bot, spider, ant, etc.) is typically configured to browse the web in a methodical, automated manner with the goal of acquiring information about content posted to a webpage. The information acquired may vary depending upon its intended use. For example, crawlers that are used in conjunction with search engines may be designed to make a copy of the webpage being crawled so that the search engine may index the webpage. Crawlers may also be used for maintenance checks of a webpage (e.g., making sure links in the page work properly) and/or to collect email addresses or hyperlinks listed on a webpage.
Unfortunately, the web contains more webpages than a crawler is capable of crawling. Additionally, webpages change so quickly (e.g., as new content is posted on existing webpages, content is deleted from existing webpages, new webpages are created, etc.) that it is difficult for crawlers to know which webpages to crawl. Therefore, web crawlers are programmed with a first list of uniform resource locators (URLs) that identifies which webpages the crawler is supposed to crawl. When a webpage from the first list is crawled, its URL is added to a second list that contains all of the URLs that have been crawled (e.g., so that the webpage is not crawled again before other webpages are crawled). Additional URLs can be added to the first list based upon the content of a crawled webpage (e.g., a URL on a crawled webpage that identifies another webpage). The web crawler continuously cross-checks the two lists so that the same webpage is not crawled twice in the same pass. Once all of the webpages on the first list have been crawled, the crawler erases the second list and repeats the process.
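The two-list scheme described above can be sketched roughly as follows. This is a minimal illustration, not any particular crawler's implementation: the names `crawl_pass`, `fetch_links`, `FAKE_WEB`, and `fake_fetch_links` are hypothetical, and an in-memory dictionary stands in for real webpage fetches so that the example runs without network access.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_pass(seed_urls: Iterable[str],
               fetch_links: Callable[[str], Iterable[str]]) -> list[str]:
    """Run one pass over the to-crawl list and return URLs in crawl order."""
    to_crawl = deque(seed_urls)  # list of URLs still to be crawled
    crawled = set()              # second list: URLs already crawled this pass
    order = []

    while to_crawl:
        url = to_crawl.popleft()
        if url in crawled:       # cross-check: skip anything already crawled
            continue
        crawled.add(url)
        order.append(url)
        # URLs found on the crawled page are added to the to-crawl list.
        for link in fetch_links(url):
            if link not in crawled:
                to_crawl.append(link)

    # The crawled set is local to this call, so it is discarded when the
    # pass ends; calling crawl_pass again repeats the process from scratch.
    return order


# Hypothetical in-memory "web": each URL maps to the URLs found on that page.
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["d"],
    "d": [],
}


def fake_fetch_links(url: str) -> list[str]:
    """Stand-in for fetching a page and extracting its hyperlinks."""
    return FAKE_WEB.get(url, [])
```

Calling `crawl_pass(["a"], fake_fetch_links)` visits each page of the toy graph once, even though pages link back to one another. Because a page is revisited only on the next full pass, the delay before new content is seen grows with the number of URLs on the list.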
While current web crawlers are effective, it may be hours, if not days, before a crawler revisits a webpage. Content that is relevant for only a short period of time, such as content from social media webpages (e.g., social networking sites, blogs, microblogs, etc.), may not be crawled while it is still relevant. Accordingly, a web crawler that can crawl a large number of webpages while maintaining a low latency between the time of publication on the webpage and the time of collection would be useful.