As both the usage and size of the Internet have increased, the importance of providing fresh and relevant web content has also increased. Web crawlers are often used to crawl path identifiers, such as Uniform Resource Locators (URLs), to index and copy web content associated with the path identifiers. The web content associated with the path identifiers may then be processed by a search engine. A web crawler may crawl path identifiers according to a path identifier list. The path identifier list typically provides an order in which the path identifiers are to be crawled by the web crawler. After a page associated with a path identifier is crawled, the path identifier is often added to the end of the path identifier list. When the web crawler reaches the end of the list, it may re-crawl the page to identify changes and new content.
Current systems order the path identifier list utilizing common ordering methods such as a “first in first out” (FIFO) or “last in first out” (LIFO) order. The web crawler may crawl each identifier based on the ordering method. While these methods provide an order for the path identifier list, they do not provide any priority to the listed path identifiers. As a result, path identifiers associated with web content deemed to be of higher importance and/or known to be regularly updated will be crawled at the same rate as other path identifiers in the path identifier list that may identify web content deemed to be less important and/or known to be rarely modified.
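The FIFO behavior described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any particular crawler: the function name `fifo_crawl`, the `steps` parameter, and the example URLs are assumptions introduced here, and the actual fetching of a page is omitted.

```python
from collections import Counter, deque


def fifo_crawl(path_identifiers, steps):
    """Return the identifiers in the order a FIFO crawler would visit them.

    Hypothetical helper: fetching and indexing are omitted, so only the
    ordering of the path identifier list is modeled.
    """
    pending = deque(path_identifiers)
    crawled = []
    for _ in range(steps):
        if not pending:
            break
        # FIFO: take the identifier that has waited longest.
        current = pending.popleft()
        crawled.append(current)
        # After the crawl, the identifier returns to the end of the list.
        pending.append(current)
    return crawled


# Example identifiers (assumed for illustration only).
urls = [
    "http://example.com/news",
    "http://example.com/about",
    "http://example.com/archive",
]
counts = Counter(fifo_crawl(urls, 9))
```

After nine crawls, every identifier in `counts` has been visited three times. A frequently updated page and a rarely modified one receive the same crawl rate, which is the limitation noted above.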
To provide prioritization of the path identifiers in the path identifier list, some systems order the path identifier list according to priority values. For example, each path identifier may be assigned a priority value between 1 and 100, and path identifiers may be selected from the path identifier list in an order corresponding to their assigned priority values. As a result, path identifiers assigned a higher priority value will always be crawled before path identifiers assigned a lower priority value. Unfortunately, these types of methods often lead to “starvation” of some of the listed path identifiers. For example, path identifiers assigned a low priority value may never be crawled by the web crawler because path identifiers with a higher priority value continually take precedence and/or are continually added to the path identifier list.
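The starvation effect can be reproduced with a small simulation. The sketch below is an assumption-laden illustration rather than a description of any real system: `strict_priority_crawl`, the `new_priority` parameter, the arrival of one new high-priority identifier per step, and the example URLs are all hypothetical.

```python
import heapq


def strict_priority_crawl(entries, steps, new_priority=90):
    """Crawl strictly by priority while new high-priority identifiers arrive.

    entries: list of (priority, identifier) pairs, priority between 1 and 100.
    Hypothetical model: one new identifier with `new_priority` is added
    before every crawl step.
    """
    # heapq is a min-heap, so priorities are negated to pop the highest first.
    # The sequence number breaks ties in insertion order.
    heap = [(-priority, seq, ident) for seq, (priority, ident) in enumerate(entries)]
    heapq.heapify(heap)
    seq = len(heap)
    crawled = []
    for step in range(steps):
        # Assumed arrival of a newly discovered high-priority identifier.
        heapq.heappush(heap, (-new_priority, seq, f"http://example.com/new/{step}"))
        seq += 1
        # Always crawl the identifier with the highest priority value.
        _, _, ident = heapq.heappop(heap)
        crawled.append(ident)
    return crawled


entries = [(100, "http://example.com/news"), (5, "http://example.com/archive")]
crawled = strict_priority_crawl(entries, steps=50)
```

In this run the priority-100 identifier is crawled first, and every later step is consumed by one of the newly added priority-90 identifiers. The priority-5 identifier never appears in `crawled`, no matter how large `steps` is made.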
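As a result of these techniques and others, web crawlers are often limited in their ability to crawl important or regularly updated web content promptly without neglecting the remaining path identifiers in the path identifier list.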