1. Field of the Invention
This invention pertains in general to a web crawling selection policy and in particular to prioritization of web crawler queues based on referrer context information for a remote object link received via a network.
2. Description of the Related Art
Conventional web crawlers have little useful information with which to prioritize the target link queues they use for inspecting links, e.g., for a search engine. Web crawlers typically either rely on first-in, first-out (FIFO) selection policies or prioritize their target link queues by inbound link popularity, i.e., the count of other sites and/or objects that point to a given link. With either method, web crawlers take a long time to work through the queue. This puts them significantly behind human-based or self-spreading link distribution channels, because links that potentially should be prioritized higher relative to other links in the queue are re-indexed or investigated late.
There are several disadvantages to conventional FIFO or inbound link-count prioritization of target link queues, e.g., by web crawlers associated with search engines. For a web search engine, if a link is very popular but is not highly prioritized for investigation, e.g., because it only recently entered the queue, then the link will remain unindexed despite receiving a large amount of traffic.
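The two conventional selection policies described above can be illustrated with a minimal sketch. The function names, example URLs, and inbound link counts below are hypothetical and serve only to show why a recently queued link is visited last under both policies; they do not describe any particular crawler.

```python
import heapq
from collections import deque


def fifo_order(links):
    """Return links in the order a FIFO policy would visit them (arrival order)."""
    queue = deque(links)
    order = []
    while queue:
        order.append(queue.popleft())
    return order


def inbound_count_order(links, inbound_counts):
    """Return links ordered by descending inbound link count.

    Ties are broken by arrival order. Links with no recorded count are
    treated as having zero inbound links (an assumption of this sketch).
    """
    heap = [(-inbound_counts.get(url, 0), seq, url) for seq, url in enumerate(links)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, url = heapq.heappop(heap)
        order.append(url)
    return order


# Hypothetical queue contents, listed in arrival order.
arrivals = [
    "http://a.example/old",
    "http://b.example/popular",
    "http://c.example/new",  # just arrived; few sites point to it yet
]
# Hypothetical inbound link counts known to the crawler.
inbound_counts = {
    "http://a.example/old": 3,
    "http://b.example/popular": 40,
    "http://c.example/new": 0,
}

print(fifo_order(arrivals))
print(inbound_count_order(arrivals, inbound_counts))
```

In this sketch the newest link is visited last under both policies, regardless of how much traffic it is currently receiving, because neither policy has any input that reflects that traffic.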
Many malicious attacks on computer systems are received as remote object links in network traffic, such as email, instant messaging, or HTTP traffic associated with a web site. In the context of a threat scanner search engine, if a link is malicious but fairly new, then the link will have time to attack many different users' computers before being identified by the threat scanner.
Traditional web crawler selection policies lack access to referrer context information about remote objects associated with links received in network traffic. Referrer context information allows the entity that provided (or received) a link to be ascertained, as well as the protocol in which it was received and other aspects of the transmission. Referrer context information can be an important resource in identifying how a link moves from one client to another.
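As an illustration only, referrer context information of the kind described above could be represented as a simple record. The field names and the helper `propagation_hops` are assumptions made for this sketch and are not defined by the text; "other aspects of the transmission" are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferrerContext:
    """Hypothetical record of one observation of a link in network traffic."""
    link: str       # the remote object link that was observed
    sender: str     # entity that provided the link
    recipient: str  # entity that received the link
    protocol: str   # protocol in which the link was received, e.g. "email"


def propagation_hops(records, link):
    """Return (sender, recipient, protocol) for each observation of the link,
    in the order the records were supplied."""
    return [(r.sender, r.recipient, r.protocol) for r in records if r.link == link]


# Hypothetical observations.
records = [
    ReferrerContext("http://x.example/obj", "client-1", "client-2", "email"),
    ReferrerContext("http://y.example/obj", "client-3", "client-4", "http"),
    ReferrerContext("http://x.example/obj", "client-2", "client-5", "im"),
]

print(propagation_hops(records, "http://x.example/obj"))
```

Collecting such records per link gives a view of how the link moved from one client to another, which is information the conventional policies do not have.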