Web search engines process very large numbers of queries across web indexes built for billions of pages and petabytes of data (both of which may continue to grow at a rapid pace). Web search engines may also have tight latency constraints. Search engines may typically use three major components: caches, index servers, and document servers. Large result caches may be used to store previously computed query results over billions of pages crawled by large clusters. As the crawlers crawl these pages and retrieve page contents, there may be two major challenges. First, pages may change more frequently than a crawler can update a search engine. This may lead to outdated search results: the page may no longer be found when a user clicks on a result, or its content may differ from the summary content displayed in a search result listing.
Second, a malicious website may be able to distinguish between types of requestors, such as a search engine bot versus a user at a browser. Specifically, a malicious website may identify a search engine bot (or crawler) via User-Agent properties in a request, an associated IP range, or other request attributes, and may respond with a first set of content for a page that may be designed to achieve a high search engine ranking. Many automated attack kits may build large numbers of pages for top searched keywords, link them together, and even build the content dynamically using other pages such as, for example, wiki entries, news entries, etc. These attack kits may be directed towards getting a page listed in the top search results for the most searched keywords in order to attract users. A search results listing may contain brief text along with the key subject of each page, which may persuade a user that the page is genuine and relevant. Once a user clicks on a search result and the malicious site determines that the request is not from a crawler or search bot, the malicious site may deliver completely different content such as, for example, a fake anti-virus scan page, may perform other social engineering attacks to trick the user into downloading malicious content, or may even perform “drive-by download” attacks in which malware may be automatically downloaded. Web users may suffer from malware, misleading applications, and/or stale content.
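By way of illustration only, the requestor-dependent behavior described above may be sketched as follows. The sketch simulates, without any network access, a site that inspects the User-Agent and source IP address of a request, and a simple comparison that may reveal the difference between the content served to a crawler and the content served to a browser. All names and values (`Request`, `looks_like_crawler`, `cloaking_site`, `content_differs`, the `examplebot` token, and the documentation-only address ranges) are hypothetical assumptions and are not drawn from any particular search engine or attack kit.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical identifiers, chosen for illustration only.
CRAWLER_UA_TOKENS = ("examplebot",)
# 192.0.2.0/24 is a range reserved for documentation, used here as a stand-in.
CRAWLER_NETWORKS = (ipaddress.ip_network("192.0.2.0/24"),)


@dataclass(frozen=True)
class Request:
    user_agent: str
    source_ip: str


def looks_like_crawler(request: Request) -> bool:
    """Classify a request as a crawler by User-Agent token or source IP range."""
    ua = request.user_agent.lower()
    if any(token in ua for token in CRAWLER_UA_TOKENS):
        return True
    addr = ipaddress.ip_address(request.source_ip)
    return any(addr in network for network in CRAWLER_NETWORKS)


def cloaking_site(request: Request) -> str:
    """Simulated site that serves different content depending on the requestor."""
    if looks_like_crawler(request):
        return "Keyword-rich article about a popular search topic"
    return "Fake anti-virus scan page"


def content_differs(site, crawler_request: Request, browser_request: Request) -> bool:
    """Return True if the site answers the two request profiles differently."""
    return site(crawler_request) != site(browser_request)


crawler = Request(user_agent="ExampleBot/1.0", source_ip="192.0.2.10")
browser = Request(user_agent="Mozilla/5.0", source_ip="198.51.100.7")
print(content_differs(cloaking_site, crawler, browser))  # True
```

A difference between the two responses may not by itself indicate malicious intent, since legitimate pages may also vary their content between requests; the comparison is shown only to make the described behavior concrete.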
In view of the foregoing, it may be understood that there may be significant problems and shortcomings associated with current search engine results management technologies.