The present disclosure relates generally to the field of search engines in computer network systems, and in particular to systems and methods of detecting duplicate documents in a web crawler system.
Search engines provide a powerful source of indexed documents from the Internet that can be rapidly scanned. However, as the number of documents on the Internet grows, the delay between the time a web page is crawled by a robot and the time it can be indexed and made available to a search engine grows ever longer. Likewise, it takes ever longer to replace or update a page once it has been indexed. These latency problems have seriously affected the freshness of the search results provided by a search engine.
Meanwhile, it is becoming increasingly common for many duplicate copies of a document to share identical content even though they may be physically stored on different web servers. On the one hand, these duplicate copies are welcome because they reduce the possibility that shutting down one web server will render the documents on that web server unavailable; on the other hand, if not dealt with appropriately, they can significantly increase the workload and lower the efficiency of a search engine on both its front end and back end.
For example, on the back end of a search engine, if duplicate copies of the same document are treated as different documents whose content is unrelated, the search engine wastes resources, such as disk space, memory, and/or network bandwidth, in processing and managing the duplicate documents. On the front end, retaining duplicate documents forces the search engine to search through larger indices and to use more processing power to process queries. Also, a user's experience may suffer if diverse content that should be included in the search results is crowded out by duplicate documents.
For these reasons, it would be desirable to develop a system and method of detecting duplicate documents crawled by a search engine before the search engine makes any further effort to process these documents. It would also be desirable to manage these duplicate documents in an efficient manner such that the search engine can efficiently furnish the most appropriate and reliable content when responding to a query whose result set includes any of these duplicate documents.
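By way of illustration only, the notion of duplicate documents that share identical content while residing on different web servers can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the present disclosure: it assumes that duplicates are byte-for-byte identical and that a SHA-256 digest of the content serves as a fingerprint. The function names `content_fingerprint` and `group_duplicates` and the example URLs are hypothetical.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(content: bytes) -> str:
    """Return a fixed-length fingerprint of a document's content.

    Assumption: SHA-256 over the raw bytes; the disclosure does not
    prescribe a particular fingerprint function in this section.
    """
    return hashlib.sha256(content).hexdigest()


def group_duplicates(crawled):
    """Group URLs whose content fingerprints are equal.

    `crawled` is an iterable of (url, content_bytes) pairs. Only groups
    containing more than one URL (i.e. actual duplicates) are returned.
    """
    groups = defaultdict(list)
    for url, content in crawled:
        groups[content_fingerprint(content)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]


# Hypothetical crawl results: two servers host identical content.
crawled_pages = [
    ("http://a.example/page", b"<html>same body</html>"),
    ("http://b.example/mirror", b"<html>same body</html>"),
    ("http://c.example/other", b"<html>different body</html>"),
]

duplicate_groups = group_duplicates(crawled_pages)
```

In such a sketch, a search engine could index one representative of each group rather than every copy, which is the kind of saving in disk space, memory, and query processing discussed above. Exact hashing of this kind would not recognize copies that differ in even a single byte.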