When a search engine is used to search for desired information, the result returned by the search engine often contains a large number of links with duplicate content, or even dead links, which makes it time-consuming and inconvenient for a user to acquire the information. Because the number of Internet websites is enormous, the workload of a crawler, which is one of the core modules of a search engine, and the volume of data to be read and written by the crawler are likewise enormous. If web pages with duplicate content can be eliminated quickly and with high accuracy, duplicate information is not fed back to the user, and system resources are saved for subsequent processing.
In the prior art, a hash calculation is performed on the main body of a candidate web page, a set of web pages with stored hash values is retrieved, and it is determined whether the number of identical hash values exceeds a given threshold. If the number of identical hash values exceeds the given threshold, the candidate web page is considered a duplicate web page. However, this approach has low accuracy. It can identify a web page as a duplicate only when all of its words are unchanged, and it cannot perform de-duplication on a new web page that is formed by deleting or adding some sentences on the basis of an original web page.
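The limitation described above can be illustrated with a minimal sketch. It is one possible reading of the prior-art approach, not a definitive implementation: the choice of SHA-256, the function names body_hash and is_duplicate, the default threshold of 0, and the sample page texts are all assumptions introduced for illustration.

```python
import hashlib
from typing import Iterable


def body_hash(main_body: str) -> str:
    """Hash the full main-body text of a page (SHA-256 is an assumed choice)."""
    return hashlib.sha256(main_body.encode("utf-8")).hexdigest()


def is_duplicate(candidate_body: str, stored_hashes: Iterable[str], threshold: int = 0) -> bool:
    """Return True if the count of stored hashes equal to the candidate's hash exceeds threshold."""
    candidate = body_hash(candidate_body)
    matches = sum(1 for stored in stored_hashes if stored == candidate)
    return matches > threshold


# Hypothetical sample data: one stored page and two candidate pages.
original = "Alpha sentence. Beta sentence. Gamma sentence."
stored = [body_hash(original)]

exact_copy = "Alpha sentence. Beta sentence. Gamma sentence."
near_copy = "Alpha sentence. Beta sentence. Gamma sentence. One added sentence."

print(is_duplicate(exact_copy, stored))  # exact copy: hash matches a stored hash
print(is_duplicate(near_copy, stored))   # one added sentence: hash no longer matches
```

In this sketch the exact copy is detected, whereas the page with a single added sentence produces an entirely different hash value and is therefore not recognized as a duplicate, which is the accuracy problem noted above.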