With the development of Internet technologies, the Internet has become an important source of information for people. However, much of the information on the Internet is duplicated. Among billions or tens of billions of web pages, a large number contain duplicated information, which makes information processing difficult. Therefore, it is very important to remove duplicated web pages.
A current method may remove duplicated web pages by selecting feature codes in the web pages and comparing the feature codes. This existing method may include first selecting a period (the mark showing the end of a sentence) in a first web page as a locating point, and selecting a certain number of characters (e.g., Chinese characters or English characters) on both sides of the locating point as a feature code. The method may also include acquiring another feature code in a second web page by the same steps. The method may further include comparing the feature codes of the two web pages. If the feature codes of the two web pages are the same, the method may include determining that the second web page is a duplicated web page, and discarding the duplicated second web page. If the two feature codes are different, the method may include determining that the two web pages are different. In other words, the second web page is not a duplicate of the first web page.
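The existing approach described above can be illustrated with a minimal Python sketch. It is not the implementation of any particular system. The function names `feature_code` and `is_duplicate`, the `span` of five characters per side, and the choice to use every period in the text as a locating point are all assumptions made for illustration only.

```python
def feature_code(text, span=5, mark="."):
    """Return the characters found on both sides of each locating point.

    Hypothetical helper: `span` and the use of every `mark` in the text
    are illustrative assumptions, not values taken from the description.
    """
    codes = []
    for i, ch in enumerate(text):
        if ch == mark:
            left = text[max(0, i - span):i]      # up to `span` characters before the period
            right = text[i + 1:i + 1 + span]     # up to `span` characters after the period
            codes.append(left + right)
    return tuple(codes)


def is_duplicate(first_page, second_page, span=5):
    """Treat the second page as a duplicate when the feature codes match."""
    return feature_code(first_page, span) == feature_code(second_page, span)


page_a = "Web pages are often copied. Copies waste storage."
page_b = "Web pages are often copied. Copies waste storage."
page_c = "Search engines index pages. Indexes must stay small."

print(is_duplicate(page_a, page_b))  # identical text, identical feature codes
print(is_duplicate(page_a, page_c))  # different text around the periods
```

In this sketch the comparison looks only at the few characters around each period, so any text that lies farther from a period does not influence the decision.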
A potential problem of the existing method for removing duplicated web pages based on feature codes is that it may make a wrong decision for two web pages with the same feature code but different contents. For example, a first web page may include a poem of several dozen characters. A user may incorporate that content of the first web page into a second web page, and explain the poem in hundreds of characters according to his or her understanding. The explanation may not include any period. If the method for removing a duplicated web page is based merely on feature codes, these two web pages may be determined to be the same web page, although they are in fact different web pages. Therefore, the accuracy of the above method for removing duplicated web pages may not be high. In addition, the feature codes extracted in the above method may be inaccurate. For example, the user may add a period in a caption or an edit accompanying the incorporated web page. When the feature codes are extracted according to the existing method, the feature codes of the original web page and of the web page incorporating the forwarded text are different. As a result, the original web page and the web page incorporating the forwarded text may be determined to be different web pages, even though their texts may be the same.
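Both failure cases can be reproduced with a small Python sketch, offered only as an illustration under stated assumptions. The helper `feature_code`, the five-character `span`, the use of every period as a locating point, and the sample texts are all hypothetical and are not taken from any real system.

```python
def feature_code(text, span=5, mark="."):
    """Hypothetical helper: characters on both sides of each period."""
    return tuple(
        text[max(0, i - span):i] + text[i + 1:i + 1 + span]
        for i, ch in enumerate(text)
        if ch == mark
    )


# Case 1: a poem, and the same poem followed by a long explanation
# that contains no period. The contents differ, yet the codes match.
poem = "The moon is bright. Frost lies on the ground"
commentary = poem + " and here the reader adds a long note with no sentence mark"
false_duplicate = feature_code(poem) == feature_code(commentary)

# Case 2: the same poem forwarded with a short caption that contains
# a period. The forwarded text is unchanged, yet the codes differ.
captioned = "Worth reading. " + poem
false_distinct = feature_code(poem) != feature_code(captioned)

print(false_duplicate)  # True: different pages judged to be the same
print(false_distinct)   # True: same forwarded text judged to be different
```

The first case arises because text without a period contributes nothing to the feature code. The second arises because a single added period introduces a new locating point.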