1. Field of Invention
The present invention relates to a filtering method, and more particularly to a method for filtering out identical or similar documents from a plurality of documents and clustering the documents by using a computer.
2. Related Art
An Internet search engine is a tool that helps a user quickly search the vast Internet for data.
Generally speaking, a search engine presents every result matching a searched keyword to the user without performing any filtering operation, even if several of the web pages have identical contents. Although a few search engines filter the search results, highly similar web pages still appear repeatedly.
Published PRC Patent No. CN101093485A discloses a “Method for filtering out repeated contents on web page”, which uses a file server, a web page content extraction server, a web page filtering server, and a crawler server. The method includes: a) the crawler server fetches data from a web page and transmits the data to the web page content extraction server for analysis; b) the web page content extraction server extracts the contents and generates hash codes by using a hash algorithm, and then stores the hash codes, the contents, the fetching time, and other information in the file server; and c) the web page filtering server analyzes the information in the file server, calculates, for each website, the number of conflicts among the hash codes obtained in step b), and sets a threshold for the number of conflicts and the number of web pages in the website. If the number of conflicts in a website and the number of web pages in the website are higher than the threshold, the web page filtering server directly notifies the crawler server to prohibit the website and filters out all contents of the web page. If the number of conflicts in a website and the number of web pages in the website are lower than the threshold and the data was fetched at an early time, the importance of the web page is increased; otherwise, the importance of the web page is lowered or the web page is filtered out.
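For illustration only, the hash-conflict filtering described in steps b) and c) might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the cited patent: the function names `content_hash` and `filter_pages`, the use of SHA-1, the whitespace and case normalization, the two separate thresholds, and the rule that the earliest-fetched page among conflicting pages is the one whose importance is increased are all assumptions made for this example.

```python
import hashlib
from collections import Counter, defaultdict


def content_hash(text):
    """Hash code for extracted page content (step b)."""
    # Assumption: collapse whitespace and lowercase before hashing.
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def filter_pages(pages, conflict_threshold, page_threshold):
    """Hypothetical filtering step (step c).

    pages: list of (site, url, content, fetch_time) tuples.
    Returns (banned_sites, actions), where actions maps url -> action.
    """
    groups = defaultdict(list)  # hash code -> pages sharing that code
    totals = Counter()          # site -> number of fetched pages
    for site, url, content, fetch_time in pages:
        groups[content_hash(content)].append((fetch_time, site, url))
        totals[site] += 1

    conflicts = Counter()  # site -> pages whose hash code conflicts
    conflicting = set()    # URLs that share a hash code with another page
    earliest = set()       # earliest-fetched URL of each conflicting group
    for members in groups.values():
        if len(members) > 1:
            members.sort()  # orders by fetch_time first
            earliest.add(members[0][2])
            for _, site, url in members:
                conflicts[site] += 1
                conflicting.add(url)

    # A site is prohibited when both counts exceed their thresholds.
    banned = {
        site for site in totals
        if conflicts[site] > conflict_threshold and totals[site] > page_threshold
    }

    actions = {}
    for site, url, _, _ in pages:
        if site in banned:
            actions[url] = "filter"
        elif url not in conflicting:
            actions[url] = "keep"
        elif url in earliest:
            actions[url] = "boost"   # fetched first: importance increased
        else:
            actions[url] = "demote"  # later copy: importance lowered
    return banned, actions


if __name__ == "__main__":
    sample = [
        ("a.com", "a/1", "Hello World", 1),
        ("b.com", "b/1", "hello   world", 2),
        ("b.com", "b/2", "hello world", 3),
        ("b.com", "b/3", "unique text", 4),
        ("a.com", "a/2", "other text", 5),
    ]
    print(filter_pages(sample, conflict_threshold=1, page_threshold=2))
```

Note that a scheme of this kind only detects pages whose contents hash to the same code, so pages that differ slightly produce different hash codes and are not treated as conflicts.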