1. Field of the Invention
Exemplary embodiments of the present invention relate to a method for detecting an original document of a web document, and more particularly, to a method for detecting an original document from several copied on-line documents.
2. Discussion of the Background
With the development and spread of the Internet, various Internet-based services are provided, and a search service is a representative example. A search service is a service in which, when a user inputs a word or combination of words as a query, a search engine provides the user with search result documents corresponding to the query. Such search result documents are classified into categories such as Dictionary, Information, Blog, Cafe, Specialized Data, Site, Book, Webpage, Moving Picture, and the like, and are provided to users by category.
Recently, there has been an increase in search result documents that are obtained by copying documents made by other users and posting the copies on a user's own blog or cafe, as opposed to originally creating documents on a specific theme. This is because documents on the Internet can be easily copied. As a result, a user makes a document by reproducing in full an original document from a newspaper article, specialized data, or another user's blog or cafe, or by selectively copying a desired part of the original document. Such a copied document is identical or substantially identical to the original document. Therefore, a plurality of copied documents that are identical or substantially identical to the original document may exist among the search result documents. In this case, when the copied documents are ranked above the original document rather than below it, accurate search results are not provided to users.
To solve such a problem, there exist several methods for determining which document, among an original document and its copies, is the original. However, since the copied documents are identical or substantially identical to the original document, in practice it is difficult to determine the original document from content alone. In addition, the web document having the earliest distribution time is generally determined to be the original document. However, when the distribution time is manipulated, this criterion becomes unreliable, and it is even more difficult to determine the original document.
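The earliest-distribution-time approach described above, and its weakness, may be illustrated by the following minimal sketch. The class `WebDocument`, the function `pick_earliest`, the URLs, and the timestamps are hypothetical and are introduced only for illustration; they are not taken from any particular search system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class WebDocument:
    """Hypothetical record for one of several near-identical web documents."""
    url: str
    distributed_at: datetime  # distribution time as reported by the document


def pick_earliest(documents):
    """Conventional heuristic: treat the earliest-distributed document as the original."""
    if not documents:
        raise ValueError("at least one document is required")
    return min(documents, key=lambda doc: doc.distributed_at)


# Illustrative data only (assumed values).
original = WebDocument("https://news.example/article", datetime(2009, 3, 1, 9, 0))
copy = WebDocument("https://blog.example/post", datetime(2009, 3, 2, 14, 30))

# With truthful timestamps, the heuristic selects the true original.
honest_pick = pick_earliest([original, copy])

# If the copy reports a manipulated, earlier distribution time,
# the same heuristic selects the copy instead.
forged_copy = WebDocument(copy.url, datetime(2009, 2, 28, 8, 0))
manipulated_pick = pick_earliest([original, forged_copy])
```

In this sketch, `honest_pick` is the original document, whereas `manipulated_pick` is the copied document, because the heuristic relies entirely on a self-reported distribution time.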