1. Field
The present disclosure relates to a method, a device, and a computer readable recording medium to provide improved detection of similar documents.
2. Discussion of the Background
As use of the Internet becomes increasingly common, users can obtain various information through an Internet search. That is, a user may input one or more identifiers (e.g., a Uniform Resource Locator (URL), an Internet Protocol (IP) address, and the like) into an address window of a web browser on a terminal device having access to the Internet, so as to access an Internet search site. The user may then input search words and view corresponding search results related to various fields, such as news, knowledge, games, communities, web documents, and the like. The terminal devices may include, without limitation, a personal computer (PC), a smart television (TV), a tablet, and other portable electronics with Internet access.
As such, in order to appropriately display contents that may be sought by the users, an Internet search site provider may generally design and configure a search engine to collect various web documents, to configure indexes for the collected web documents, and to provide search results to users based on the configured indexes. In particular, a web crawler or a search engine (collectively referred to as a web crawler) may serve to search for and collect the web documents existing on the Internet using a systematic and automatic method.
One of the operations of the web crawler may include recognizing one or more hyperlinks referred to and/or included in a URL list, which may be referred to as a seed, updating the URL list with the recognized hyperlinks, and recursively visiting the web documents corresponding to the updated URL list.
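The crawling operation described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular search engine. The names `crawl`, `LinkExtractor`, `fetch`, and `PAGES`, as well as the `example.test` URLs, are hypothetical. The sketch assumes that hyperlinks are found in the documents reached from the seed, that a small in-memory dictionary stands in for the Internet, and that the recursive visiting can be expressed with a queue.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit documents reachable from the seed, updating the URL list as links are found.

    `fetch` is a caller-supplied function (an assumption of this sketch) that
    returns the HTML of a URL as a string, or None if it cannot be retrieved.
    """
    queue = deque(dict.fromkeys(seed_urls))  # URL list, seeded and de-duplicated
    seen = set(queue)                        # every URL ever added to the list
    visited = []                             # URLs whose documents were collected
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:             # update the URL list with new hyperlinks
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "web" used instead of real network access.
PAGES = {
    "http://example.test/a": '<a href="http://example.test/b">b</a>'
                             '<a href="http://example.test/c">c</a>',
    "http://example.test/b": '<a href="http://example.test/a">a</a>',
    "http://example.test/c": '<a href="http://example.test/d">d</a>',
}

collected = crawl(["http://example.test/a"], PAGES.get)
print(collected)
```

Note that this sketch collects every reachable document regardless of its contents, so documents that are similar or the same would all be stored.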
However, among the web documents to be collected, some may have contents that are similar or practically the same. Therefore, even if there is no intent or desire to collect such web documents separately, a web crawler operating as described above may visit and collect the various web documents included in the URL list, including those that may be similar or the same. Capturing duplicate web documents may waste the storage space in which the collected web documents are stored. Further, additional problems, such as degradation in the performance and efficiency of the search engine, may also be incurred.
To solve these problems, a technology of detecting similar web documents (hereinafter referred to simply as similar documents) performs operations such as deleting duplicated documents from the storage space if the web documents are determined to be similar documents, and reducing the collection speed of the paths through which the corresponding documents may be found.
However, the technology of detecting similar documents may determine whether documents are similar to one another based on the sizes of the respective documents. In this case, even though most portions of the web documents may be the same, the web documents may not be considered to be similar documents if the difference in their sizes is above a reference threshold. Conversely, the web documents may be determined to be similar if the difference in their sizes is within or below the reference threshold, even if the web documents are actually different. Accordingly, the technology of detecting similar documents may hinder a normal document collection operation and degrade search quality.
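The two failure modes of a size-based determination can be illustrated with a short sketch. The function `similar_by_size`, the threshold value of 16 bytes, and the sample page strings are all hypothetical and chosen only for illustration. The sketch assumes that document size is measured in UTF-8 bytes and that "within or below the reference threshold" means a size difference less than or equal to the threshold.

```python
def similar_by_size(doc_a: str, doc_b: str, threshold: int) -> bool:
    """Size-only check: treat two documents as similar when their
    byte sizes differ by no more than the reference threshold."""
    size_a = len(doc_a.encode("utf-8"))
    size_b = len(doc_b.encode("utf-8"))
    return abs(size_a - size_b) <= threshold


# Shared non-core content (menus, options, and the like), hypothetical.
BOILERPLATE = "menu options details " * 50

# Different core content, same overall size.
page_1 = "Brand X, product code 1001. " + BOILERPLATE
page_2 = "Brand Y, product code 2002. " + BOILERPLATE

# Same core content as page_1, but with extra trailing content.
page_3 = page_1 + "footer " * 40

THRESHOLD = 16  # hypothetical reference threshold, in bytes

# Different products are reported as similar (sizes are equal).
print(similar_by_size(page_1, page_2, THRESHOLD))  # True

# Mostly identical pages are reported as not similar (sizes differ by 280 bytes).
print(similar_by_size(page_1, page_3, THRESHOLD))  # False
```

Because the check never inspects the contents, neither result reflects whether the documents actually describe the same thing.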
FIG. 1A and FIG. 1B are diagrams illustrating an example of the case in which web documents are determined as similar documents by a technology of detecting similar documents. A problem of the technology of detecting similar documents will be described with reference to FIG. 1A and FIG. 1B.
FIG. 1A and FIG. 1B illustrate web documents, which may have different URLs, that can be collected by a web crawler. Here, description information displayed in region A of FIG. 1A and region A′ of FIG. 1B may be determined to be core portions or portions of interest in the web documents. The description information may include, without limitation, brand names, product codes, and the like. Referring to FIG. 1A and FIG. 1B, although the core portions (regions A and A′) may be determined to be different from one another, the non-core portions (regions B and C) may be considered similar. In an example, the non-core portions may include information such as menus, options, detailed product information, and the like.
However, if the non-core portions (regions B and C) account for a large share of the size of the whole web document, the technology of detecting similar documents may determine that the web documents are similar documents if the difference in the sizes of the web documents is within or below the reference threshold. More specifically, even if the core portions (regions A and A′) are different, the web documents may be considered to be similar documents if the difference in their sizes is within or below the reference threshold. Accordingly, the technology of detecting similar documents may determine different web documents to be similar documents and exclude them from the search results, thereby degrading the search quality.
The above information disclosed in this Background section is only for enhancement of understanding of the background of the invention and therefore it may contain information that does not form any part of the prior art nor what the prior art may suggest to a person of ordinary skill in the art.