1. Field
Embodiments of the present invention relate to a document extraction technique and, in particular, to a spam blog extraction technique.
2. Description of Related Art
Utilization of blogs has been spreading rapidly in recent years. Some persons introduce products or the like in their blogs and participate in affiliate programs of the distributors of those products so as to gain an income. Accordingly, a large number of spam blogs have appeared whose aim is to acquire as many accesses as possible and thereby promote the sales of the products or the like. Some spam blogs are generated by completely copying or partly modifying other blog articles. Alternatively, as shown in Related Art FIG. 1, spam blog articles are generated by an automatic generation tool on the basis of a phrase list and a proper noun list, the latter being a list of proper nouns that attract attention. Since these blog articles aim merely at acquiring accesses, they are, in many cases, articles in which attention-attracting proper nouns are merely scattered and whose meaning cannot be recognized grammatically.
From another point of view, for the purpose of marketing, a technique of analyzing the contents of blog articles and thereby extracting consumer trends and the like has been developed.
An approach to automatic extraction of spam blogs is, for example, to determine a similarity on the basis of the degree to which proper nouns, which have been extracted from a plurality of articles in a blog A that has been confirmed as a spam blog and which have been adopted as a reference, are contained in a plurality of articles in a blog B serving as a judgment target. The reason why a plurality of articles needs to be processed is that, if a single article alone were processed, the similarity could not be calculated appropriately because of variation in the proper nouns.
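The similarity determination described above can be illustrated by a minimal sketch. The sketch assumes that the reference proper nouns have already been extracted from blog A by some separate means, and it measures the "degree" as the fraction of reference proper nouns that appear in at least one article of blog B. The function name `proper_noun_overlap`, the substring matching, and this particular ratio are assumptions made for illustration only; the related art does not specify a formula.

```python
def proper_noun_overlap(reference_nouns, target_articles):
    """Return the fraction of reference proper nouns (taken from confirmed
    spam blog A) that occur in at least one article of target blog B.

    Hypothetical measure: simple substring matching, no normalization.
    """
    reference = set(reference_nouns)
    if not reference:
        return 0.0
    found = {
        noun
        for noun in reference
        if any(noun in article for article in target_articles)
    }
    return len(found) / len(reference)


# Illustrative data only (not taken from any real blog).
blog_a_nouns = ["Acme Phone", "Tokyo Tower", "Alice Smith"]
blog_b_articles = [
    "Acme Phone sale today near Tokyo Tower",
    "An unrelated diary entry about gardening",
]
score = proper_noun_overlap(blog_a_nouns, blog_b_articles)
```

Because the score is computed over all articles of blog B together, a proper noun missing from one article can still be matched in another, which is the reason given above for processing a plurality of articles.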
Here, as a technique relevant to this technique, Japanese Laid-Open Patent Publication No. 2001-282837 discloses a technique for efficiently and accurately collecting only sites that have strong relevance to a particular field. Specifically, a document network is a network of document groups in which documents in various fields are arranged in a distributed manner. A keyword data storage section stores keywords contained in the documents in a particular site. A keyword analysis device analyzes the degree to which the documents in an arbitrary site of the document network contain the keywords stored in the keyword data storage section. Then, on the basis of the analysis result of the keyword analysis device, a field judging unit judges whether the arbitrary site is a site in the particular field.
Similarly, as a relevant technique, Japanese Laid-Open Patent Publication No. 2004-280569 discloses a technique for efficiently extracting sites that have a large amount of information agreeing with a purpose of investigation. Specifically, this system comprises:
a crawler section for patrolling and collecting Web documents among the documents on the Internet and outputting the documents and the document URLs having been collected;
a first degree-of-rumor calculation section for extracting rumor expressions set up in advance from each document collected by the above-mentioned crawler section, then calculating the degree of rumor of each document on the basis of evaluation values corresponding to the extracted rumor expressions, and then outputting the result;
a first site extraction section for extracting a site URL to which each document belongs, from the document URL outputted from the above-mentioned crawler section;
a first site feature calculation section for outputting a site feature indicating the contents feature of the site specified by the above-mentioned site URL, and thereby storing the site URL and the site feature in a correspondence manner to each other into a site management table;
a site selection section for extracting from the above-mentioned site management table a site feature B of a site URL specified by a system user;
a document search section for searching a document on the Internet on the basis of an inputted search condition, and then outputting document information that contains a document URL and an update date as a search result;
a new URL extraction section for referring to a URL management table that stores document information for each document URL, then outputting as a new URL a document URL which is not registered in the above-mentioned URL management table and a document URL whose document information is updated among the document URLs outputted from the above-mentioned document search section, and thereby registering the document information of the new URL into the above-mentioned URL management table;
a download section for acquiring the document of the above-mentioned new URL from the Internet;
a second degree-of-rumor calculation section for extracting rumor expressions set up in advance from each document acquired by the above-mentioned download section, then calculating the degree of rumor of each document on the basis of evaluation values corresponding to the extracted rumor expressions, and then outputting the result;
a second site extraction section for extracting from the above-mentioned new URL a new site URL to which each document belongs;
a second site feature calculation section for outputting a site feature A indicating the contents feature of the site specified by the above-mentioned new site URL; and
a similarity site extraction section for calculating a similarity between the above-mentioned site feature A and the above-mentioned site feature B, then outputting a new site URL having a similarity greater than or equal to a reference value, and then recording the document information of the new site URL into the site management table.
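The comparison performed by the similarity site extraction section can be sketched as follows. The cited publication, as summarized here, does not state how a site feature is represented or how the similarity is computed; the sketch therefore assumes, purely for illustration, that each site feature is a table of term counts and that the similarity is the cosine similarity. The names `cosine_similarity` and `similar_sites` and the example URLs are hypothetical.

```python
import math


def cosine_similarity(feature_a, feature_b):
    """Cosine similarity between two term-count mappings (assumed form
    of a site feature)."""
    shared = feature_a.keys() & feature_b.keys()
    dot = sum(feature_a[t] * feature_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in feature_a.values()))
    norm_b = math.sqrt(sum(v * v for v in feature_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_sites(feature_b, new_site_features, reference_value):
    """Return the new site URLs whose site feature A has a similarity to
    site feature B greater than or equal to the reference value."""
    return [
        url
        for url, feature_a in new_site_features.items()
        if cosine_similarity(feature_a, feature_b) >= reference_value
    ]


# Illustrative data only.
site_feature_b = {"rumor": 2, "phone": 1}
new_sites = {
    "http://a.example": {"rumor": 2, "phone": 1},
    "http://b.example": {"garden": 3},
}
selected = similar_sites(site_feature_b, new_sites, reference_value=0.5)
```

In this sketch the reference value is a free parameter; any other similarity measure could be substituted without changing the thresholding step.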
Further, in spam blogs, articles are automatically generated using currently popular keywords. However, ordinary blogs are also frequently written using currently popular keywords.