The present invention relates to an incorrect hyperlink detecting apparatus and a method of the same, and relates more particularly to an incorrect hyperlink detecting apparatus for detecting a semantic inconsistency of a hyperlink provided to an HTML (Hyper Text Markup Language) file, and a method of the same.
Hyperlinks (hereinafter referred to simply as “links”) are provided among the large number of documents described in HTML form (hereinafter also referred to as “HTML files”) on the WWW (World Wide Web). To provide a link, a file name or an anchor name of a link destination (URL: Uniform Resource Locator) is embedded in a document of a link source. When the link is provided correctly, a web browser accesses the HTML file of the link destination in response to a click operation on the link text and displays the document.
When the link is provided incorrectly, however, an error is displayed, or a completely unrelated document is displayed. The former is called a “logical inconsistency”, and occurs when the embedded file name or anchor name either never existed or existed originally but has since disappeared. The latter is called a “semantic inconsistency”, and occurs when the embedded file name actually exists but is semantically incorrect. Tools that automatically detect the logical inconsistency of a link are widely available, but tools that automatically detect the semantic inconsistency of a link have not yet been provided. The following ideas, however, have already been proposed.
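The distinction between the two kinds of inconsistency can be illustrated with a minimal sketch of a logical-inconsistency check. This is not a tool from any of the cited documents. The names `LinkCollector`, `find_logical_inconsistencies`, and `site`, as well as the sample file names and link texts, are hypothetical and chosen only for illustration. The website is assumed to be given as a mapping from each file name to the set of anchor names it defines.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read
        self._text = []     # text fragments inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_logical_inconsistencies(source_html, site):
    """Return links whose file name or anchor name is absent from `site`.

    `site` is an assumed mapping: file name -> set of anchor names.
    Links of the form "#anchor" (same document) are not handled here.
    """
    collector = LinkCollector()
    collector.feed(source_html)
    collector.close()
    broken = []
    for href, text in collector.links:
        file_name, _, anchor = href.partition("#")
        if file_name not in site:
            broken.append((href, text))          # file does not exist
        elif anchor and anchor not in site[file_name]:
            broken.append((href, text))          # anchor does not exist
    return broken


# Hypothetical site and link source document.
site = {"config.html": {"mode"}, "install.html": set()}
source = (
    '<a href="config.html#mode">Configuration mode</a> '
    '<a href="missing.html">Removed page</a> '
    '<a href="config.html#gone">Old anchor</a> '
    '<a href="install.html">Configuration mode</a>'
)
broken = find_logical_inconsistencies(source, site)
print(broken)
```

In this sketch the second and third links are reported, because their file name or anchor name is not present. The fourth link is not reported even though its link text suggests a different destination: the file exists, so an existence check alone cannot detect a semantic inconsistency.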
Japanese Unexamined Patent Publication (Kokai) No. 2004-220193 (Patent Document 1 below) discloses an HTML link examination system, which can examine whether or not an actual HTML site satisfies a site configuration with a link of an HTML file, intended by an implementer of the HTML site, and easily verify whether or not a link from an object that is particularly provided for the movement from one URL to another URL is correctly provided (refer to [Object] in [Abstract]). This system includes site configuration management means for managing in advance the relation through the link among the HTML files in the web, regarding the website which is composed of a plurality of HTML files created with the hypertext language; image link management means for managing related information on the HTML files of the link source and the link destination, regarding the link provided by a predetermined object utilized in the website; link information extracting means for extracting link information in the HTML; and link examination means for examining whether or not a link provided by an image meets the configuration managed by the site configuration management means (refer to [Solution] in [Abstract]).
In this system, however, in order to examine the link, the site configuration with the link of the HTML file which is intended by the implementer of the HTML site must be registered in advance.
Additionally, Japanese Unexamined Patent Publication (Kokai) No. 2004-139304 (Patent Document 2 below) discloses a hypertext test apparatus, which is applied to a hypertext database, and automatically finds a logically inconsistent (corresponding to the “semantic inconsistency” as used in the present invention) link portion and a correction candidate for it, and corrects the link (refer to [Object] in [Abstract]). Information collecting means collects information on the pages and links which configure a hypertext from the hypertext database, and stores it in an information storage unit. Condition determining means groups pieces of link information for every item with reference to the information storage unit, and extracts a unique link out of the group as a link inconsistency. Candidate calculating means calculates a correction candidate which makes the link information of the unique link extracted by the condition determining means the same as the link information of the group. Correction reflecting means updates the hypertext database on the basis of the portion of the link inconsistency detected by the condition determining means and the correction candidate calculated by the candidate calculating means (refer to [Solution] in [Abstract]). The condition determining means extracts, from the information storage unit, a link in which a word included in a link source description is not included in a title, a header, or a highlighted character string in a link destination document, and gives a mismatch score thereto (refer to paragraph [0095]). In addition, the condition determining means divides the link source description of the link stored in the information storage unit into words. Methods of dividing the link source description into words include using a morpheme analysis, dividing it where the character type changes, dividing it every n characters, and the like (refer to paragraph [0134]).
This apparatus checks only in one direction, from the link source to the link destination, and judges a link to be incorrect only when a word included in the link source description is not included in the link destination description. For that reason, when, for example, the link source description is “People Finder portrait configuration” and the link destination description is “People/finder configuration mode”, the apparatus cannot judge the link to be incorrect, because the word “configuration” included in the link source description is also included in the link destination description. Moreover, although this apparatus is applicable to a large-scale target with many-to-many link relations, it is inapplicable to a small-scale target with only one-to-one link relations, because the pieces of information on the link source or the link destination cannot be grouped.
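The limitation described above can be reproduced with a minimal sketch of a one-directional word check. The sketch assumes that a link is flagged only when none of the words of the link source description appears in the link destination description. That reading, the simple alphanumeric tokenizer, and the function names `words` and `one_way_mismatch` are assumptions made for illustration and are not taken from Patent Document 2.

```python
import re


def words(description):
    """Lower-cased alphanumeric words of a description (a simplifying assumption)."""
    return {w.lower() for w in re.findall(r"[A-Za-z0-9]+", description)}


def one_way_mismatch(source_description, destination_description):
    """Flag the link only when no source word appears in the destination."""
    return not (words(source_description) & words(destination_description))


source_description = "People Finder portrait configuration"
destination_description = "People/finder configuration mode"

flagged = one_way_mismatch(source_description, destination_description)
unmatched = words(source_description) - words(destination_description)
print(flagged)     # False: shared words such as "configuration" hide the mismatch
print(unmatched)   # {'portrait'}
```

The link is not flagged because the two descriptions share words, although the word “portrait” in the link source description has no counterpart in the link destination description.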
Moreover, Japanese Unexamined Patent Publication (Kokai) No. 2005-173671 (Patent Document 3 below) discloses a link diagnostic system, which automatically detects the logical inconsistency (corresponding to the “semantic inconsistency” as used in the present invention) of a link of a hypertext (refer to [Object] in [Abstract]). Link feature extracting means extracts, as link feature information indicating the likelihood of a logical (semantic) inconsistency of the link, (A) link feature information that can be obtained from the link itself, (B) link feature information that can be obtained on the basis of a relation between the link and the document data of the link destination, or (C) link feature information that can be obtained when the links are grouped according to a predetermined condition. Inconsistency learning means obtains a discriminant function by statistically calculating a relation between the content of each piece of link feature information and the rate of links judged to be inconsistent. Inconsistency determination means determines whether or not a link of a determination target is inconsistent, using the link feature information of the undetermined link and the discriminant function calculated by the inconsistency learning means (refer to [Solution] in [Abstract]). This system also has a problem similar to that of the aforementioned hypertext test apparatus.