1. Field of the Invention
The present invention relates to a method and apparatus for automatic information filtering, by which inappropriate or harmful information, such as pornographic images, among the various information provided through the Internet is identified and blocked from being presented.
2. Description of the Background Art
In conjunction with the rapid spread of the Internet, computers that in the past were tools used only by a limited number of specialists are now being introduced even into ordinary homes and schools. For this reason, even ordinary persons who have no prior experience with computers can now access the Internet very easily. Against this background, a serious problem that has arisen in recent years is access by children to harmful information, such as pornographic images, that abounds on the Internet. In order to cope with this problem, in the United States, a bill called the “Communications Decency Act”, which would allow a governmental organization to censor information on the Internet, was proposed, but it failed to stand as law because of the Supreme Court's decision that this act violates the Constitution, which guarantees the freedom of expression.
This has recently prompted much interest in the technique called “information filtering”. Information filtering is a technique by which the harmfulness of information on the Internet that a user attempts to access is checked at the time of the access, and the access to this information is blocked by some means when this information is judged to be harmful.
Methods employed by currently commercially available harmful information filtering software can be largely classified into the following four.
(1) Filtering based on self-rating
(2) Filtering based on third party rating
(3) Automatic filtering
(4) Method using scores (points) assigned to words
In the following, each of these four methods will be briefly described.
First, in the filtering scheme based on self-rating, the WWW information provider rates the harmfulness of its own content and labels each HTML (HyperText Markup Language) file with the result of the self-rating. The filtering software refers to this self-rating result label of each HTML file and blocks access to any HTML file that is labelled as harmful. FIG. 1 shows an outline of this filtering scheme.
The filtering based on self-rating shown in FIG. 1 utilizes the standard for labelling Internet content called PICS (Platform for Internet Content Selection), which was created by the World Wide Web Consortium at the Massachusetts Institute of Technology. Using PICS, the content provider can easily label the information it provides and disclose such a label.
In many cases, the information provider wishing to disclose such rating results utilizes a service of a rating organization that provides rating results based on PICS. The most representative of such rating organizations include the Recreational Software Advisory Council (RSAC) and SafeSurf, each of which provides rating results based on its own independently created standard. The information provider labels a header of each HTML file according to the rating result obtained from such a rating organization. FIG. 2 shows an exemplary labelling based on the rating result.
This self-rating presently relies on the voluntary initiative of the content provider. For this reason, it can be said that effective harmful information filtering according to this scheme is impossible unless many content providers show their willingness to utilize the ratings of this scheme.
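The label check described above can be sketched as follows. This is a minimal illustration only, assuming a simplified PICS-style label carried in a meta tag of the HTML header; the sample label, the rating service URL, the category letters and the function names are all hypothetical and are not taken from FIG. 2 or from any rating organization's actual standard.

```python
import re
from html.parser import HTMLParser


class PicsLabelParser(HTMLParser):
    """Collects the content of <meta http-equiv="PICS-Label"> tags."""

    def __init__(self):
        super().__init__()
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() == "pics-label":
            self.labels.append(attributes.get("content") or "")


def extract_ratings(html):
    """Return {category: value} read from the 'r (...)' part of each label."""
    parser = PicsLabelParser()
    parser.feed(html)
    ratings = {}
    for label in parser.labels:
        # Simplified assumption: ratings appear as "r (name value name value ...)".
        match = re.search(r"\br\s*\(([^()]*)\)", label)
        if match:
            tokens = match.group(1).split()
            for category, value in zip(tokens[0::2], tokens[1::2]):
                ratings[category] = int(value)
    return ratings


def is_blocked_by_label(html, limits):
    """Block when any self-rated category exceeds the user's limit."""
    ratings = extract_ratings(html)
    return any(ratings.get(category, 0) > limit
               for category, limit in limits.items())


# Hypothetical labelled page and an unlabelled page.
LABELLED_PAGE = """<html><head>
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://ratings.example/v1" l r (n 3 s 4 v 0 l 1))'>
</head><body>...</body></html>"""
UNLABELLED_PAGE = "<html><head></head><body>...</body></html>"
```

Note that in this sketch a page without any label is never blocked, which illustrates why the scheme depends on content providers labelling their pages voluntarily.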
Next, the filtering based on third party rating will be described. There are some developers of harmful information filtering software who adopt a scheme of independently rating the harmfulness of home pages (web sites) on the WWW and setting their own rating results as the rating standard of the filtering software. In general, a list of URLs (Uniform Resource Locators) of harmful home pages is constructed as a result of this rating. This URL list is distributed to users along with the filtering software, and is utilized as the rating standard of the filtering software. In many cases, the filtering software incorporates a mechanism for periodically downloading this harmful URL list. FIG. 3 shows an outline of the harmful information filtering based on third party rating.
A representative example of software having such a mechanism is CyberPatrol. CyberPatrol has a harmful URL list for each one of thirteen categories including “violence” and “sexuality”, and carries out the harmful information filtering according to these harmful URL lists.
The harmful URL list used in this scheme is created and updated by each software developer, who actually accesses and rates each home page, so that the list cannot cover newly produced home pages or those home pages that have moved from the original URLs to different URLs. Consequently, it is presently impossible to filter those home pages that have not been targets of the rating.
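The URL list lookup described above can be illustrated with a short sketch. The category names follow the two mentioned in the text, while the URLs and the function name are hypothetical and do not reflect any actual product's list or list format.

```python
# Hypothetical harmful URL lists, one set of URLs per category.
HARMFUL_URLS = {
    "violence": {"http://bad.example/fight.html"},
    "sexuality": {"http://adult.example/index.html"},
}


def blocked_category(url, url_lists=HARMFUL_URLS):
    """Return the category whose list contains the URL, or None if unlisted."""
    for category, urls in url_lists.items():
        if url in urls:
            return category
    return None


# A listed page is blocked; the same page moved to a new URL is not.
listed = blocked_category("http://adult.example/index.html")
moved = blocked_category("http://adult.example/new/index.html")
```

In this sketch the moved page returns no category, which corresponds to the limitation noted above: pages that were never rated at their current URL pass through the filter.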
Next, the automatic filtering will be described. There is some filtering software which checks the content of the accessed home page and judges the harmfulness of the accessed home page. Such an idea was already introduced in early filtering software. For example, there was software which prohibits access to a URL that contains character strings such as “sex” or “xxx”. More specifically, harmful information, i.e., words that may potentially be contained in inappropriate information, is registered in advance, whether such a registered word appears in the accessed information or not is checked, and then the presentation of this information is blocked in the case where a registered word is contained in that information. As a variation of this scheme, there is also a scheme which blocks the presentation of the information in the case where the rate at which the registered words are contained in the information exceeds a prescribed threshold.
Some software for verifying the contents of home pages has also been developed. One such software for carrying out the automatic filtering is CyberSITTER. This software realizes the filtering by a scheme in which the accessed page is outputted after removing harmful words contained in that page.
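The registered-word check and its rate-based variation can be sketched as follows. This is an illustration under stated assumptions: the registered word set is hypothetical, text is split into lowercase alphanumeric tokens, and the "rate" is interpreted as the fraction of tokens that are registered words, which the text does not define precisely.

```python
import re

# Hypothetical registered words; a real product would use its own list.
REGISTERED_WORDS = {"sex", "xxx"}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_registered_word(text, words=REGISTERED_WORDS):
    """Basic scheme: block if any registered word appears as a token."""
    return any(token in words for token in tokenize(text))


def registered_word_rate(text, words=REGISTERED_WORDS):
    """Assumed definition: fraction of tokens that are registered words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(token in words for token in tokens) / len(tokens)


def block_by_rate(text, threshold, words=REGISTERED_WORDS):
    """Variation: block only if the rate exceeds a prescribed threshold."""
    return registered_word_rate(text, words) > threshold
```

For instance, "a study of sex differences in bird song" has eight tokens, one of which is registered, so the basic scheme would block it while the rate-based variation with a threshold of 0.2 would not.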
However, this scheme is associated with the following two problems. First, there is a problem regarding the processing time required in carrying out this automatic rating. In this type of processing, the required processing time is on the order of several milliseconds, which is not terribly long, but there is an undeniable possibility that even such a short processing time may cause some frustration to users.
Another problem is the precision of the automatic filtering. In the case of adopting a rating algorithm which judges harmfulness in units of words, the possibility of blocking many harmless pages is high. In fact, there is a report of an undesirable incident in which a home page related to a British town of “Sussex” was blocked. Moreover, in the case of carrying out the automatic filtering by paying attention only to text information within the page, there is also a problem that it is impossible to block a page on which only an image is displayed.
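Both precision problems can be reproduced with a very small sketch. It assumes, as one plausible explanation that the text itself does not confirm, that the reported "Sussex" block resulted from character-string matching; the string list and function name are hypothetical.

```python
def blocks_by_substring(text, strings=("sex", "xxx")):
    """Block if any registered character string occurs anywhere in the text."""
    lowered = text.lower()
    return any(s in lowered for s in strings)


# A harmless page is blocked because "Sussex" contains "sex".
harmless_blocked = blocks_by_substring("Welcome to Sussex tourist information")

# An image-only page yields no text, so a text-only filter never blocks it.
image_only_blocked = blocks_by_substring("")
```

The first call returns True and the second returns False, showing a false block on a harmless page and a missed block on a page without text.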
Next, the method using scores (points) assigned to words will be described. In this method, words that may potentially be contained in inappropriate information and scores for these words are registered in advance, and whether such a registered word appears in the accessed information or not is checked. Then, the total score of the words that appear in that information is calculated, and the presentation of this information is blocked in the case where the calculated total score exceeds a prescribed threshold.
However, in this method, the setting of the registered words and their scores is ad hoc, so that there has been no known policy regarding what kind of setting is most convenient to users. For this reason, there has been a performance problem in that information that should have been blocked cannot be blocked, or information that should not have been blocked is blocked.
For example, suppose that a phrase “high school girl” is registered with a score of 40 under the assumption that this phrase often appears in pornographic information in general. Then, an expression “sample images of a high school girl for free” will have a total score of 40 as it contains the phrase “high school girl”. However, another expression “a car accident in Hokkaido involving high school girls on a bus” will also have a total score of 40, so that these two expressions have the same score. If the threshold is set at 20, there arises a problem that the latter expression will also be blocked even though there is no need to block this expression. On the other hand, if the threshold is set at 50, there arises another problem that the former expression will not be blocked even though it should be blocked. In order to distinguish these two expressions, it is necessary to also set scores for other words such as “sample”, “image”, “free”, “bus”, “Hokkaido”, “accident”, etc., but these are words that are often used ordinarily, so that what score should be set for such words is unclear. Moreover, the performance will be largely affected by the score setting, so that there arises a problem that sufficient performance cannot be achieved in judging whether a given expression is inappropriate or not.
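The worked example above can be reproduced with a short sketch. It assumes that a registered phrase is detected by case-insensitive substring search and is counted once per expression; the function names are hypothetical, while the phrase, its score of 40, and the thresholds of 20 and 50 are those of the example.

```python
# The single registered phrase and score from the example above.
SCORES = {"high school girl": 40}


def total_score(text, scores=SCORES):
    """Sum the scores of registered phrases that occur in the text."""
    lowered = text.lower()
    return sum(score for phrase, score in scores.items() if phrase in lowered)


def is_blocked_by_score(text, threshold, scores=SCORES):
    """Block when the total score exceeds the prescribed threshold."""
    return total_score(text, scores) > threshold


HARMFUL = "sample images of a high school girl for free"
HARMLESS = "a car accident in Hokkaido involving high school girls on a bus"

for threshold in (20, 50):
    print(threshold,
          is_blocked_by_score(HARMFUL, threshold),
          is_blocked_by_score(HARMLESS, threshold))
```

Both expressions score 40, so a threshold of 20 blocks both and a threshold of 50 blocks neither; no threshold separates them unless further words are scored.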
As described, the main goal of automatic information filtering is to increase the rate of blocking inappropriate information while reducing the rate of blocking appropriate information by mistake. Referring to the rate of actually harmful pages among blocked pages as precision and the rate of blocked pages among actually harmful pages as recall, it can be said that the goal of the filtering software is to increase both precision and recall.
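The two measures defined above can be computed as in the following sketch. The page identifiers are hypothetical, and returning 0.0 when no page is blocked or no page is harmful is a convention assumed here rather than one stated in the text.

```python
def precision_recall(blocked, harmful):
    """Return (precision, recall) for sets of blocked and actually harmful pages."""
    blocked, harmful = set(blocked), set(harmful)
    correctly_blocked = blocked & harmful
    # Precision: harmful pages among blocked pages.
    precision = len(correctly_blocked) / len(blocked) if blocked else 0.0
    # Recall: blocked pages among harmful pages.
    recall = len(correctly_blocked) / len(harmful) if harmful else 0.0
    return precision, recall


# Four pages blocked, three pages actually harmful, two in common.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Here two of the four blocked pages are harmful, giving a precision of 0.5, and two of the three harmful pages are blocked, giving a recall of 2/3.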
FIG. 4 shows a table indicating characteristics of three of the conventional methods described above. As can be seen from FIG. 4, each of the conventional schemes described above is superior to the others in some respects but inferior to the others in some other respects, so that there has been a problem that it is presently impossible to obtain sufficient filtering performance by the conventional automatic information filtering.
Also, as described above, the conventional automatic filtering is carried out by paying attention only to text information within a page, so that there has been a problem that it is impossible to block a page that contains very little or no text information and only displays images.