This invention relates generally to methods for rating data for objectionable content. More particularly, it relates to methods for automatically rating and filtering objectionable data on Web pages.
The astronomical growth of the World Wide Web in the last decade has put a wide variety of information at the fingertips of anyone with access to a computer connected to the internet. In particular, parents and teachers have found the internet to be a rich educational tool for children, allowing them to conduct research that would in the past have either been impossible or taken far too long to be feasible. In addition to valuable information, however, children also have access to offensive or inappropriate information, including violence, pornography, and hate-motivated speech. Because the World Wide Web is inherently a forum for unrestricted content from any source, censoring material that some find objectionable is an unacceptable solution.
Voluntary user-based solutions have been developed for implementation with a Web browser on a client computer. The browser determines whether or not to display a document by applying a set of user-specified criteria. For example, the browser may have access to a list of excluded sites or included sites, provided by a commercial service or a parent or educator. Users can also choose to receive documents only through a Web proxy server, which compares the requested document with an exclusion or inclusion list before sending it to the client computer. Because new content is continually being added to the World Wide Web, however, it is virtually impossible to maintain a current list of inappropriate sites. Limiting the user to a list of included sites might be appropriate for corporate environments, but not for educational ones in which the internet is used for research purposes.
The Recreational Software Advisory Council (RSAC) has developed an objective content-rating and labeling system for Web sites, called RSAC on the Internet (RSACi). The system produces ratings tags that are compliant with the Platform for Internet Content Selection (PICS) tag system already in place, and that can easily be incorporated into existing HTML documents. The RSACi labels rate content on a scale of zero to four in four categories: violence, nudity, sex, and language. Current Web browsers are designed to read the RSACi tags and determine whether or not to display the document based on content levels the user sets for each of the four categories. The user can also set the browser not to display pages without a rating.
While the RSACi rating system is a good beginning, it has three significant limitations. First, it is a voluntary system and is effective only if widely implemented. There is some incentive for the site creator to assign a rating, even if a zero rating, because some users choose not to display sites without a rating. If the site's creator does not include a rating, it can be generated by an outside source. However, the rate at which content is being added to the Web makes it virtually impossible for a third party to rate every new Web site manually.
Second, while the RSACi rating aims to be objective, it is subject to some amount of discretion of the person doing the rating. At its Web site (http://www.rsac.org), RSAC provides a detailed questionnaire for providing the rating, but the user can easily override or adjust the results.
Finally, there is currently no way to rate dynamically created documents. For example, search engines receive a user query, find applicable documents, and create a search result page listing a number of the located documents. The search result page typically includes a title and short abstract or extract, along with the URL, for each retrieved document. The result page itself might have objectionable content, and currently the only way to address this problem is for browsers not to display search result pages at all. Without search engines, though, internet research is significantly limited.
A further problem with all of the above solutions, as well as with word-screening or phrase-screening systems, is that they either allow or deny access to entire Web pages. Even if only a small portion of the document is objectionable, the user is prohibited from seeing the entire document. This is especially significant in search result pages, in which one offensive site prevents display of all of the other, unrelated sites.
The situation becomes even more complex when Web pages include non-text data, for example, audio or images. Surrounding text does not always indicate the content of the embedded file, allowing offensive audio or image material to slip through the ratings system. Occasionally, people deliberately mislabel offensive audio or image files in order to mislead monitoring services.
There is a need, therefore, for an automatic rating method for all material available on the World Wide Web, including dynamically created material, that allows greater viewer control over what material is displayed or blocked.
Accordingly, it is a primary object of the present invention to provide a method for automatically rating a data file, for example, a Web page, for objectionable content.
It is an additional object of the invention to provide an objective rating method that requires no subjective human input after the system is initially devised.
It is a further object of the present invention to provide a method for automatically rating dynamically created documents as they are being created.
It is yet another object of the present invention to provide a rating and filtering method that blocks objectionable content of a file while allowing access to remaining inoffensive portions of the file.
It is an additional object of the present invention to provide a method that can be used with any type of data file, including text, audio, and image.
It is a further object to provide a method for rating and filtering data files that can be implemented on a client, server, or proxy server, and can therefore be easily incorporated into existing system architectures.
Finally, it is an object of the present invention to provide an automatic rating method that works with existing manual rating methods and requires minimal system changes.
These objects and advantages are attained by a computer-implemented method for rating a raw data file for objectionable content. The method occurs in a distributed computer system and comprises the steps of preprocessing the raw data file to create semantic units representative of the semantic content of the raw data file, comparing the semantic units with a rating repository comprising semantic entries and corresponding ratings, assigning content rating vectors to the semantic units, and creating a modified data file incorporating rating information derived from the content rating vectors. After the modified data file is created, either all, some, or none of the file will be displayed by a browser to a user at a client computer.
The method works with any type of data file that can be converted to semantic units. Embodiments of the preprocessing step vary with the type of raw data file to be rated. In one embodiment, a text-only HTML document is stripped of its tags and is then parsed into semantic units, for example, words or phrases. In an alternate embodiment, the data file is an audio file, and text data is created from the audio file using standard voice recognition software. The system also creates an audio-to-text correlation between a location in the created text data and a corresponding location in the audio file. The text file is then parsed into semantic units. In a further embodiment, image processing software is used to identify semantic units within an image file. The semantic units of an image file are discrete objects in regions within the image file.
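The text-only HTML embodiment of the preprocessing step can be illustrated with a minimal sketch. The function name preprocess_html, the choice of single lowercase words as semantic units, and the tokenizing pattern are assumptions made for illustration; the source does not prescribe a particular parser or tokenizer, and this sketch does not treat script or style content specially.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def preprocess_html(raw_html):
    """Strip tags from an HTML string and parse the text into semantic units.

    Assumption: a semantic unit is a single lowercase word. Phrases could be
    built from the returned list in a later step.
    """
    extractor = _TextExtractor()
    extractor.feed(raw_html)
    extractor.close()  # flush any text still buffered by the parser
    text = " ".join(extractor.chunks)
    return re.findall(r"[a-z']+", text.lower())


if __name__ == "__main__":
    print(preprocess_html("<p>Hello <b>World</b></p>"))
```

An audio or image embodiment would replace this function with voice recognition or image processing software, respectively, while leaving the later comparison steps unchanged.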
The rating repository used depends on the type of file and related semantic units. For text files, the repository contains entries of words or phrases with corresponding content rating vectors. Each word entry in the repository may have numerous associated content rating vectors for different contexts in which the word is used, determined by surrounding words in the text. Audio files use a similar rating repository, but may include additional entries for sounds. The entries for image files are discrete objects that can be identified by the image processing software. Each discrete object has one or more content rating vectors associated with it. To assign content rating vectors to semantic units, the system first searches the rating repository for an entry equivalent to the semantic unit. If it finds no such entry, it assigns the semantic unit a zero content rating vector. If it does find an entry, it assigns the semantic unit the entry's corresponding content rating vector. If the entry has numerous content rating vectors, the system analyzes surrounding semantic units to determine the appropriate context before assigning a content rating vector.
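The lookup procedure for text can be sketched as follows. Everything concrete here is an assumption for illustration: the four-component vector layout (violence, nudity, sex, language, borrowed from the RSACi categories), the sample repository entries and their ratings, the cue-word representation of context, and the two-unit context window.

```python
# Hypothetical vector layout: (violence, nudity, sex, language).
ZERO_VECTOR = (0, 0, 0, 0)

# Hypothetical repository. Each entry has a default vector and optional
# context-specific vectors selected by cue words found near the unit.
REPOSITORY = {
    "shoot": {
        "default": (2, 0, 0, 0),
        "contexts": [({"photo", "camera", "basketball"}, (0, 0, 0, 0))],
    },
    "damn": {"default": (0, 0, 0, 1), "contexts": []},
}


def assign_vectors(units, window=2):
    """Return a list of (semantic unit, content rating vector) pairs."""
    rated = []
    for i, unit in enumerate(units):
        entry = REPOSITORY.get(unit)
        if entry is None:
            # No equivalent entry: assign the zero content rating vector.
            rated.append((unit, ZERO_VECTOR))
            continue
        # Surrounding semantic units used to determine the context.
        neighbors = set(units[max(0, i - window):i] + units[i + 1:i + 1 + window])
        vector = entry["default"]
        for cue_words, context_vector in entry["contexts"]:
            if cue_words & neighbors:
                vector = context_vector
                break
        rated.append((unit, vector))
    return rated


if __name__ == "__main__":
    print(assign_vectors(["shoot", "a", "photo"]))
    print(assign_vectors(["shoot", "him"]))
```

In this sketch the same word receives a different vector depending on its neighbors, which is one simple way to realize the context analysis described above.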
In a first preferred embodiment of the invention, a composite content rating vector, comprising a set of components, is derived from the content rating vectors. Each component of the composite content rating vector is derived from corresponding components of the content rating vectors. In one embodiment, each component of the composite content rating vector is a weighted average of the corresponding components of the content rating vectors, wherein the weighted average uses weighting factors related to the value of the components of the content rating vectors. In an alternate embodiment, each component of the composite content rating vector is equal to a selected value of the corresponding components of the content rating vectors. The selected value is the highest of the corresponding components and has at least a predetermined minimum number of occurrences. Many other methods for deriving the composite content rating vector can be used. The composite content rating vector is combined with the raw data file to produce a modified data file containing the composite content rating vector.
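The two derivations named above can be sketched as follows. The weighting factor of one plus the component value, the default minimum of two occurrences, and the fallback to zero when no value occurs often enough are all assumptions chosen for illustration; the source states only that the weights relate to the component values and that a predetermined minimum number of occurrences is required.

```python
from collections import Counter


def weighted_average_composite(vectors):
    """Per-component weighted average, weighting higher ratings more heavily.

    Assumption: the weighting factor for a component value v is 1 + v.
    """
    composite = []
    for values in zip(*vectors):
        weights = [1 + v for v in values]
        total = sum(w * v for w, v in zip(weights, values))
        composite.append(total / sum(weights))
    return tuple(composite)


def highest_frequent_composite(vectors, min_occurrences=2):
    """Per component, the highest value occurring at least min_occurrences times.

    Assumption: if no value occurs often enough, the component is set to 0.
    """
    composite = []
    for values in zip(*vectors):
        counts = Counter(values)
        qualifying = [v for v, n in counts.items() if n >= min_occurrences]
        composite.append(max(qualifying, default=0))
    return tuple(composite)


if __name__ == "__main__":
    sample = [(0, 0, 0, 0), (0, 0, 0, 0), (4, 0, 0, 1), (2, 0, 0, 1)]
    print(weighted_average_composite(sample))
    print(highest_frequent_composite(sample))
```

With the sample vectors, the weighted average lets a single high rating raise the first component, while the second function ignores that rating because it occurs only once.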
In a second preferred embodiment, termed filtering, the content rating vectors are compared with preset user limit values that define objectionable content rating vectors to identify objectionable semantic units. Objectionable content corresponding to the identified objectionable semantic units is then replaced by display blocks in a copy of the raw data file to produce a modified data file. Filtering can be performed on files including text, audio, or image. In a text-only data file, objectionable words or phrases are replaced with, for example, spaces, black rectangles, or a predetermined phrase. In an audio file, objectionable portions that correspond to the objectionable semantic units are located using the audio-to-text correlation. The objectionable portions are replaced with audio blanking signals, for example a tone or silent space, in a copy of the audio file to produce a modified audio file. Similarly, objectionable discrete objects of image files are identified by comparing content rating vectors with preset user limit values. Content corresponding to the objectionable discrete object is replaced by image blocks, which may be black rectangles or blurred regions. In an alternate embodiment of the invention, after the objectionable content is replaced, the system derives a modified composite content rating vector for the modified data file from a modified set of content rating vectors. The modified set of content rating vectors does not contain content rating vectors corresponding to the objectionable semantic units.
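Filtering of a text-only file can be sketched as follows. The rule that a semantic unit is objectionable when any component of its vector exceeds the corresponding user limit, the "[blocked]" replacement phrase, and the function name filter_units are assumptions for illustration.

```python
def filter_units(rated_units, user_limits, block="[blocked]"):
    """Replace objectionable semantic units with a display block.

    rated_units: list of (semantic unit, content rating vector) pairs.
    user_limits: preset user limit values, one per vector component.
    Assumption: a unit is objectionable if any component exceeds its limit.

    Returns the filtered text and the modified set of content rating
    vectors, which omits the vectors of the objectionable units.
    """
    output = []
    kept_vectors = []
    for unit, vector in rated_units:
        if any(v > limit for v, limit in zip(vector, user_limits)):
            output.append(block)
        else:
            output.append(unit)
            kept_vectors.append(vector)
    return " ".join(output), kept_vectors


if __name__ == "__main__":
    rated = [("what", (0, 0, 0, 0)), ("the", (0, 0, 0, 0)), ("damn", (0, 0, 0, 2))]
    print(filter_units(rated, user_limits=(4, 4, 4, 1)))
```

The second return value corresponds to the modified set of content rating vectors, from which a modified composite content rating vector could then be derived.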
The method can be implemented using many different architectures. In all architectures, the raw data file is stored in a server and the preset user limit values are stored in a client. All embodiments of the method can be implemented in a server, proxy server, or client. As is necessary, the server or proxy server obtains the preset user limit values from the client, and the proxy server and client obtain the raw data file from the server.