1. Field of the Invention
The present invention relates to the field of data processing. More specifically, the present invention relates to automated methods and systems for determining a rating for a rating scale for a collection of documents.
2. Background Information
The World Wide Web (WWW) is an expanding collection of textual and non-textual material which is available for access by any Internet user, from any location at any time. Some users find particular contents to be objectionable. For example, parents often wish to shield their children from exposure to sexually explicit material, hate speech, and drug information. Similarly, companies may wish to prevent access by employees to web sites that provide or support gambling.
Notwithstanding the civil liberty implications associated with these concerns, a number of groups and companies have brought forward systems and techniques for assisting Internet users in blocking access to undesired content. For example, various blocking software products are available from software vendors, such as SafeSurf of Newbury Park, Calif., and NetNanny of Bellevue, Wash. Typically, these products employ site lists to effectuate blocking of access to undesired contents. These site lists include the identifications of the web sites containing undesired contents. Access to any of the web pages hosted by the identified web sites is blocked. Another example of such a system is described by Neilsen et al., “Selective downloading of file types contained in hypertext documents transmitted in a computer controlled network”, U.S. Pat. No. 6,098,102, which utilizes the file extensions of URLs to determine whether the particular files will or will not be downloaded to the user. Still another method for controlling access to web sites is typified by the work of the Internet Content Rating Association, which uses the technology of the Platform for Internet Content Selection (PICS) specification to allow voluntary, or in the future potentially mandatory, rating of page content by the content author. Filtering can then be done by utilizing these rating “tags”, and may be augmented by a complete block on other un-rated pages.
These prior art approaches suffer from at least the following disadvantages:
a) The WWW is constantly growing. The number of web sites and their contents are constantly changing. As a result, the prior art approaches are unable to keep pace with the changes.
b) Further, many web sites generate user-specific pages at every access. As a result, the prior art URL based approaches are unable to facilitate blocking of these dynamically generated pages if they contain undesired contents.
c) Additionally, content providers are often not the best, or even the appropriate, agent for rating their own contents. Duplicitous providers may deliberately mis-rate the appropriateness of their contents.
Some filtering systems rely on key word lists or text analysis to judge the content of individual pages. While these systems may work satisfactorily on text files, they are ineffective for non-text materials, such as images, sound files, or movies.
Thus, an improved approach for blocking undesired contents is desired.