The present invention pertains to methods for scanning and analyzing various kinds of digital information content, including information contained in web pages, email and other types of digital datasets, including multimedia datasets, for detecting specific types of content. As one example, the present invention can be embodied in software for use in conjunction with web browsing software to enable parents and guardians to exercise control over what web pages can be downloaded and viewed by their children.
Users of the World-Wide Web ("Web") have discovered the benefits of simple, low-cost global access to a vast and exponentially growing repository of information, on a huge range of topics. Though the Web is also a delivery medium for interactive computerized applications (such as online airline travel booking systems), a major part of its function is the delivery of information in response to a user's inquiries and ad-hoc exploration, a process known popularly as "surfing the Web."
The content delivered via the Web is logically and semantically organized as "pages": autonomous collections of data delivered as a package upon request. Web pages typically use the HTML language as a core syntax, though other delivery syntaxes are available.
Web pages consist of a regular structure, delineated by alphanumeric commands in HTML, plus potentially included media elements (pictures, movies, sound files, Java programs, etc.). Media elements are usually technically difficult or time-consuming to analyze.
Pages were originally grouped and structured on Web sites for publication; recently, other forms of digital data, such as computer system file directories, have also been made accessible to Web browsing software on both a local and shared basis.
Another discrete organization of information which is analogous to the Web page is an individual email document. The present invention can be applied to analyzing email content as explained later.
The participants in the Web delivery system can be categorized as publishers, who use server software and hardware systems to provide interactive Web pages, and end-users, who use web-browsing client software to access this information. The Internet, tying together computer systems worldwide via interconnected international data networks, enables a global population of the latter to access information made available by the former. In the case of information stored on a local computer system, the publisher and end-user may clearly be the same person, but given shared use of computing resources, this is not always so.
The technologies originally developed for the Web are also being increasingly applied to the local context of the personal computer environment, with Web-browsing software capable of viewing and operating on local files. This patent application is primarily focused on the Web-based environment, but also envisions the applicability of many of the petitioners' techniques to information bound to the desktop context.
End-users of the Web can easily access many dozens of pages during a single session. Following links from search engines, or from serendipitous clicking of the Web links typically bound within Web pages by their authors, users cannot anticipate what information they will next be seeing.
The data encountered by end-users surfing the Web takes many forms. Many parents are concerned about the risk of their children encountering pornographic material online. Such material is widespread. Other forms of content available over the Web create similar concern, including racist material and hate-mongering, information about terrorism and terrorist techniques, promotion of illicit drugs, and so forth. Some users may not be concerned about protecting their children, but rather simply wish themselves not to be inadvertently exposed to offensive content. Other persons have managerial or custodial responsibility for the material accessed or retrieved by others, such as employees; liability concerns often arise from such access.
In view of the foregoing background, one object of the present invention is to enable parents or guardians to exercise some control over the web page content displayed to their children.
Another object of the invention is to provide for automatic screening of web pages or other digital content.
A further object of the invention is to provide for automatic blocking of web pages that likely include pornographic or other offensive content.
A more general object of the invention is to characterize a specific category of information content by example, and then to efficiently and accurately identify instances of that category within a real-time datastream.
A further object of the invention is to support filtering, classifying, tracking and other applications based on real-time identification of instances of particular selected categories of content, with or without displaying that content.
The invention is useful for a variety of applications, including but not limited to blocking digital content, especially world-wide web pages, from being displayed when the content is unsuitable or potentially harmful to the user, or for any other reason that one might want to identify particular web pages based on their content.
According to one aspect of the invention, a method for controlling access to potentially offensive or harmful web pages includes the following steps: First, in conjunction with a web browser client program executing on a digital computer, examining a downloaded web page before the web page is displayed to the user. This examining step includes identifying and analyzing the web page natural language content relative to a predetermined database of words (or, more broadly, regular expressions) to form a rating. The database or "weighting list" includes a list of expressions previously associated with potentially offensive or harmful web pages, for example pornographic pages, and the database includes a relative weighting assigned to each word in the list for use in forming the rating.
The next step is comparing the rating of the downloaded web page to a predetermined threshold rating. The threshold rating can be a default value, or can be selected, for example based on the age or maturity of the user, or other "categorization" of the user, as indicated by a parent or other administrator. If the rating indicates that the downloaded web page is more likely to be offensive or harmful than a web page having the threshold rating, the method calls for blocking the downloaded web page from being displayed to the user. In a presently preferred embodiment, if the downloaded web page is blocked, the method further calls for displaying an alternative web page to the user. Like the threshold rating, the alternative web page can be generated or selected responsive to a predetermined categorization of the user. The alternative web page displayed preferably includes an indication of the reason that the downloaded web page was blocked, and it can also include one or more links to other web pages selected as age-appropriate in view of the categorization of the user. User login and password procedures are used to establish the appropriate protection settings.
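The comparing and blocking steps described above can be sketched as follows. This is a minimal illustration only, not the commercial embodiment: the category names, the threshold values, and the names `THRESHOLDS`, `Decision`, `alternative_page` and `screen` are all assumptions introduced for the example, and the rating is taken as already computed.

```python
from dataclasses import dataclass

# Hypothetical thresholds per user categorization; the values are illustrative only.
THRESHOLDS = {"child": 10.0, "teen": 25.0, "adult": 60.0}
DEFAULT_THRESHOLD = 25.0  # assumed default when no categorization is known


@dataclass
class Decision:
    blocked: bool
    html: str  # the page that would actually be displayed


def alternative_page(category: str, rating: float, threshold: float) -> str:
    """Build a replacement page that states why the original was blocked."""
    return (
        "<html><body><p>This page was blocked: its rating "
        f"({rating:.1f}) exceeds the limit ({threshold:.1f}) "
        f"set for '{category}' users.</p></body></html>"
    )


def screen(page_html: str, rating: float, category: str) -> Decision:
    """Compare a page rating to the user's threshold and block if it is exceeded."""
    threshold = THRESHOLDS.get(category, DEFAULT_THRESHOLD)
    if rating > threshold:
        return Decision(True, alternative_page(category, rating, threshold))
    return Decision(False, page_html)
```

In such a sketch the user's category would be established by the login procedure, and the alternative page could additionally carry links chosen for that category.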
Of course the invention is fully applicable to digital records or datasets other than web pages, for example files, directories and email messages. Screening pornographic web pages is described to illustrate the invention and it reflects a commercially available embodiment of the invention.
Another aspect of the invention is a computer program. It includes first means for identifying natural language textual portions of a web page and forming a list of words or other regular expressions that appear in the web page; a database of predetermined words that are associated with the selected characteristic; second means for querying the database to determine which of the list of words has a match in the database; third means for acquiring a corresponding weight from the database for each such word having a match in the database so as to form a weighted set of terms; and fourth means for calculating a rating for the web page responsive to the weighted set of terms, the calculating means including means for determining and taking into account a total number of natural language words that appear in the identified natural language textual portions of the web page.
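By way of illustration, the four means recited above might be arranged as in the following sketch. The weighting list `WEIGHTS`, its entries, and the helper names are hypothetical. The text does not fix how the total word count enters the calculation; dividing the summed weights by the word count and scaling by 100 is an assumption made here for concreteness.

```python
import re
from html.parser import HTMLParser

# Hypothetical weighting list; real entries and weights would come from training.
WEIGHTS = {"explicit": 8.0, "adult": 3.0, "medical": -4.0}


class TextExtractor(HTMLParser):
    """Collects natural language text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def page_words(html):
    """First means: identify the textual portions and list the words in them."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z']+", " ".join(parser.parts).lower())


def rate_page(html, weights=WEIGHTS):
    """Second to fourth means: look up matches, sum weights, normalize."""
    words = page_words(html)
    if not words:
        return 0.0
    total = sum(weights[w] for w in words if w in weights)
    # Assumed normalization: summed weight per 100 words of page text.
    return 100.0 * total / len(words)
```

Normalizing by the total word count keeps a long page with a few matching words from being rated as high as a short page made up mostly of them.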
As alluded to above, statistical analysis of a web page according to the invention requires a database or attribute set, compiled from words that appear in known "bad" web pages (e.g., pornographic, hate-mongering, racist, terrorist, etc.). The appearance of such words in a downloaded page under examination does not necessarily indicate that the page is "bad", but it increases the probability that such is the case. The statistical analysis requires that a "weighting" be provided for each word or phrase in a word list. The weightings are relative to some neutral value, so the absolute values are unimportant. Preferably, positive weightings are assigned to words or phrases that are more likely to (or even uniquely) appear in the selected type of page such as a pornographic page, while negative weightings are assigned to words or phrases that appear in non-pornographic pages. Thus, when the weightings are summed in calculating a rating of a page, the higher the value the more likely the page meets the selected criterion. If the rating exceeds a selected threshold, the page can be blocked.
A further aspect of the invention is directed to building a database or target attribute set. Briefly, a set of "training datasets" such as web pages are analyzed to form a list of regular expressions. Pages selected as "good" (non-pornographic, for example) and pages selected as "bad" (pornographic) are analyzed, and rate of occurrence data is statistically analyzed to identify the expressions (e.g., natural language words or phrases) that are helpful in discriminating the content to be recognized. These expressions form the target attribute set.
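One simple way to picture the selection of discriminating expressions is sketched below. The statistic used (the difference between the fraction of "bad" pages and the fraction of "good" pages containing a word) and the cutoff `min_gap` are assumptions chosen for illustration; the text does not specify which statistical analysis is applied, and the function names are hypothetical.

```python
import re
from collections import Counter


def doc_frequency(pages):
    """Fraction of pages in which each word appears at least once."""
    counts = Counter()
    for text in pages:
        counts.update(set(re.findall(r"[a-z']+", text.lower())))
    return {w: c / len(pages) for w, c in counts.items()}


def select_attributes(good_pages, bad_pages, min_gap=0.5):
    """Keep words whose rate of occurrence differs sharply between the two sets."""
    good = doc_frequency(good_pages)
    bad = doc_frequency(bad_pages)
    selected = {}
    for word in set(good) | set(bad):
        gap = bad.get(word, 0.0) - good.get(word, 0.0)
        if abs(gap) >= min_gap:
            selected[word] = gap  # positive: typical of "bad"; negative: of "good"
    return selected
```

Words such as "the" that occur at similar rates in both sets have a gap near zero and drop out, since they do not help discriminate.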
Then, a neural network approach is used to assign weightings to each of the listed expressions. This process uses the experience of thousands of examples, like web pages, which are manually designated simply as "yes" or "no" as further explained later.
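The network architecture is not detailed at this point in the text. As a hedged illustration only, the simplest instance of such an approach is a single logistic unit trained by gradient descent on examples labeled "yes" (1) or "no" (0). The function name `train_weights`, the learning rate and the epoch count are assumptions of this sketch.

```python
import math


def train_weights(examples, vocabulary, epochs=200, rate=0.5):
    """Learn one weight per expression from (set_of_words, label) examples.

    label is 1 for a page designated "yes" and 0 for a page designated "no".
    """
    weights = {w: 0.0 for w in vocabulary}
    bias = 0.0
    for _ in range(epochs):
        for words, label in examples:
            active = [w for w in words if w in weights]
            z = bias + sum(weights[w] for w in active)
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = label - predicted
            # Nudge the bias and every active weight toward the correct label.
            bias += rate * error
            for w in active:
                weights[w] += rate * error
    return weights, bias
```

Trained this way, expressions seen mostly in "yes" pages drift to positive weights and those seen mostly in "no" pages drift to negative weights, matching the sign convention of the weighting list.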
Additional objects and advantages of this invention will be apparent from the following detailed description of preferred embodiments thereof which proceeds with reference to the accompanying drawings.