1. Field Of The Invention
The present invention relates to a computer based system and method for filtering data received by a computer system and, in particular, to a computer based system and method for filtering text data from World Wide Web pages received by a computer system connected to the Internet.
2. Prior Art
While there are numerous benefits which accrue from the interconnection of computers or computer systems in a network, such an interconnection presents certain problems as well.
Broadly speaking, networks allow the various computers connected to the network to share information. Typically, if desired, access to certain information is restricted by providing access codes or the like to those individuals who are cleared to view or download the information. While this method of controlling access to information works fairly well for situations where each user is identifiable, it is very difficult to efficiently and effectively implement such a method in cases where there are a large number of unidentifiable users. Such is the situation with the vast interconnection of networks called the Internet.
The Internet is accessed by many millions of users every day. While it is sometimes possible to identify the computer through which a particular user accesses the Internet, it is very difficult, if not impossible, to identify a particular user beyond any self-identification the user provides.
By far, most of the traffic on the Internet currently occurs on the World Wide Web. On the World Wide Web, both text and graphic information is typically provided on web pages and this information is transmitted via the Hyper Text Transfer Protocol ("HTTP"). A web page has a particular address associated with it called a Uniform Resource Locator ("URL").
A typical user accesses the World Wide Web via a modem connection to a proxy/cache server which is connected to the Internet. A browser is the software program which runs on the user's computer (client computer) and allows the user to view web pages. To view a particular web page, the user inputs the URL of the desired web page into his or her browser. The browser sends the request to the proxy/cache server and the server sends the request over the Internet to the computer on which the web page resides. A header as well as a copy of the body of the web page is then sent back to the user's browser and displayed on the user's computer.
While an incredible amount of information is available on the millions of web pages provided on the World Wide Web, some of this information is not appropriate for all users. In particular, although children can be exposed to a vast number of educational and entertaining web pages, many other web pages include adult content which is not appropriate for access by children.
One method which is used to control access to these adult web pages is to require an access code to view or download particular web pages. Typically, this access code is obtained by providing some identification, often in the form of a credit card number. The obvious drawbacks of this method are: 1) such a system will invariably deny or inhibit access to many adults as well as children because many adults do not want to, or may not be able to, provide a credit card number; and 2) the system is not foolproof because children may obtain access to credit cards, whether theirs or their parents'.
Several services are available to parents and educators which provide a second method for preventing access to web pages containing adult content. These services provide software programs which contain a list of forbidden URLs. Service providers compile the list by searching the World Wide Web for web pages having objectionable material. When a user inputs a URL which appears on the forbidden list or "deny list," the program causes a message to be displayed indicating that access to that web page is forbidden. Although this method works well for denying access to web pages which are on the forbidden list, because thousands of web pages are being created and changed every day, it is simply impossible to provide an up-to-date list of every web page containing adult content. Therefore, these systems often allow children access to web pages which contain adult content but have not yet been added to the forbidden list.
A further drawback to the above-described access control systems is that they are simple admit/deny systems. That is, the user is either allowed to download and view the entire web page or he/she is forbidden from doing so. It is not practical, using either of these methods, to allow a particular user to download and view only the portions of the web page which are not objectionable.
The present invention overcomes the disadvantages of the prior art by providing a system and method for restricting access to objectionable or "target" data received by a computer over a network by filtering objectionable data from the data received. The present invention provides for filtering the data as received, so-called "on the fly," so that a newly created web page may be filtered as accurately as one that has been predetermined to contain objectionable material. Because the present invention operates on successive blocks of data (of varying sizes) it can allow portions of the requested page to be displayed as they are received and reviewed, rather than requiring that the entire page be reviewed before being displayed. Thus, the user does not typically perceive a delay when the invention is in use.
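The block-by-block behavior described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the names filter_stream and mask_demo and the placeholder word "badword" are hypothetical, and the sketch ignores words that happen to be split across two blocks.

```python
from typing import Callable, Iterable, Iterator


def filter_stream(blocks: Iterable[str],
                  review: Callable[[str], str]) -> Iterator[str]:
    """Yield each block as soon as it has been reviewed.

    Nothing here waits for the whole page, so earlier blocks can be
    displayed while later ones are still arriving.
    """
    for block in blocks:
        yield review(block)


def mask_demo(block: str) -> str:
    # Hypothetical reviewer: masks one placeholder word with a filler.
    return block.replace("badword", "- - -")


if __name__ == "__main__":
    incoming = ["Welcome to the page. ", "A badword appears here."]
    for reviewed in filter_stream(incoming, mask_demo):
        print(reviewed, end="")
    print()
```

Because filter_stream is a generator, the first block is available to the caller before the second block has been read from the network or cache.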
Although the embodiments of the invention are described below with respect to a system and method for filtering objectionable data from the data received, it should be understood that the present invention can be applied to process any type of target data from the data received. Thus, the present invention may be utilized to process desired data such that, for instance, only Web pages containing desired information are displayed on the user's computer.
In a preferred embodiment, the present invention provides a computer based method for filtering text data from World Wide Web pages which are received by a computer system connected to the Internet. Advantageously, the method of the present invention is carried out by the computer which acts as the proxy/server through which the user's computer is connected to the Internet. However, the method can be carried out by the user's computer as well.
According to the method, if the web page requested by the user contains only a minimum of objectionable or target data, the user receives a portion of the filtered web page for downloading and viewing on his or her computer. If, however, the web page requested contains a large amount of objectionable material, the invention will cause a "forbidden" page to be displayed on the user's computer monitor.
In the preferred embodiment, the request is sequentially filtered at three different levels, if necessary. First, the URL requested is filtered to determine if the web page associated with that URL has been pre-approved or pre-denied. If the URL has not been pre-approved or pre-denied, the header of the web page is then filtered to determine if the web page contains text data (such as HTML). If so, the body of the web page is filtered. Based on the URL alone, the filter can only decide whether or not to block access to the entire web page. Based on its processing of the body of the web page, however, the filter may deny access completely to the web page, deny access to certain portions of the web page (i.e., filter out some objectionable words), or allow complete access to the web page.
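The first two levels can be sketched in Python as a single decision function. This is an illustration under stated assumptions, not the patented method: the names Decision and classify_request are hypothetical, text data is assumed to be signaled by a Content-Type value beginning with "text/", and a non-text page is assumed to pass through unfiltered.

```python
from enum import Enum
from typing import AbstractSet


class Decision(Enum):
    ALLOW = "allow"              # forward the page unmodified
    FORBID = "forbid"            # send a FORBIDDEN page instead
    FILTER_BODY = "filter_body"  # hand the body to the text filter


def classify_request(url: str,
                     content_type: str,
                     allowed: AbstractSet[str],
                     denied: AbstractSet[str]) -> Decision:
    # Level 1: the URL is checked against the pre-approved and
    # pre-denied lists.
    if url in allowed:
        return Decision.ALLOW
    if url in denied:
        return Decision.FORBID
    # Level 2: the header decides whether the body is text at all.
    # Assumption: "text/..." content types mark filterable text.
    if content_type.strip().lower().startswith("text/"):
        return Decision.FILTER_BODY  # Level 3 happens elsewhere
    # Assumption: non-text data is passed through without body filtering.
    return Decision.ALLOW
```

For example, classify_request("http://new.example/", "text/html", set(), set()) returns Decision.FILTER_BODY, since the URL is on neither list and the header indicates text.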
The method of the present invention first compares the requested URL to "allowed lists" which contain URLs of web pages which have been approved for display to the user in particular categories. If the requested URL is found in the allowed lists, the entire associated web page is, accordingly, forwarded to the user for downloading or viewing. If, however, the requested URL is not found in the allowed lists, the requested URL is then compared to "denied lists" (or "forbidden lists"), each of which functions in much the same manner as that of the prior art systems. If the requested URL is found in a forbidden list, a message is transmitted to the user's computer indicating that access to the web page is forbidden (hereinafter referred to as a "FORBIDDEN" page).
If the requested URL is found in neither an allowed list nor a denied list, and if the header indicates that the page contains text data, then the method provides for filtering the text of the web page, as it is either received from the network or read out of cache, to determine if it contains objectionable or target text. If the page contains objectionable text, the method determines what kind of objectionable text (specific words or phrases), how much objectionable text, and the relative groupings of objectionable text.
Depending on the settings of predetermined parameters, one of three things happens when objectionable words or phrases are found: the words or phrases are replaced with an innocuous filler (such as "- - -") before the web page is forwarded to the user's computer; only a "FORBIDDEN" page is forwarded to the user's computer; or a "FORBIDDEN" message is forwarded to the user's computer for a portion of the web page. The settings of the predetermined parameters may be modified by those having access to the computer program through which the computer implements the method, such as the server operator or, perhaps, the user's parent.
Advantageously, the requested URL may first be compared with the denied lists and then compared with the allowed lists.
Optionally, the HTTP header of the web page is filtered after the URL to determine if the page contains text data and, if not, the method does not filter the web page body, since the method for filtering the web page body is only capable of filtering text or other recognizable data patterns.
The method provides for filtering the text of the web page by comparing each "word" (defined by groupings of letter/number characters) in the web page to a "dictionary." The words in the dictionary are periodically updated. The invention is also capable of scanning for and filtering multi-word phrases (e.g., "adult shows"), but in preferred embodiments, such phrases are handled differently than single objectionable words.
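The word-by-word comparison can be sketched in Python as follows. This is a simplified illustration rather than the patented method: the name replace_words is hypothetical, a "word" is assumed to be a run of ASCII letters and digits, matching is assumed to be case-insensitive, and multi-word phrases are not handled.

```python
import re
from typing import AbstractSet

# Assumption: a "word" is a grouping of ASCII letter/number characters.
WORD_RE = re.compile(r"[A-Za-z0-9]+")
FILLER = "- - -"  # innocuous filler, as in the example above


def replace_words(text: str,
                  dictionary: AbstractSet[str],
                  filler: str = FILLER) -> str:
    """Replace every word found in the dictionary with the filler."""
    def substitute(match: "re.Match[str]") -> str:
        word = match.group(0)
        # Assumption: dictionary entries are stored in lower case.
        return filler if word.lower() in dictionary else word

    return WORD_RE.sub(substitute, text)


if __name__ == "__main__":
    demo_dictionary = {"example1"}  # hypothetical placeholder entry
    print(replace_words("An Example1 page, example1!", demo_dictionary))
```

Punctuation and spacing are left untouched because only the matched character runs are substituted.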
Advantageously, each word in the dictionary has a number of variables associated with it, such as: 1) a variable that indicates whether the word, if found, should be replaced with the innocuous filler (or a specific replacement filler word may be indicated); 2) a variable that indicates what category of objectionableness the word belongs to (e.g., pornography, intolerance, crime, job hunting, etc.); 3) a variable that indicates what language the word is a part of (e.g., English, French, Spanish, etc.); 4) a base score variable that indicates how objectionable the word is; and 5) a bonus score variable that indicates whether the word is more objectionable when used in combination with other objectionable words. In this advantageous embodiment, the method provides for filtering the body of the web page by comparing each word in the web page with the words in the dictionary. If a word in the web page matches, then that word will either be replaced or not replaced with the filler, as indicated by the variable. A running score is determined for the entire web page (or block), based on a particular algorithm, as the page is being filtered. If the final score for the page or block of text is above a predetermined threshold score, a "FORBIDDEN" message is forwarded to the user's computer instead of that page or block. Thus, a user may see a portion of a requested web page followed by a message indicating that the next block of text is "FORBIDDEN."
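The per-word variables and the running score can be sketched in Python. The scoring algorithm itself is not specified in this summary, so the rule used here is purely an assumption for illustration: each match adds its base score, and a match that follows an earlier match in the same block also adds its bonus score. The names Entry, score_block, and is_forbidden, and all numeric values, are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Mapping

WORD_RE = re.compile(r"[A-Za-z0-9]+")


@dataclass(frozen=True)
class Entry:
    replace: bool     # 1) replace the word with the filler?
    category: str     # 2) category of objectionableness
    language: str     # 3) language the word belongs to
    base_score: int   # 4) how objectionable the word is on its own
    bonus_score: int  # 5) extra weight when combined with other matches


def score_block(text: str, dictionary: Mapping[str, Entry]) -> int:
    """Return the running score for one block of text (assumed rule)."""
    score = 0
    matches_so_far = 0
    for match in WORD_RE.finditer(text):
        entry = dictionary.get(match.group(0).lower())
        if entry is None:
            continue
        score += entry.base_score
        if matches_so_far > 0:
            # Assumed rule: the bonus applies once another match precedes.
            score += entry.bonus_score
        matches_so_far += 1
    return score


def is_forbidden(text: str, dictionary: Mapping[str, Entry],
                 threshold: int) -> bool:
    # FORBIDDEN only when the score is strictly above the threshold.
    return score_block(text, dictionary) > threshold
```

With two placeholder entries, "alpha" (base 10, bonus 5) and "beta" (base 20, bonus 15), the block "alpha and beta" scores 10 + 20 + 15 = 45 under this assumed rule, while "alpha" alone scores 10.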
The system of the present invention comprises a general purpose computer which is programmed to implement the method of the present invention. The computer of such a system is typically the proxy/cache server computer but it may also be the client computer or another computer.
While the preferred embodiment of the present invention is described in summary form above as applied to the filtering of web pages received over the Internet, it will be appreciated by one of ordinary skill in the art that this method is also applicable to filtering of data received by a computer from any network, including an intranet. Further, in addition to reviewing HTTP information received from the Internet, the present invention may be applied to review information posted to forms-based pages such as search engines, surveys, guest books, and the like. If the words or phrases would yield objectionable results, the invention will prevent posting of the data to the remote HTTP server.
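The review of posted form data can be sketched in Python with the standard library's urllib.parse.parse_qs. This is a hedged illustration, not the patented method: the name may_post is hypothetical, the form body is assumed to be URL-encoded, and the test for "objectionable results" is assumed to be a simple match of any submitted word against a set of blocked words.

```python
import re
from typing import AbstractSet
from urllib.parse import parse_qs

WORD_RE = re.compile(r"[A-Za-z0-9]+")


def may_post(form_body: str, blocked_words: AbstractSet[str]) -> bool:
    """Return False if any submitted field contains a blocked word."""
    # parse_qs decodes "q=hello+world" into {"q": ["hello world"]}.
    fields = parse_qs(form_body, keep_blank_values=True)
    for values in fields.values():
        for value in values:
            for word in WORD_RE.findall(value):
                if word.lower() in blocked_words:
                    return False  # do not forward to the remote server
    return True


if __name__ == "__main__":
    blocked = {"blockedterm"}  # hypothetical placeholder entry
    print(may_post("q=hello+world&lang=en", blocked))
    print(may_post("q=hello+BlockedTerm", blocked))
```

A proxy using such a check would forward the request to the remote HTTP server only when may_post returns True.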
Other objects, features, and advantages of the present invention will be set forth in, or will become apparent from, the detailed description of the preferred embodiments of the invention which follows.