Today, various content filtering mechanisms are available to entities to manage and/or control user access to the Internet via facilities provided by the entities. For example, a company typically implements some form of content filtering mechanism to control the use of the company's computers and/or servers to access the Internet. Access to content within certain predetermined categories using the company's computers and/or servers may not be allowed during some predetermined periods of time.
A typical content filtering client, which typically resides in a firewall, sends a request for the content rating of each web page browsed. The content rating requests are routed to a separate content rating server. When the content rating server receives a request, the content rating server retrieves the content rating of the requested web page from a database and sends the content rating to the content filtering client.
Based on the content rating retrieved, the content filtering client determines whether the user is allowed to access the web page. If the user is allowed, the content filtering client passes the web page. Otherwise, the content filtering client blocks the web page.
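The request and decision flow described above can be sketched as follows. This is a minimal illustration only, not an actual implementation: the class names `RatingServer` and `FilteringClient`, the in-memory dictionary standing in for the ratings database, the example URLs and categories, and the choice to pass unrated pages are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class RatingServer:
    """Hypothetical stand-in for the separate content rating server."""
    # Assumed in-memory "database" mapping a URL to its content rating.
    database: dict = field(default_factory=dict)

    def get_rating(self, url):
        # Retrieve the stored rating; None means the page has no rating.
        return self.database.get(url)


@dataclass
class FilteringClient:
    """Hypothetical stand-in for the content filtering client in a firewall."""
    server: RatingServer
    blocked_categories: set

    def handle(self, url):
        # One rating request is sent for each web page browsed.
        rating = self.server.get_rating(url)
        # Block pages whose rating falls in a disallowed category.
        # Assumption: pages without a rating are passed.
        if rating in self.blocked_categories:
            return "block"
        return "pass"


server = RatingServer({
    "http://example.com/news": "news",
    "http://example.com/casino": "gambling",
})
client = FilteringClient(server, blocked_categories={"gambling"})
print(client.handle("http://example.com/news"))    # pass
print(client.handle("http://example.com/casino"))  # block
```

In this sketch the policy is a simple set of blocked categories; restricting access only during predetermined periods of time would require an additional time check that is omitted here.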
To build the database of content ratings (hereinafter referred to as ratings) used in content filtering, one conventional way is to have a number of workers manually browse a number of web pages (e.g., Web crawling) to evaluate the content of the web pages. The workers then assign a content rating to each web page evaluated, and the content ratings are stored in the database. Although this type of rating is generally highly accurate, it takes a long time to rate web pages manually. Furthermore, the problem is worsened by the large number of web pages available over the Internet.
Another existing way to rate web pages is to scan the text in the web pages for keywords or key phrases and evaluate the web pages based on the presence or absence of certain keywords or key phrases. This mechanism can be automated using servers or computing devices well known in the art, and hence, is faster than the manual evaluation discussed above. However, this mechanism suffers from lower accuracy in the resulting ratings. For example, this mechanism may classify any web page having the keyword “breast” in the pornography category. As a result, medical web pages discussing breast cancer may be inadvertently classified as pornography because of the use of the word “breast.” Furthermore, content rating is typically limited to the content within a document or Web page.
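The misclassification described above can be reproduced with a minimal keyword-based rater. The sketch below is illustrative only: the function name `rate_by_keywords`, the tiny keyword table, and the sample sentence are assumptions, and a real keyword list would be far larger.

```python
import re

# Hypothetical keyword table mapping a keyword to a content category.
KEYWORD_CATEGORIES = {
    "breast": "pornography",
    "casino": "gambling",
}


def rate_by_keywords(text):
    """Rate text purely on the presence of keywords, ignoring context."""
    # Split the lowercased text into a set of alphabetic words.
    words = set(re.findall(r"[a-z]+", text.lower()))
    for keyword, category in KEYWORD_CATEGORIES.items():
        if keyword in words:
            return category
    # No keyword matched, so the text receives no category.
    return "unrated"


medical = "Early screening improves breast cancer survival rates."
print(rate_by_keywords(medical))  # pornography
```

Because the rater looks only at isolated words, the medical sentence receives the same category as a page that actually belongs in it, which is the accuracy problem noted above.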