Given the prevalence of malicious websites and of inappropriate or unwanted contents on the Internet, URL filtering is important for safe and efficient use of the Internet. URL filtering may be based on URL rating, which typically involves a rating server receiving URLs from clients and returning ratings (or categories) of those URLs to the clients.
FIG. 1A shows a simplified block diagram of a typical URL rating scheme. As shown in FIG. 1A, client 102 sends a URL in a rating request to rating server 104 through the Internet, and in return rating server 104 sends a rating to client 102 in real time or near real time. In some cases, a URL may include multiple contents (such as presented in frames) and require multiple ratings. Given the popularity of Internet usage, if a rating request were serviced by rating server 104 whenever client 102 wishes to access a webpage, the number of rating requests serviced on a given day may be quite large, necessitating large communication and processing bandwidth on the part of rating server 104 (or multiple servers, as may be the case).
In order to reduce the number of rating requests sent through the network and processed by rating server 104, client 102 may employ a local cache 106 for temporarily storing complete URLs (or their hash values) of previously accessed web pages, along with their corresponding ratings. Thus, if a web page has been rated once by rating server 104, a subsequent access request by client 102 would result in a local cache hit, negating the need to send the URL to rating server 104 again to obtain a rating.
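The client-side caching described above can be illustrated with a minimal sketch. The names `LocalRatingCache`, `get_rating`, and `query_server` are hypothetical and do not come from the description; the sketch assumes the cache stores a hash value of each complete URL (one of the two options mentioned) and that the rating server is reachable through some callable.

```python
import hashlib


class LocalRatingCache:
    """Hypothetical stand-in for local cache 106: maps URL hashes to ratings."""

    def __init__(self):
        self._ratings = {}

    @staticmethod
    def _key(url):
        # Assumption: the hash of the complete URL is stored, not the URL itself.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url):
        return self._ratings.get(self._key(url))

    def put(self, url, rating):
        self._ratings[self._key(url)] = rating


def get_rating(url, cache, query_server):
    """Return the rating for url, contacting the server only on a cache miss."""
    rating = cache.get(url)
    if rating is None:
        # Cache miss: one round trip to the rating server, then store the result.
        rating = query_server(url)
        cache.put(url, rating)
    return rating
```

Because the key is derived from the complete URL, two different pages of the same site occupy two separate cache entries, which is the storage cost noted in the discussion of FIG. 1B.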
FIG. 1B shows a schematic representation of local cache 106. As shown in FIG. 1B, local cache 106 stores an exemplary URL “http://www.springfieldgazzette.com/articles/20060502.html” with its corresponding rating “News”. There may be many more URLs and corresponding ratings stored in cache 106, of which the previously mentioned exemplary URL is only representative. For the cache to be useful in substantially reducing the number of rating requests sent through the network, a sizable cache that stores a sufficiently large number of frequently accessed URLs is desirable. This is because a cache hit requires that the URL and corresponding rating of the desired web page be locally cached. Such an arrangement, however, tends to result in an unduly high storage capacity requirement and inefficient use of the data storage device of client 102.
On the rating server side, techniques are also applied to rating server 104 to reduce the storage and processing requirements for servicing URLs sent by clients. For example, instead of processing the full URL (e.g., “http://www.springfieldgazzette.com/articles/20060502.html”) when received, server 104 may employ domain-based rating or directory-based rating in servicing the rating request.
FIG. 1C shows an illustrative example of prior art domain-based rating. As shown in FIG. 1C, rating server 104 processes only the domain portion of the full URL (e.g., only the “http://www.springfieldgazzette.com” portion of the full URL “http://www.springfieldgazzette.com/articles/20060502.html”) and provides the rating “news” to client 102. Domain-based rating is employed if it is known (or decided or designated) by rating server 104 that all contents of the domain “springfieldgazzette.com” are related to news, so that the “news” rating can be applied to all URLs that are associated with that domain. However, domain-based rating compromises the accuracy of rating, since there might be exceptions (e.g., categories or ratings other than “news”) or even malicious contents (such as phishing contents) in web pages of a given domain.
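A minimal sketch of domain-based rating follows. The table `DOMAIN_RATINGS` and the function `rate_by_domain` are hypothetical names introduced only for illustration, and the sketch assumes the server keys its table on the host name of the URL.

```python
from urllib.parse import urlsplit

# Hypothetical rating table: one entry covers every URL of the domain.
DOMAIN_RATINGS = {
    "www.springfieldgazzette.com": "news",
}


def rate_by_domain(url):
    """Rate a URL by its host name only, ignoring the path entirely."""
    host = urlsplit(url).hostname  # lowercased host name, or None if absent
    return DOMAIN_RATINGS.get(host)
```

Every path under the domain receives the same rating, so a page that is an exception to the domain's designated category cannot be distinguished by this lookup.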
Directory-based rating provides more granular rating than domain-based rating. With directory-based rating, rating server 104 processes a URL not only by its domain, but also up to its longest directory path (or to a desired directory level in the directory level tree). The rating is then applied to all sub-directories or files under that directory. FIGS. 1D and 1E show illustrative examples of prior art directory rating. In FIG. 1D, a full URL “http://www.lagazzette.com/articles/Julyrainfall.html” is processed only up to its directory portion (e.g., only the portion “http://www.lagazzette.com/articles/”) to derive a rating of “news.” In other words, URLs accessing files and sub-directories under “http://www.lagazzette.com/articles/” are given a rating of “news”. As another example, in FIG. 1E, a full URL “http://www.lagazzette.com/crossword/July122006.html” is processed only up to its directory portion (e.g., only the portion “http://www.lagazzette.com/crossword/”) to derive a rating of “entertainment.” In other words, URLs accessing files and sub-directories under “http://www.lagazzette.com/crossword/” are given a rating of “entertainment”. However, the higher accuracy of directory rating comes at the cost of higher storage capacity and processing power on the part of rating server 104.
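Directory-based rating can be sketched as a longest-prefix lookup over directory paths. The table `DIRECTORY_RATINGS` and the function `rate_by_directory` are hypothetical, and the sketch assumes the server tries the deepest directory of the URL first and then each parent directory in turn.

```python
from urllib.parse import urlsplit

# Hypothetical rating table keyed on host name plus directory path.
DIRECTORY_RATINGS = {
    "www.lagazzette.com/articles/": "news",
    "www.lagazzette.com/crossword/": "entertainment",
}


def rate_by_directory(url):
    """Rate a URL by the longest rated directory that contains it."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Directory names only: drop the final component (the file name).
    segments = [s for s in parts.path.split("/")[:-1] if s]
    # Try the deepest directory first, then walk up toward the root.
    for depth in range(len(segments), -1, -1):
        directory = "/" + "".join(seg + "/" for seg in segments[:depth])
        rating = DIRECTORY_RATINGS.get(host + directory)
        if rating is not None:
            return rating
    return None
```

Compared with the domain-based lookup, the table needs one entry per rated directory rather than one per domain, and each request may take several lookups, which reflects the storage and processing cost noted above.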
In light of the above, there is a need in the art for a method or apparatus that provides URL rating without compromising efficient use of network bandwidth, data storage, and processing power.