1. Field of the Invention
This invention pertains in general to computer security and in particular to categorizing web sites on the Internet to provide security and/or for other purposes.
2. Description of the Related Art
Modern security software monitors client web browsing in order to provide security for the user of the client and/or the enterprise at which the client is located. The security software can perform security actions based on the content of a browsed site. For example, the security software can block a user from visiting sites containing sexually explicit, job hunting, or gambling content.
To provide this form of security, the software must know the type of content provided by the sites that the user visits. For example, to block sexually explicit web sites, the security software must recognize that a given site provides sexually explicit content. To this end, some security software categorizes web sites based on the types of content provided by the web sites. Thus, a sexually explicit web site is categorized as “sexually explicit” and a gambling site is categorized as “gambling.”
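The category-based blocking described above can be illustrated with a minimal Python sketch. The category table, host names, blocked-category policy, and function name are all hypothetical and chosen only for illustration; they are not taken from any particular security product.

```python
from urllib.parse import urlsplit

# Hypothetical categorization database: host name -> content category.
SITE_CATEGORIES = {
    "casino.example": "gambling",
    "jobs.example": "job hunting",
    "news.example": "news",
}

# Hypothetical policy: categories that the security software blocks.
BLOCKED_CATEGORIES = {"sexually explicit", "job hunting", "gambling"}


def should_block(url):
    """Return True if the URL's host falls in a blocked category."""
    host = urlsplit(url).hostname or ""
    category = SITE_CATEGORIES.get(host)
    # Uncategorized sites are allowed here; this default is an assumption.
    return category in BLOCKED_CATEGORIES
```

In this sketch the blocking decision is only as good as the categorization table: a site that is missing from the table, or listed under the wrong category, is not blocked.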
Due to the large number of potential web sites and the constantly changing nature of the Internet, however, it is extremely difficult and expensive to maintain an up-to-date web site categorization database. Generating such a database requires web spidering/crawling capabilities as well as human and machine learning-based categorization technologies. In addition, web sites are increasingly using obfuscation techniques to thwart machine learning-based categorization. One such technique is to display only images and omit all text from a web page, which prevents keyword matching-based categorization.
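To make the obfuscation problem concrete, the following Python sketch shows one simple form of keyword matching-based categorization and how an image-only page defeats it. The keyword lists, class and function names, and sample pages are hypothetical; actual categorization technologies are considerably more elaborate.

```python
from html.parser import HTMLParser

# Hypothetical keyword lists for each category.
CATEGORY_KEYWORDS = {
    "gambling": {"casino", "poker", "jackpot"},
    "job hunting": {"resume", "careers", "salary"},
}


class TextExtractor(HTMLParser):
    """Collects the words in a page's text nodes; tags and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def categorize(html):
    """Return the category with the most keyword hits, or None if there are none."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = set(parser.words)
    best, best_hits = None, 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


# A page with text is categorized; an image-only page yields no keywords.
text_page = "<html><body><h1>Casino</h1><p>Play poker and win the jackpot</p></body></html>"
image_page = '<html><body><img src="banner1.png"><img src="banner2.png"></body></html>'
```

Under these assumptions, categorize(text_page) matches the gambling keywords, while categorize(image_page) finds no text at all and leaves the page uncategorized.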
As a result, providers of security software often must rely on extremely expensive manual web site inspection and categorization. Accordingly, there is a need for a more efficient and cost-effective way to categorize the content provided by web sites.