Field
Embodiments of the present invention generally relate to the field of network security techniques. In particular, various embodiments relate to classifying documents by hybrid classification engines.
Description of the Related Art
Web pages/sites may belong to different categories such as sports, news, entertainment, business, pornography, hate speech and the like, depending on the content/services being offered. As there are millions of web domains that include different types of content, some of such domains may include desired content, while others may include content that is undesirable for different types of users. Such undesired web domains are therefore typically classified, and a list of restricted web domains, which may be included in a blacklist, for example, is compiled so as to help network security devices/applications filter or block such traffic and/or inform a network administrator/user about the type of content that the requested web page and/or web domain contains.
Existing security devices/applications generally include a list of websites that need to be blocked depending on network settings and/or the profile of the user who attempts to access the websites. For example, if a child attempts to access a pornographic website, the security device/application may block access to the adult content website to prevent access by the child. Similarly, if someone tries to access similar objectionable content from office premises, such access can be blocked/denied by the security device/application. It is also possible that, for the same web domain, access is allowed for one user (for example, an adult), but not allowed for another user (for example, a child).
Existing security devices/applications also typically maintain a reference table that includes a list of websites that are classified in different categories, and refer to one or more policy rules to decide whether a particular user should be allowed access to a particular website. Compilation of such a list is a tedious and time-consuming task, wherein either the network administrator has to manually provide a list of restricted websites or the security device/application needs to expend valuable computing resources to classify observed websites into different categories to determine whether access to a particular website should be given.
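By way of a non-limiting illustration, the reference table and policy rules described above may be sketched as two lookups: one from web domain to category, and one from user profile to the categories denied to that profile. All names below (`CATEGORY_TABLE`, `BLOCKED_BY_PROFILE`, `is_allowed`) and all domain and profile entries are hypothetical and are not taken from any particular security device/application.

```python
# Hypothetical reference table: web domain -> category (illustrative entries only).
CATEGORY_TABLE = {
    "sports.example": "sports",
    "news.example": "news",
    "adult.example": "pornography",
}

# Hypothetical policy rules: user profile -> categories denied to that profile.
BLOCKED_BY_PROFILE = {
    "child": {"pornography", "hate speech"},
    "office": {"pornography", "entertainment"},
    "adult": set(),
}


def is_allowed(domain: str, profile: str) -> bool:
    """Return True if the given profile may access the given domain."""
    category = CATEGORY_TABLE.get(domain)
    if category is None:
        # Assumption of this sketch: unclassified domains are allowed.
        # A real device might instead classify the domain on demand.
        return True
    # Unknown profiles have no denied categories in this sketch.
    return category not in BLOCKED_BY_PROFILE.get(profile, set())
```

In this sketch the same web domain yields different decisions for different profiles: `is_allowed("adult.example", "adult")` evaluates to `True`, while `is_allowed("adult.example", "child")` evaluates to `False`.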
Web page classification, also commonly referred to as web page categorization or web domain classification, is a process of classifying web pages and/or web domains and/or Uniform Resource Locators (URLs) into different meaningful categories. Prior art solutions provide different classification approaches for classifying a web domain or a web page into different categories based on the content of the web page. A naïve Bayes classifier is a web content classifier based on Bayes' theorem with strong independence assumptions between the terms. For example, a term vector for an adult website (category Pornography) can be obtained from a training set of the category Pornography. A naïve Bayes classifier can classify web pages/sites that contain enough text content with high accuracy. Bigger training sets and vocabularies can further improve the performance of the naïve Bayes classifier. However, for rich media web pages that have limited text content and include mostly images, videos and/or content that is dynamically generated by scripting languages, e.g., JavaScript and PHP, the accuracy of a naïve Bayes classifier is typically much lower.
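For purposes of illustration only, a naïve Bayes text classifier of the kind described above may be sketched as follows. The sketch assumes a multinomial model with add-one (Laplace) smoothing and whitespace tokenization. These choices, the function names `tokenize`, `train` and `classify`, and the tiny training set are assumptions made for the example, not details of any particular prior art system.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Assumed tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


def train(labeled_docs):
    """Build per-category term counts from (text, category) pairs."""
    term_counts = defaultdict(Counter)  # category -> term vector
    doc_counts = Counter()              # category -> number of documents
    vocabulary = set()
    for text, category in labeled_docs:
        tokens = tokenize(text)
        term_counts[category].update(tokens)
        doc_counts[category] += 1
        vocabulary.update(tokens)
    return term_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log posterior score."""
    term_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        # Log prior of the category.
        score = math.log(doc_counts[category] / total_docs)
        total_terms = sum(term_counts[category].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # terms unseen in training carry no evidence
            count = term_counts[category][token]
            # Independence assumption: add one smoothed log likelihood per term.
            score += math.log((count + 1) / (total_terms + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training set, for illustration only.
TRAINING_SET = [
    ("football match score goal team", "sports"),
    ("league team player season win", "sports"),
    ("election government policy vote", "news"),
    ("minister parliament vote report", "news"),
]
model = train(TRAINING_SET)
```

With this training set, `classify("team wins the match", model)` returns `"sports"` and `classify("parliament vote", model)` returns `"news"`. A page whose text contains no known terms is scored on the category priors alone, which is one way to see why such a classifier struggles with pages that consist mostly of images, videos or script-generated content.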
Therefore, there exists a need for systems and methods for classifying web pages/sites by a hybrid classification engine with a naïve Bayes classifier and a sublink classifier.