Some content, such as web pages on the Internet, may not be appropriate for all users. Content may be categorized and filtered accordingly to reduce the availability of inappropriate content. Because of the growth of Internet traffic, the lack of central management of web content, and the desire to prevent people, especially children, from seeing offensive or inappropriate material on the Internet, web filters have been developed to block access to certain objectionable web pages, such as pornographic web pages. Accurate web filters rely on correct recognition of inappropriate web content, which is a content categorization task.
Manual and automated approaches to web content categorization may be used separately or in combination. In a manual approach, human analysts tag web pages with categories according to their content. The uniform resource locators (URLs) of these manually categorized web pages together with their category tags are stored in a database for future use. To categorize a web page, its URL is matched against the pre-categorized URLs in the database. If a match is found, the web page is classified in the category of the matched URL. Otherwise, the category of the web page is unknown.
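The database lookup described above can be illustrated with a minimal sketch. The URL entries, the category names, and the functions `normalize` and `lookup_category` are hypothetical and introduced only for illustration. The normalization step (ignoring scheme, host case, and a trailing slash) is an assumption; an actual system may match URLs differently.

```python
from urllib.parse import urlsplit

# Hypothetical database of analyst-tagged URLs; entries are illustrative only.
CATEGORIZED_URLS = {
    "example.com/news": "news",
    "example.org/casino": "gambling",
}


def normalize(url):
    """Reduce a URL to host + path so trivial variants map to one key.

    This normalization is an assumption, not something the text prescribes.
    """
    # urlsplit only recognizes the host when a scheme or leading "//" is present.
    parts = urlsplit(url if "://" in url else "//" + url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/")
    return host + path


def lookup_category(url, database=CATEGORIZED_URLS):
    """Return the stored category, or None when the URL is not in the database."""
    return database.get(normalize(url))
```

In this sketch a return value of `None` corresponds to the "unknown" outcome: a page whose URL was never tagged by an analyst cannot be categorized by lookup alone.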
An automated approach may apply machine learning techniques to create models of categories from the textual content of a set of pre-categorized training web pages. The learned models are then applied to classify web pages dynamically. An automated approach may complement a manual approach: it may be able to assign a category to every web page and to handle new and dynamically generated web pages.
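As one possible illustration of learning category models from pre-categorized text, the sketch below uses a multinomial naive Bayes classifier with add-one smoothing. The text does not name a specific learning technique, so this choice is an assumption. The training pages, category names, and the functions `tokenize`, `train`, and `classify` are likewise hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split page text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(pages):
    """Build per-category word counts from (text, category) training pairs."""
    word_counts = defaultdict(Counter)  # category -> token frequencies
    doc_counts = Counter()              # category -> number of training pages
    vocab = set()
    for text, category in pages:
        tokens = tokenize(text)
        word_counts[category].update(tokens)
        doc_counts[category] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    # Tokens never seen in training carry no evidence and are skipped.
    tokens = [t for t in tokenize(text) if t in vocab]
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)  # class prior
        total_words = sum(word_counts[category].values())
        for token in tokens:
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[category][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical pre-categorized training pages (illustrative text only).
TRAINING_PAGES = [
    ("poker casino jackpot betting odds", "gambling"),
    ("casino slots betting bonus", "gambling"),
    ("election news report government policy", "news"),
    ("breaking news weather report today", "news"),
]
model = train(TRAINING_PAGES)
```

Unlike a URL lookup, `classify` always returns some category for any page text, which reflects how an automated approach may cover new and dynamically generated pages that no analyst has tagged.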