The present invention relates in general to the field of internet communication, and in particular to a method and a system for automatic genre determination of web content. Still more particularly, the present invention relates to a data processing program and a computer program product for automatic genre determination of web content.
The internet contains a huge number of web pages, and is growing at an astonishing rate. It is not uncommon for 150,000 new domains to be registered under a generic top level domain (TLD) in a single day. Search engines crawl the internet, and attempt to index and classify these web pages. Web page classification is generally based on content classification. Simply stated, the textual content is extracted from the web page and analyzed for the occurrence of words and phrases indicative of a content genre. For example, to ascertain whether a web page contains content regarded as illicit, the text content could be compared against a list of known illicit keywords. If the text contains sufficiently many illicit keywords, then it is regarded as being illicit in content. The keywords used to classify content can also be ascertained automatically in a so-called “supervised learning” environment, where corpora of content relating to genre and not-genre are presented to a learning algorithm, which then selects the statistically significant keywords for the given genre.
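The keyword comparison described above can be sketched as follows. This is only a minimal illustration, not the method of any particular search engine: the function names, the sample keyword list and the `min_hits` threshold are assumptions made for the example, since the text says only that "sufficiently many" keywords must occur.

```python
import re


def keyword_hits(text, keywords):
    """Count how many distinct keywords from the list occur in the text."""
    # Lower-case the text and split it into simple word tokens.
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return len(words & {k.lower() for k in keywords})


def matches_genre(text, keywords, min_hits=3):
    """Treat the text as belonging to the genre if enough keywords occur.

    min_hits is a hypothetical threshold chosen for illustration.
    """
    return keyword_hits(text, keywords) >= min_hits


# Hypothetical keyword list for an example genre.
GAMBLING_KEYWORDS = {"casino", "poker", "jackpot", "betting"}

page_a = "Win the jackpot at our online casino, poker tables open"
page_b = "Birds migrate south in winter"

print(matches_genre(page_a, GAMBLING_KEYWORDS))
print(matches_genre(page_b, GAMBLING_KEYWORDS))
```

In a supervised learning setting the keyword list would not be written by hand as it is here, but selected by the learning algorithm from the genre and not-genre corpora.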
Keywords for a genre need not even be complete words. Using a technique known as “shingling” (or n-gram analysis), a “sliding window” of N characters is passed over the text, and a token of length N is extracted from each window position in the text. The window is typically shifted by a single character. The computing resources required for n-gram analysis are typically higher than those required for traditional keyword extraction, as there are far more possible token values. The technique is more robust, however, because it automatically takes word stems into account, for example.
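The sliding-window extraction can be illustrated with a short sketch. The function name `char_ngrams` and its parameters are assumptions made for the example. The window length `n` corresponds to N in the text, and `step=1` reflects the typical shift of a single character.

```python
def char_ngrams(text, n=3, step=1):
    """Return the character n-grams seen by a window of length n.

    The window starts at the first character and is shifted by `step`
    characters each time. A text shorter than n yields no tokens.
    """
    if n <= 0 or step <= 0:
        raise ValueError("n and step must be positive")
    return [text[i:i + n] for i in range(0, len(text) - n + 1, step)]


# Tokens for a short word, using a window of three characters.
print(char_ngrams("birds", n=3))

# Two inflections of the same stem share the n-grams covering that stem.
shared = set(char_ngrams("walking")) & set(char_ngrams("walked"))
print(sorted(shared))
```

The second example shows why the technique copes with word stems without a separate stemming step: the tokens that cover the common stem appear in both words, while the number of distinct tokens to be tracked grows considerably compared with whole-word keywords.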
An increasing proportion of web content is, however, not reliably classifiable via the text content. This is because the text itself does not relate to one particular genre, but rather, contains a wide range of different topics and contexts. An example of such web content is a “web log” page, often referred to as a “blog”. If the blog author limits his writings to a single topic, for example ornithology, then the content may be assigned by automatic classification to a general category, e.g. “nature” or “birds”. If the blog author writes about current affairs, for example, or has many different topics which interest him, then the content will not be reliably classifiable.
The real problem, however, is that the “true” genre of the web page is “blog”, and this genre can never be ascertained from the textual content of the web page. Blogs are only one example of a “meta” genre which is not classifiable via its content. Other examples are forums, chat rooms, social media sites, internet discussion sites and so forth. In general, all sites which act as a container for user generated content fall into this genre.
There is a current need for search engines to identify blog pages. For example, to include or exclude blog or social media content from search engine queries, the content, and therefore the web page, must first be labeled as belonging to such a genre.
In another scenario, a next-generation firewall product (NGFW) may wish to restrict access to, for example, chat rooms and social media sites. The ability to identify such web page genres in a “live” environment, or via a database of known URLs, would be very valuable.
Whilst there are many classification schemes for traditional web content, there are relatively few proposals to solve the problem of meta genres.
U.S. Pat. No. 7,565,350 B2, “IDENTIFYING A WEB PAGE AS BELONGING TO A BLOG”, discloses a method to identify blog pages. The disclosed method is restricted to blog pages, and could not be used to identify the other “meta” genres mentioned previously.