1. Field of the Invention
The present invention relates to methods and computer programs for selecting media candidates for advertising, and more particularly, methods, systems, and computer programs for sensitivity categorization of web pages.
2. Description of the Related Art
The computing industry has seen many advances in recent years, and such advances have produced a multitude of products and services. Internet websites are examples of such products and services, created to give users access to particular types of services, data, or searching capabilities. Online content providers are increasingly building World Wide Web sites that rely on dynamic, frequently updated content. More and more content is made available via online auction sites, stock market information sites, news and weather sites, and other sites whose information changes frequently, oftentimes daily.
Reputable advertisers do not want their ads associated with pages of a sensitive nature. Many advertisers require that their ads be shown only on pages that do not have sensitive content, such as content related to adult themes, alcohol, illegal drugs, news of death and suffering, etc. Advertising on such sensitive pages would have a negative impact on the image of advertisers and their products or services.
A system is required to identify when web pages contain sensitive material so that advertisers can avoid advertising on these pages. However, given the enormous variety of pages available on the web, it is practically impossible to categorize all pages precisely, and some sensitive pages will be wrongly categorized as non-sensitive. This can cause problems for ad-placement companies because advertisers will be unhappy if their products are shown on these pages.
On the other hand, a categorization system can define very stringent criteria to avoid this problem. As a result, many pages that are not sensitive will be categorized as sensitive in order to leave a margin for error. This creates a problem for ad-placement companies because their inventory of web pages available for advertising is diminished.
The primary requirement for sensitivity categorization is to have models that achieve very high recall, i.e., retrieve as many of the sensitive pages as possible, while maintaining reasonable precision, i.e., without flagging too many non-sensitive pages as sensitive. Two specific aspects of sensitivity categorization make it a hard problem. First, categorizing web pages is an inherently difficult task even for humans. Web pages typically have several facets, and identifying a page as ‘sensitive’ is often subjective. Second, sensitive pages are rare. This rarity implies a biased sampling and training process, which may produce models that generalize poorly to real traffic.
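By way of illustration only, the trade-off between recall and precision described above can be sketched as follows. The function name precision_recall and the ten-page sample are hypothetical and are not part of any described embodiment; the sketch assumes each page carries a true label and a predicted label, where True means sensitive.

```python
def precision_recall(actual, predicted):
    """Return (precision, recall) for the sensitive class (True = sensitive)."""
    pairs = list(zip(actual, predicted))
    true_pos = sum(1 for a, p in pairs if a and p)        # sensitive, flagged
    false_pos = sum(1 for a, p in pairs if not a and p)   # non-sensitive, flagged
    false_neg = sum(1 for a, p in pairs if a and not p)   # sensitive, missed
    flagged = true_pos + false_pos
    sensitive = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / sensitive if sensitive else 0.0
    return precision, recall


# Hypothetical sample: 10 pages, of which only 2 are truly sensitive.
actual = [True, True] + [False] * 8
# A stringent categorizer flags both sensitive pages plus 2 non-sensitive ones.
predicted = [True, True, True, True] + [False] * 6

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this assumed sample, every sensitive page is caught, so recall is 1.0, but half of the flagged pages are not sensitive, so precision is 0.5, and two usable pages are removed from the advertising inventory.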
It is in this context that embodiments of the invention arise.