Accurately classifying websites assists with many web-related tasks. A prominent example is gaining a rough understanding of website content: which broad topic a blog covers, which websites are online stores, which websites provide information on a specific topic, and so on.
Standard text classification approaches require sets of labeled, representative example websites. These approaches are too resource-intensive for many applications to run efficiently, and for dynamically changing topics (such as “top news”) it is difficult to keep classifiers up to date. Moreover, many web pages cannot be analyzed well with traditional text classification techniques because of the proliferation of video news and Flash animations.