User created content (UCC) is becoming an important data resource on the Internet. One popular type of user created content, directed towards user discussions, is referred to as a web forum (also named a bulletin board or discussion board). The data of a web forum are becoming very valuable for various web applications. For example, commercial search engines have begun to integrate forum data into their searches to improve the quality of search results. As another example, recent research efforts have tried to mine forum data to obtain useful information, such as business intelligence and expertise. In any such application, a general goal is to fetch data pages from various forum sites distributed over the Internet.
To download forum data effectively and efficiently, the characteristics of forums need to be understood, which involves understanding the forum pages and relationships between pages. Forum pages tend to be semi-structured, and are typically generated based upon pre-defined templates.
As a result of the structuring, the pages of a given forum site may be classified into several categories, in which each category represents a specific function. For example, generic forums usually have list-of-board pages, post-of-thread pages, user profile pages, and so forth; to extract post-of-thread content, identification of the post-of-thread pages is required.
Once pages are classified, the classification may be used in forum page understanding and for further analysis of forum data. Page classification is also valuable in forum crawling; e.g., it is a component used in recovering the structure of the forum site and in determining an optimized route for a crawler. Further, page classification can help filter out invalid pages and reduce duplicate pages; for example, the same pages (or other content) having different Uniform Resource Locators (URLs) are often generated for different requests, such as “view by date” or “view by title” requests.
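The duplicate-URL problem can be illustrated with a minimal sketch. It does not perform page classification; instead it uses a simpler stand-in technique, exact-content hashing, to show how different URLs can resolve to the same page. The function names and the idea of collapsing whitespace before hashing are assumptions made for illustration only.

```python
import hashlib


def fingerprint(html):
    """Hash page content after collapsing whitespace (an illustrative normalization)."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages):
    """Group URLs whose fetched content is identical.

    `pages` is a hypothetical mapping of URL -> already-downloaded HTML.
    Returns a list of URL groups that share the same fingerprint.
    """
    groups = {}
    for url, html in pages.items():
        groups.setdefault(fingerprint(html), []).append(url)
    # Keep only fingerprints reached through more than one URL.
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]
```

Exact hashing only catches byte-identical content after normalization; pages that differ in advertisements or timestamps would need a looser comparison.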
To categorize forum pages, URL pattern analysis may be used, particularly with respect to sites hosted by commercial forum service providers. For example, “*/forumdisplay.php?fid=*” refers to list-of-post pages, while “*/viewthread.php?tid=*” refers to post-of-thread pages. However, in many cases, a URL is ambiguous and does not reveal a page's function. As one example, professional forums and communities of large enterprises usually define their own forms, whereby, for example, a URL such as “http://www.wxyz-forums.net/” provides no readily apparent URL patterns indicative of different types of pages.
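As a rough sketch of URL pattern analysis, the two patterns quoted above can be turned into regular expressions and matched against candidate URLs. The helper names, the pattern table, and the example host are assumptions for illustration; only the two pattern strings and their page categories come from the text.

```python
import re


def wildcard_to_regex(pattern):
    """Treat '*' as 'any run of characters' and every other character literally."""
    return re.compile(re.escape(pattern).replace(r"\*", ".*"))


# Hypothetical pattern table; the patterns and labels are those quoted in the text.
URL_PATTERNS = [
    (wildcard_to_regex("*/forumdisplay.php?fid=*"), "list-of-post"),
    (wildcard_to_regex("*/viewthread.php?tid=*"), "post-of-thread"),
]


def classify_url(url):
    """Return the category of the first matching pattern, or None if the URL is ambiguous."""
    for regex, category in URL_PATTERNS:
        if regex.fullmatch(url):
            return category
    return None
```

With this table, a URL such as “http://www.wxyz-forums.net/” matches no pattern and is returned as unclassified, which is the limitation the paragraph describes.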
Another technique used in categorizing forum pages utilizes Document Object Model (DOM) tree-based structural criteria to describe target pages. However, using DOM trees to categorize forum pages does not provide a sufficient and robust solution, as similar pages may have different numbers of advertisements, images, and even complex sub-structures from user posts.