Various attempts have been made to mine community information from web pages using data mining techniques. For example, “community mining” may identify web sites that share certain common characteristics, wherein the identified web sites are the members of a community. Community mining techniques may model web data using a graph with vertices representing web pages or web sites and edges representing relationships between the web pages or web sites. Community mining techniques use different definitions of the characteristics that the members of a community share. For example, one community mining technique defines a community as a set of web sites that has more links to members of the community than to non-members. That community mining technique may use a maximum flow/minimum cut approach to identify subgraphs that satisfy the definition. Another technique defines a community as a dense directed bipartite subgraph that contains a complete bipartite subgraph of a certain size. A well-known technique for ranking web pages, the Hyperlink-Induced Topic Search (“HITS”) technique, defines a community as a set of authority web pages that share a common topic and are linked to by important hub web pages. In the area of social network analysis, a community has been defined as a set of users who share common interests as indicated by their electronic mail communications. Another community mining technique defines a community based on the popularity of different types of objects, calculated using a graph with vertices representing heterogeneous objects. Some community mining techniques have identified communities based on the evolution of web data over time. These techniques compare the data at different points in time using dynamic metrics such as growth rate, novelty, and stability.
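As an illustration of the link-based definition above, the following minimal sketch tests whether a candidate set of web sites satisfies it. It assumes one possible reading of the definition, namely that each member must have more outgoing links to members than to non-members. The function name `is_community` and the dictionary representation of the link graph are hypothetical and chosen only for illustration. The sketch checks a given candidate set and does not perform the maximum flow/minimum cut search that such a technique may use to find candidates.

```python
def is_community(members, links):
    """Return True if every member links to more members than non-members.

    members: iterable of site identifiers forming the candidate community.
    links:   dict mapping each site to an iterable of the sites it links to
             (hypothetical representation, assumed for this sketch).
    """
    members = set(members)
    for site in members:
        # Distinct link targets of this site, ignoring self-links.
        targets = set(links.get(site, ())) - {site}
        inside = len(targets & members)    # links to fellow members
        outside = len(targets - members)   # links to non-members
        if inside <= outside:
            return False
    return True


if __name__ == "__main__":
    example_links = {
        "a": ["b", "c"],
        "b": ["a", "c"],
        "c": ["a", "b", "x"],
        "x": ["a"],
    }
    print(is_community({"a", "b", "c"}, example_links))
    print(is_community({"a", "x"}, example_links))
```

In this example graph, sites "a", "b", and "c" link mostly to one another, so the set containing them satisfies the assumed definition, whereas the set containing only "a" and "x" does not.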
One example of community mining is the HITS technique, which is based on the principle that web pages will have links to (i.e., “outgoing links”) important web pages. Thus, the importance of a web page is based on the number and importance of other web pages that link to that web page (i.e., “incoming links”). The HITS technique is additionally based on the principle that a web page that has many links to other important web pages may itself be important. Thus, HITS divides “importance” of web pages into two related attributes: “hub” and “authority.” “Hub” is measured by the “authority” scores of the web pages that a web page links to, and “authority” is measured by the “hub” scores of the web pages that link to the web page. The HITS technique calculates importance based on an initial set of web pages and other web pages that are related to the initial set by following incoming and outgoing links. The HITS technique submits a query to a search engine service and uses the web pages of the results as the initial set of web pages. The HITS technique adds to the set those web pages that are the sources of incoming links and those web pages that are the destinations of outgoing links of the web pages of the results. The HITS technique then calculates the authority and hub scores of each web page using an iterative algorithm.
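The iterative calculation of hub and authority scores described above can be sketched as follows. This is a minimal illustration and not a definitive implementation: the function name `hits`, the dictionary representation of the link graph, the fixed number of iterations, and the normalization of the scores to unit Euclidean length after each update are assumptions made for the sketch. The sketch also omits submitting the query to the search engine service and expanding the initial set of web pages.

```python
from math import sqrt


def hits(links, iterations=50):
    """Compute hub and authority scores for a small link graph.

    links: dict mapping each page to an iterable of pages it links to
           (hypothetical representation, assumed for this sketch).
    Returns a pair of dicts (hub, authority) keyed by page.
    """
    # Collect every page that appears as a link source or destination.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # incoming[p] lists the pages that link to p.
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for dst in targets:
            incoming[dst].append(src)

    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority: sum of the hub scores of pages linking to the page.
        auth = {p: sum(hub[q] for q in incoming[p]) for p in pages}
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}

        # Hub: sum of the authority scores of pages the page links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}

    return hub, auth


if __name__ == "__main__":
    # Pages "a" and "b" both link to "c"; "c" has no outgoing links.
    hub, auth = hits({"a": ["c"], "b": ["c"], "c": []})
    for page in sorted(hub):
        print(page, round(hub[page], 3), round(auth[page], 3))
```

In this small graph, page "c" receives all of the authority because both other pages link to it, while pages "a" and "b" share the hub score equally because each links only to "c". The normalization step is included only to keep the scores bounded across iterations.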
Typical community mining techniques use either dynamic web data or heterogeneous web data. Dynamic web data refers to web data that changes over time, so that its evolution can be analyzed. Heterogeneous web data refers to web data representing different types of objects. These community mining techniques, however, do not use both dynamic and heterogeneous web data to identify communities.