Expanding a seed set of web pages into a larger web community is a common procedure in link-based analysis of websites. The seed expansion problem has been addressed by numerous researchers as an intermediate step of various graph-analytic analyses on the web. Unfortunately, existing techniques for identifying web communities from web pages may be inefficient and may provide less than optimal results. In some cases, a larger neighborhood of web pages may be examined than is necessary for identifying communities of web pages. In other cases, the communities identified may include web pages without a strong relationship to the web community. For instance, the HITS algorithm, well known in the field, uses a search engine to generate a seed set and then performs a fixed-depth neighborhood expansion to generate a larger set of pages to which the HITS algorithm is applied. This general technique has seen broad adoption and is now common in local link-based analysis. Variants of this technique have been employed in community finding, in finding similar pages, in PageRank, in TrustRank, and in classification of web pages. More sophisticated expansions have been applied in the context of community discovery.
However, expanding a seed set using a fixed-depth expansion may ignore a target community that includes the seed set, and may result in rapid expansion from the seed set in the graph before a large fraction of the nodes in the target community has been reached. Thus, a fixed-depth expansion may yield a poor approximation of the community and may further produce an impractically large candidate set for further processing.
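The following is a minimal sketch of a fixed-depth expansion, included only to illustrate the drawback described above. The adjacency-dictionary representation, the function names fixed_depth_expansion and build_example_graph, and the toy graph itself are assumptions made for illustration; they are not taken from the HITS algorithm or from any particular implementation. In the toy graph, pages a, b, c, and d form the target community, while page h is a hypothetical hub linking to ten unrelated pages.

```python
from collections import deque


def fixed_depth_expansion(graph, seeds, depth):
    """Return every node within `depth` hops of any seed.

    `graph` is assumed to be a dict mapping each node to an iterable of
    its neighbors (links treated as undirected for simplicity).
    """
    visited = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, dist = frontier.popleft()
        if dist == depth:
            continue  # do not expand beyond the fixed depth
        for neighbor in graph.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return visited


def build_example_graph(num_outside=10):
    """Hypothetical toy graph: community path a-b-c-d plus a hub h next to a."""
    graph = {
        "a": ["b", "h"],
        "b": ["a", "c"],
        "c": ["b", "d"],
        "d": ["c"],
        "h": ["a"],
    }
    for i in range(num_outside):
        page = f"x{i}"
        graph["h"].append(page)  # hub links to an unrelated page
        graph[page] = ["h"]
    return graph


if __name__ == "__main__":
    candidates = fixed_depth_expansion(build_example_graph(), ["a"], depth=2)
    print(len(candidates), "d" in candidates)
```

With seed a and a depth of 2, the candidate set in this toy graph contains 14 pages, ten of which are the unrelated pages reached through the hub, while community page d (three hops away) has still not been reached.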
Other techniques have defined a community to be a subgraph bounded by a small cut, which may be obtained by first growing a candidate set and then pruning it back. This process may be repeated several times while adding nodes from the candidate set at each step to ensure expansion of the seed set. Another approach for ensuring a reasonable expansion of a seed set may be to apply graph conductance. Graph conductance, or the normalized cut metric, is a quotient-style metric that may provide an incentive for growing the seed set. However, such an improvement in the conductance score may come at the expense of adding barely related nodes, or even a disconnected component, to the seed set. As a result, web pages without a strong relationship to the web community may be included in an identified web community.
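The effect described above may be illustrated with a short sketch. The text does not give a formula for conductance, so the sketch assumes one commonly used definition: the number of edges leaving a set S, divided by the smaller of the total degree of S and the total degree of the remaining nodes, with lower scores treated as better. The function names conductance and build_undirected_graph and the example edges are hypothetical and chosen only for illustration.

```python
def build_undirected_graph(edges):
    """Build an adjacency dict (node -> set of neighbors) from undirected edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def conductance(graph, subset):
    """Conductance of `subset` under the assumed definition:
    cut(S, V - S) / min(vol(S), vol(V - S)), where vol is the sum of degrees.
    """
    subset = set(subset)
    cut = 0        # edges with exactly one endpoint inside the subset
    vol_in = 0     # sum of degrees of nodes inside the subset
    vol_total = 0  # sum of degrees over the whole graph
    for node, neighbors in graph.items():
        degree = len(neighbors)
        vol_total += degree
        if node in subset:
            vol_in += degree
            cut += sum(1 for n in neighbors if n not in subset)
    denominator = min(vol_in, vol_total - vol_in)
    if denominator == 0:
        raise ValueError("conductance is undefined for an empty or full-volume set")
    return cut / denominator


# Hypothetical example: two triangles joined by edge c-d, plus an isolated edge u-v.
EXAMPLE_EDGES = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
    ("u", "v"),
]

if __name__ == "__main__":
    g = build_undirected_graph(EXAMPLE_EDGES)
    print(conductance(g, {"a", "b"}))
    print(conductance(g, {"a", "b", "u", "v"}))
```

In this example, the seed set {a, b} has a conductance of 2/4 = 0.5, whereas adding the disconnected pair {u, v} lowers the score to 2/6, even though u and v share no links with the seed set. The cut is unchanged and only the volume grows, which is how a quotient-style metric may reward the inclusion of unrelated nodes.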
What is needed is a way to identify communities with conductance guarantees that may also be computed locally by examining only a small neighborhood of the entire graph. Such a system and method should be able to ensure that the included web pages have a strong relationship to the identified web community.