The World-wide Web has several thousand well-known, explicitly-defined communities, i.e., groups of individuals who share a common interest, together with the Web pages most popular amongst them. Consider, for instance, the community of Web users interested in Porsche Boxster cars. Indeed, there are several explicitly-gathered resource collections devoted to the Boxster, such as those listed under the category “Recreation: Automotive: Makes and Models: Porsche: Boxster” at the Yahoo! Web site (yahoo.com). Most of these communities manifest themselves as newsgroups, Web rings, resource collections in directories such as Yahoo! and Infoseek, or homesteads on GeoCities. Other examples include popular topics such as “Major League Baseball,” or the somewhat less visible community of prepaid phone card collectors. The explicit nature of these communities makes them easy to find: it is simply a matter of visiting the appropriate portal or newsgroup.
Even with such a large number of explicitly-defined communities on the Web, there are still several tens of thousands of other implicitly-defined communities, owing to the distributed and almost chaotic nature of content creation on the Web. Such implicitly-defined communities often focus on a level of detail that is typically far too fine to attract the current interest (and resources) of large portals, which would otherwise develop long lists of resource pages for them. Viewed another way, what is needed are methods for identifying Web communities at a far more nascent stage than is possible through systematic and institutionalized ontological efforts.
There are at least three reasons for systematically extracting such communities from the Web as they emerge. First, these communities provide valuable and possibly the most reliable information resources for a user interested in them. Second, they represent the sociology of the Web: studying them gives insights into the intellectual evolution of the Web. Finally, portals that identify and distinguish between these communities can target advertising at a very precise level.
These implicit communities seem to outnumber the explicit ones by at least an order of magnitude. It appears unlikely that any manual, explicitly-organized effort can successfully identify and bring order to all of these implicit communities, especially since their number will continue to grow rapidly with the Web. Indeed, as shown later in the specification, such communities sometimes emerge on the Web even before the individual participants become aware of their existence.
There are several technologies of interest in identifying implicit communities on the Web. One of these relies on the analysis of the link structure of Web pages. A number of search engines and retrieval projects have used links to provide additional information regarding the quality and reliability of search results. See, for instance, the HITS algorithm described in “Authoritative Sources In A Hyperlinked Environment,” by J. Kleinberg, Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms (SODA), 1998, and “Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text,” by Chakrabarti et al., Proceedings of the 7th World-Wide Web Conference, Australia, 1998. The connectivity server described by Bharat et al. also provides a fast index to linkage information on the Web. See, for example, “The Connectivity Server: Fast Access To Linkage Information On The Web,” Proceedings of the 7th World-Wide Web Conference, Australia, 1998. Although link analysis has been used as a search tool, it has not previously been applied to mining the community structure of the Web.
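For concreteness, the iterative hub/authority computation at the heart of HITS can be sketched as follows. This is a minimal illustration only; the link graph and page names are hypothetical, and the published algorithm operates on a query-focused subgraph rather than an arbitrary dictionary of links.

```python
# Minimal sketch of the HITS hub/authority iteration (Kleinberg, 1998).
# The link graph below is purely illustrative.
def hits(graph, iterations=50):
    """graph maps each page to the list of pages it links to."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hubs = {n: 1.0 for n in nodes}
    auths = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing to it.
        auths = {n: sum(hubs[u] for u in graph if n in graph[u]) for n in nodes}
        norm = sum(a * a for a in auths.values()) ** 0.5 or 1.0
        auths = {n: a / norm for n, a in auths.items()}
        # Hub score: sum of authority scores of the pages it points to.
        hubs = {n: sum(auths[v] for v in graph.get(n, ())) for n in nodes}
        norm = sum(h * h for h in hubs.values()) ** 0.5 or 1.0
        hubs = {n: h / norm for n, h in hubs.items()}
    return hubs, auths

# Two "hub" pages both cite the same two "authority" pages.
links = {"hub1": ["auth1", "auth2"], "hub2": ["auth1", "auth2"],
         "auth1": [], "auth2": []}
hubs, auths = hits(links)
```

In this toy graph the two cited pages receive equal, high authority scores while the citing pages receive the hub scores, mirroring the hub-and-authority distinction discussed above.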
Another related area is information foraging. Prior work in information foraging generally has a few main themes. The first is the information search and foraging paradigm, originally proposed in the Web context by Pirolli et al. in the paper “Silk From A Sow's Ear: Extracting Usable Structures From The Web,” Proceedings of the ACM SIGCHI Conference on Human Factors in Computing, 1996. Here, the authors show that Web pages fall into a number of types characterized by their role in helping an information forager find and satisfy his/her information need. These categories are much finer than the hub-and-authority view taken by Kleinberg and Chakrabarti et al. The authors also find that the classification of Web pages into such types provides a significant “value add” to the browsing and foraging experience. Their techniques, however, appear unlikely to scale to the size of the data currently existing on the World-wide Web.
A view of the Web as a semi-structured database has also been advanced by several authors. See, for example, “The Lorel Query Language For Semistructured Data,” S. Abiteboul et al., International Journal on Digital Libraries, pages 68-88, No. 1, Vol. 1, 1997, and “Querying the World Wide Web,” Mendelzon et al., International Journal on Digital Libraries, pages 54-56, No. 1, Vol. 1, 1997. These views support a structured query interface to the Web, similar to the Structured Query Language (SQL). An advantage of this approach is that many interesting queries, including methods such as the HITS algorithm described above, can be expressed as simple expressions in the very powerful SQL syntax. The corresponding disadvantage is that this generality comes with an associated computational cost which is prohibitive in our context.
Another system, Squeal, is described by Ellen Spertus in “ParaSite: Mining the Structural Information On the World-Wide Web,” Ph.D. Thesis, MIT, February 1998; it is built on top of a relational database. The relations extracted by this system and maintained in the underlying database allow for the mining of several interesting pages and interesting structures in the Web graph. Again, the value of such a system lies in providing a more powerful interface which allows the relatively simple specification of interesting structures in the Web graph. However, the generality of the approach is a primary inhibiting factor in scaling it to large data sets.
Traditional data mining techniques may also be considered for searching the Web for hidden communities, such as the technique described by Agrawal et al. in the paper entitled “Fast Algorithms For Mining Association Rules,” Proceedings of the Very Large Data Base Conference, Santiago, Chile, 1994. Data mining, however, focuses largely on algorithms for inferring association rules and other statistical correlation measures in a given dataset. The notion of trawling differs from data mining in several ways. First, trawling is concerned with finding structures that are relatively rare, i.e., the graph-theoretic signatures of communities being looked for number perhaps only a handful for any single community. Second, exhaustive search of the solution space is infeasible, even with efficient methods such as the Apriori algorithm described by Agrawal et al. Unlike market baskets, where there are at most about a million distinct items, there are two to three orders of magnitude more “items,” i.e., Web pages, in this case. Finally, the relationship of interest, namely co-citation, is effectively the join of the Web “points to” relation and its transposed version, the Web “pointed to by” relation. The size of this relation is potentially much larger than the original “points to” relation. Thus, one would need a method that works implicitly with the original “points to” relation, without ever computing the co-citation relation explicitly. The issue then is to find trawling methods that scale to the enormous size of the World-wide Web.
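To make the co-citation relation concrete, the following hypothetical sketch derives it from a toy “points to” relation: two pages are co-cited whenever some third page links to both. The relation names and the toy data are illustrative only; the sketch also shows why materializing the relation is problematic, since the number of co-cited pairs can greatly exceed the number of links.

```python
from itertools import combinations
from collections import Counter

# Illustrative sketch: the co-citation relation is the join of the
# "points to" relation with its transpose ("pointed to by"). Here the
# pair counts are accumulated by streaming over each page's out-links.
# On Web scale even these pair counts become prohibitively large, which
# is precisely why trawling must avoid computing the relation explicitly.
def cocitation_counts(points_to, threshold=2):
    counts = Counter()
    for page, targets in points_to.items():
        # Every pair of out-links of `page` is one co-citation instance.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= threshold}

# Toy "points to" relation; page names are hypothetical.
web = {
    "p1": ["a", "b"],
    "p2": ["a", "b"],
    "p3": ["a", "c"],
}
print(cocitation_counts(web))  # {('a', 'b'): 2}
```

Here pages “a” and “b” are co-cited twice (by “p1” and “p2”), while the pair (“a”, “c”) falls below the threshold; a frequently co-cited pair is one elementary signature of a shared interest.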
The work of Mendelzon et al. described in “Finding Regular Simple Paths in Graph Databases,” SIAM J. Comp. 24(6), 1995, pages 1235-1258, is an instance of structural methods in mining. The authors show that the traditional Structured Query Language (SQL) interface to databases is inadequate for specifying several structural queries that are interesting in the context of the Web. An example in the paper is the path connectivity between vertices that is subject to constraints on the sequence of edges along the path (expressed as a regular expression). They show that structures such as these can be described in a more intuitive, graph-theoretic query language, G+. The authors also provide several interesting algorithms and intractability results that relate to this and similar query languages. These algorithmic methods, although very general, do not support the scale and efficiency required for identifying implicit Web communities.
In the paper titled “Inferring Web Communities From Link Topology,” Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, 1998, Gibson et al. describe experiments on the Web in which they use spectral methods to extract information about “communities” in the Web. The non-principal eigenvectors of the matrices described by Kleinberg in “Authoritative Sources In A Hyperlinked Environment,” Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms (SODA), 1998, are used to define the communities. It is shown that the non-principal eigenvectors of the co-citation matrix reveal interesting information about the fine structure of a Web community. While eigenvectors seem to provide useful information in the context of search and clustering in purely textual corpora as well, they can be computationally expensive at the scale of the Web. In addition, they need not be complete, i.e., instances of interesting structures could be left undiscovered. Unlike “false positives,” this may not be a problem as long as not too many communities are missed.
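The role of non-principal eigenvectors can be illustrated on a small hypothetical graph containing two disjoint communities of unequal size. The page names, the graph, and the use of a dense matrix are all assumptions made for the sketch; at Web scale this dense eigendecomposition is exactly the computational expense noted above.

```python
import numpy as np

# Hypothetical toy graph with two implicit communities: three pages cite
# {a1, a2}, and two other pages cite {b1, b2}. All names are illustrative.
pages = ["h1", "h2", "h3", "h4", "h5", "a1", "a2", "b1", "b2"]
links = {
    "h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1", "a2"],
    "h4": ["b1", "b2"], "h5": ["b1", "b2"],
}
idx = {p: i for i, p in enumerate(pages)}
A = np.zeros((len(pages), len(pages)))
for src, targets in links.items():
    for dst in targets:
        A[idx[src], idx[dst]] = 1.0

# Co-citation matrix: entry (i, j) counts the pages linking to both i and j.
C = A.T @ A

# np.linalg.eigh returns eigenvalues in ascending order, so the last
# column is the principal eigenvector and the one before it is the
# first non-principal eigenvector.
eigenvalues, eigenvectors = np.linalg.eigh(C)
principal = eigenvectors[:, -1]      # concentrated on the larger community
non_principal = eigenvectors[:, -2]  # reveals the smaller community
```

The principal eigenvector is supported on the more heavily cited community {a1, a2}, while the first non-principal eigenvector isolates {b1, b2}: the fine structure that the principal eigenvector alone would miss.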
Therefore, there remains a need for a method and system for trawling the Web to identify implicitly-defined communities of Web pages concerning specific topics of general interest, without the above-described drawbacks.