There is an existing and growing demand for automated software tools that provide a simple, quick, and inexpensive way for users in business, government, and academia to collect information on the nature of virtual communities on the World Wide Web (or the “Internet” generally) and the sites that make them up. Research about the Internet, and particularly the World Wide Web, is perhaps the fastest-growing empirical field of study in the social sciences, and is gaining great prominence in the government policy and business arenas as well. In particular, researchers are eager to learn more about the characteristics of a community of sites relating to a particular topic, the relationships between sites in such a community, and the type of audience to which such sites appeal.
However, one notable feature of existing studies of the social structure of the Web is that few make any use of systematic quantitative or qualitative data, particularly data drawn from the Web itself. This shortage is not due to any lack of interest in systematic analysis within the academic, business, and government communities. Indeed, statistics, content analysis, and other systematic techniques are pervasive throughout each of the aforementioned professions. Nor is the problem the inaccessibility of information from which such data could be generated, since one of the notable characteristics of the Web is that much of the information on it is publicly accessible to anyone with an Internet connection. The real problem is the lack of a readily available means for extracting systematic data on virtual communities from available information and putting it in usable form.
The present inventor previously published a paper entitled “Applying a New Empirical Technology to the World Wide Web”, by Sun-Ki Chai, 97th Annual Meeting of the American Sociological Association, August 2002, Chicago, Ill. The paper noted the lack of automated software tools for systematic collection of data on Web community sites, and proposed the concept of using a “centrality algorithm” to collect quantitative information about sites of a Web community for statistical analyses. The centrality algorithm begins with an initial site or set of sites, as might be located by means of a Web directory or link page. As each site is downloaded, information is compiled on the links to outside sites contained within the site's web pages, as well as the size of its content. The software also compiles information on other sites that link to each site and on the site's overall popularity, based on information available at various web search engines and mapping sites. The methodology then incrementally adds sites to the set, using a priority ranking algorithm based upon links in and links out to the existing set. Such an algorithm ensures that the incrementally growing set is highly cohesive as a network, and thus can appropriately be viewed as a virtual community. Once a suitable set of sites is downloaded, centrality measures within the community are calculated by application of link information, and are supplemented with data about the basic characteristics of each site, such as domain, age, and popularity.
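The incremental growth and centrality steps described above can be illustrated with a minimal Python sketch. It is not the implementation from the cited paper: the in-memory link map, the function names, the scoring rule (a simple sum of links in and links out to the current set), and the choice of normalized in-degree as the centrality measure are all assumptions made for illustration. No downloading is performed; the link information is taken as already compiled.

```python
from typing import Dict, Set

# Hypothetical data structure: each site maps to the set of sites it links to.
LinkGraph = Dict[str, Set[str]]


def link_score(candidate: str, community: Set[str], links: LinkGraph) -> int:
    """Count links between a candidate site and the current community.

    Assumption: priority is the sum of links out (candidate -> member)
    and links in (member -> candidate). The source does not fix a formula.
    """
    links_out = len(links.get(candidate, set()) & community)
    links_in = sum(1 for member in community
                   if candidate in links.get(member, set()))
    return links_in + links_out


def grow_community(seeds: Set[str], links: LinkGraph, max_size: int) -> Set[str]:
    """Add the best-connected outside site, one at a time, up to max_size."""
    community = set(seeds)
    all_sites = set(links) | {t for targets in links.values() for t in targets}
    while len(community) < max_size:
        scored = [(link_score(site, community, links), site)
                  for site in all_sites - community]
        scored = [(score, site) for score, site in scored if score > 0]
        if not scored:
            break  # no remaining site is linked to the community
        # Highest score first; ties are broken alphabetically for determinism.
        _, best_site = min(scored, key=lambda pair: (-pair[0], pair[1]))
        community.add(best_site)
    return community


def in_degree_centrality(community: Set[str], links: LinkGraph) -> Dict[str, float]:
    """Fraction of other community members that link to each site."""
    n = len(community)
    centrality = {}
    for site in community:
        inbound = sum(1 for other in community
                      if other != site and site in links.get(other, set()))
        centrality[site] = inbound / (n - 1) if n > 1 else 0.0
    return centrality


# Hypothetical link data for six sites.
example_links: LinkGraph = {
    "a.example": {"b.example", "c.example"},
    "b.example": {"a.example", "c.example"},
    "c.example": {"a.example"},
    "d.example": {"c.example"},
    "e.example": {"f.example"},
    "f.example": set(),
}

community = grow_community({"a.example"}, example_links, max_size=4)
centrality = in_degree_centrality(community, example_links)
```

Starting from the single seed a.example, the sketch adds b.example, c.example, and d.example in that order and never admits e.example or f.example, which have no links to or from the growing set. Within the resulting community, c.example receives links from every other member and so has the highest in-degree centrality.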
“Centrality” concepts for locating important actors in social networks were originally proposed in the sociological literature, and various versions of such concepts exist, although they have not been explicitly used to analyze sites on the Internet or to identify virtual communities. A method of ranking linkages from related sites to a reference site, similar to centrality, has been used in the “PageRank” method developed by Sergey Brin and Lawrence Page for the well-known Google™ search engine for Internet searches. However, its implementation as a ranking algorithm for searches does not allow it to be used to identify virtual communities of interest to users.
There is also publicly available software for organizing Internet browsing, such as: (1) the class of “site-rippers” that help users batch download large numbers of web pages onto their local drives for offline browsing; and (2) the class of “browser assistants” that provide supplementary information on browsed pages and/or help to index and label the browser cache. Neither type of software is designed to provide systematic data on an entire virtual community, much less to output this data in a form that can be used by other analytical software.
In summary, no software is known that systematically gathers and provides data on virtual communities on the Web. It would be desirable to provide a software agent capable of automatically crawling the Web and locating virtual communities of interest to a user, and in particular, identifying key sites within the community and determining the content patterns that characterize the communications of such key sites. It would further be desirable for the software agent to collect quantitative and qualitative data about the characteristics of sites in the community in a form that can be used by standard statistical, content analysis, and other data processing software.