Most information databases and knowledge repositories may be viewed as comprising classes of objects that interact with each other, as qualified by different relationships. These classes of objects and their interactions may also change with time, providing a dynamic view of the interaction patterns. Thus, based on available meta-information about the objects and their relationships, one may capture a body of knowledge in terms of a dynamic complex network, where nodes represent entities or objects belonging to the different object classes, and links represent the fact that the associated nodes are related via a particular type of relationship. For example, in a friendship information database, the nodes correspond to individuals, and links correspond to the fact that two individuals know each other. To capture the complex nature of, and nuances inherent in, almost all information repositories, a linked database or the network representation has to be suitably annotated. For example, in the case of friendship information, each node would have relevant information about the individual it represents (e.g., age, sex, race, location, hobbies, profession etc.) and each link has to be qualified with attributes, such as the nature of relationship (e.g., romantic, work related, hobby related, family, went to school together etc.) and the strength of the relationship (e.g., frequency of contacts etc.).
The above-mentioned linked database or information network may easily become very large, comprising millions of nodes and links. For example, the World Wide Web (WWW) constitutes a network of this type, with potentially billions of nodes and links and complex relationships that qualify the links connecting the nodes or URLs. The large-scale and time-varying nature of such networks makes them dynamic complex networks, and their size has prevented direct and comprehensive mining and querying of such networks. The most common strategy has been to build structured databases, derived from the underlying network, and then to query these structured databases efficiently using existing tools. However, these indexed databases capture only particular slices or projections of the underlying network and do not provide answers to queries that do not directly fit the slice that was extracted to create the database. A good example is the service provided by Google: given keywords, it returns web pages that contain the specified keywords, ranked according to their relevance or importance; the relevance or importance of a page is determined by its location in the global WWW, i.e., how many other “important” pages point to it, etc. However, if one were to ask, for example, what a company's web presence is, in the sense of what types of individuals and news organizations are reporting on the company, whom they represent, and whether they are relevant or important to the company, then there are no easy keywords to get this information; one may have to perform an exhaustive search with different keywords followed by much post-processing in order to infer such information. Even then, one might get only those individuals or organizations who have directly reported on the company, and it will be hard to get other individuals and organizations that are closely related to these direct reporters.
Clearly, such information is embedded in the underlying network but is not accessible via keyword-based searches. It has not been clear how one might address this issue and extract such information efficiently.
Recently, some progress has been made in this direction, and researchers have started exploring so-called “communities” in complex networks or graphs. The underlying motivation comes from the fact that we often learn a lot about an individual by studying the communities to which the individual belongs. The concepts of such “communities” have been solely structural so far, and different researchers have used different concepts of communities in the literature. However, a common thread is the understanding that a structural community is a set of nodes that are much more interconnected amongst themselves than with the rest of the nodes in the network.
Until recently, the problem of finding communities in complex networks has been studied only in the context of graph partitioning. Recent approaches [9, 12, 15, 21] provide new insight into how communities may be identified and explored by optimizing the modularity of a partitioning of the network. These methods, inspired by diffusion theory, prune the edges with high betweenness to partition the graph from top to bottom and obtain cohesive communities.
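As an illustration of the modularity measure referred to above, the following minimal sketch scores a given partition of a small undirected, unweighted graph. It assumes the commonly used form Q = Σ_c [ L_c/m − (d_c/2m)² ], where m is the number of edges, L_c is the number of edges inside community c, and d_c is the sum of the degrees of the nodes in c. The function name `modularity` and the argument `community_of` are hypothetical names chosen for this example and do not come from the cited references.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Score a partition of an undirected, unweighted graph.

    edges        -- list of (u, v) pairs, each undirected edge listed once
    community_of -- dict mapping every node to a community label
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without edges")

    internal = defaultdict(int)    # L_c: edges with both endpoints in c
    degree_sum = defaultdict(int)  # d_c: total degree of the nodes in c

    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1

    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge (c, d).
example_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

print(modularity(example_edges, two_groups))  # positive: the split is meaningful
print(modularity(example_edges, one_group))   # 0.0: a single group scores zero
```

A partition that places every node in one community always scores zero under this form, which is why the measure is useful for comparing alternative partitions rather than as an absolute quantity.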
Finding the community structure of networks and identifying sets of closely related vertices have a large number of applications in various fields. Different methods have been used in the context of parallel computing, VLSI CAD, regulatory networks, digital libraries, and social networks of friendship. The problem of finding a partitioning of a graph has been of interest for a long time. The K-L (Kernighan-Lin) algorithm was first proposed in 1970 for bisection of graphs for VLSI layouts to achieve load balancing. Spectral partitioning [14] has been used to partition sparse matrices. Hierarchical clustering [18] has also been proposed to find cohesive social communities. While these algorithms perform well for certain partitioned graphs, they fail to explore and identify the community structure of general complex networks. In particular, they usually require the number of communities and their sizes as input.
A number of divisive and agglomerative clustering algorithms have been proposed. These algorithms, mostly inspired by diffusion theory concepts, identify the boundaries of communities as edges or nodes with high betweenness. While there is no standard definition for a community or group in a network, they use a proposed definition based on social formation and interaction of groups [19]. Radicchi et al. [15], similarly to [9], define communities in a strong and a weak sense. A subgraph is a community in the strong sense if each node has more connections within the community than with the rest of the graph. In a similar fashion, a subgraph is a community in the weak sense if the sum of all degrees within the subgraph is larger than the sum of all degrees toward the rest of the network. A similar definition is used in [7] to define web communities as a collection of web pages such that each member page has more hyperlinks (in either direction) within the community than outside of the community. Inspired by the social definition of groups, Girvan and Newman [9] propose a divisive algorithm that uses several edge betweenness definitions to prune the network edges and partition the network into several communities. This algorithm has a heavy computational complexity of O(m²n) on an arbitrary network with m edges and n vertices. Faster algorithms based on betweenness and similar ideas have been proposed [12, 15, 21], and a modularity measure has been proposed [12] to measure the quality of communities. A faster implementation of [12] is reported [4] to run in O(md log n), where d is the depth of the dendrogram describing the community structure of the network.
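The strong and weak community definitions above can be checked directly on a small graph. The following is a minimal sketch, assuming an undirected, unweighted graph given as a list of node pairs; the helper names (`build_adjacency`, `degree_split`, `is_strong_community`, `is_weak_community`) are hypothetical and are not taken from [15] or [9].

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def degree_split(adj, community, node):
    """Return (k_in, k_out): links from node into and out of the community."""
    k_in = sum(1 for neighbor in adj[node] if neighbor in community)
    k_out = len(adj[node]) - k_in
    return k_in, k_out


def is_strong_community(adj, community):
    """Strong sense: every member has more links inside than outside."""
    community = set(community)
    for node in community:
        k_in, k_out = degree_split(adj, community, node)
        if k_in <= k_out:
            return False
    return True


def is_weak_community(adj, community):
    """Weak sense: summed inward degree exceeds summed outward degree."""
    community = set(community)
    total_in = total_out = 0
    for node in community:
        k_in, k_out = degree_split(adj, community, node)
        total_in += k_in    # each internal edge is counted once per endpoint
        total_out += k_out
    return total_in > total_out


# A triangle a-b-c, with x attached to c and to two outside nodes y and z.
adj = build_adjacency([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "x"), ("x", "y"), ("x", "z"),
])

print(is_strong_community(adj, {"a", "b", "c"}))       # True
print(is_strong_community(adj, {"a", "b", "c", "x"}))  # False: x has 1 in, 2 out
print(is_weak_community(adj, {"a", "b", "c", "x"}))    # True: 8 in versus 2 out
```

The second and third calls show that the weak definition is strictly more permissive: a set can satisfy it even when one member is tied more closely to the rest of the network.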
Fast community-finding algorithms based on local computation may help in analyzing very large-scale networks and may prove useful in complex network identification and analysis applications. These methods have been applied in a number of different areas, including social networks [13], biological networks [3, 17], and software networks [11].
However, the proposed methods fail to identify overlapping communities and how strongly a node belongs to a community. They also require global knowledge of the network to generate communities of a particular subset of the network. Huberman et al. [21] note that the GN algorithm may be highly sensitive to network structure and may produce different solutions under small perturbations of that structure. As a solution, they propose a randomized version of these algorithms to achieve robustness and confidence in the community structure. However, the algorithm is still centralized and requires global knowledge of the network. A number of decentralized algorithms are based on random walks [10] or l-shell spreading [1]. These algorithms propose local methods to identify the community structure of complex networks.
The proposed approaches have shortcomings, including the following.
Requirement for Global Knowledge. Proposed approaches require global knowledge of the network structure, i.e., they need to know the global structure of the network in order to discover the community structure of a particular subset of nodes and their surroundings. This limitation is especially important for large-scale networks, where one is usually interested in the communities of a particular node or set of nodes.
Inability to Deal with Overlapping Communities. Proposed community-finding algorithms still find only cohesive subgroups [19], i.e., they partition the network into communities and provide a dendrogram of the community structure. It is noted that cohesive subgroups such as LS sets and λ sets may not overlap by sharing some but not all members [19, 23]. The fact that these sets are related by containment means that within a graph there is a hierarchy of a series of sets. Often, real-world networks do not have cohesive and independent clusters, but rather have overlapping communities, as in affiliation networks. Such networks are two-mode networks that focus on the affiliation of a set of actors with a set of events or communities, where each event consists of a subset of possibly overlapping communities. New algorithms are therefore needed to capture overlapping communities.
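The distinction between nested sets and genuinely overlapping communities can be made concrete with a small two-mode example. The sketch below assumes an affiliation network stored as a mapping from each event or community to the set of actors affiliated with it; this representation, the sample data, and the function name `partially_overlapping_pairs` are assumptions made for illustration only.

```python
from itertools import combinations


def partially_overlapping_pairs(memberships):
    """Return pairs of communities that share some but not all members.

    memberships -- dict mapping a community name to its set of actors.
    Pairs that are disjoint, or where one set contains the other, are
    excluded, since those are the cases a hierarchy can already express.
    """
    pairs = []
    for first, second in combinations(sorted(memberships), 2):
        a, b = memberships[first], memberships[second]
        shares_members = bool(a & b)
        is_nested = a <= b or b <= a
        if shares_members and not is_nested:
            pairs.append((first, second))
    return pairs


# Hypothetical affiliation data: community -> affiliated actors.
memberships = {
    "chess": {"ann", "bob", "cat"},
    "board": {"ann", "bob"},   # nested inside "chess"
    "choir": {"cat", "dan"},   # shares only "cat" with "chess"
}

print(partially_overlapping_pairs(memberships))  # [('chess', 'choir')]
```

Only the pair sharing some but not all members is reported; the nested pair and the disjoint pair could be represented by a dendrogram, whereas the partially overlapping pair could not.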
Complexity. An implementation of Newman's fast community-finding algorithm [4] is reported to run in O(md log n), where d is the depth of the dendrogram describing the community structure. For many applications, it is only required to find a community of a certain size related to a subset of nodes. Proposed diffusion-based algorithms do not scale, in the sense that they require processing of the whole network to obtain local structures. A bottom-up local algorithm may provide flexibility in search constraints.
Lack of Confidence. The GN method does not provide any confidence measure for nodes in a community. This issue is revisited in [21], but there is still no complete framework defined to measure the confidence with which a node belongs to a community.
Structural vs. Informational Communities. The existing community-finding algorithms find communities comprising nodes that are clustered or more linked among themselves than with the rest of the nodes in the network. However, in a linked database, there are different types of edges and nodes, and one might be interested in communities with respect to different relationships. For example, in the friendship network, we might be interested only in the communities that are based on romantic and family relationships. In such a case, we are dealing with a sub-network of interest in which only the edges representing such relationships are kept and the others are deleted. Similarly, one might ask about the community structure specific to a time period, or restricted to a set of geographical locations. Such communities may be referred to as informational communities. It is clear that if one were to pre-compute such informational communities and their various combinations, unions, and intersections for each node, one would hit the wall of combinatorial explosion very soon. This further underscores the need for finding query-based informational communities. Moreover, as noted earlier, one might be interested in the informational communities of a particular node or a set of nodes.
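The sub-network of interest described above can be derived at query time rather than pre-computed. The following minimal sketch assumes that each edge carries a dictionary of attributes with hypothetical keys `relation` and `year`. It keeps only the edges that match a query and then collects the nodes reachable from a seed node. Plain reachability is used here purely as a stand-in for a community-finding step and is not a method proposed in the text.

```python
from collections import defaultdict, deque


def filter_edges(edges, allowed_relations, year_range=None):
    """Keep the edges whose attributes match the query.

    edges             -- list of (u, v, attrs) with attrs["relation"], attrs["year"]
    allowed_relations -- set of relationship types to keep
    year_range        -- optional inclusive (first_year, last_year) pair
    """
    kept = []
    for u, v, attrs in edges:
        if attrs["relation"] not in allowed_relations:
            continue
        if year_range is not None:
            first_year, last_year = year_range
            if not first_year <= attrs["year"] <= last_year:
                continue
        kept.append((u, v))
    return kept


def reachable_from(edges, seed):
    """Return the set of nodes connected to seed, by breadth-first search."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Hypothetical annotated friendship edges.
friendships = [
    ("ann", "bob", {"relation": "family", "year": 2001}),
    ("bob", "cat", {"relation": "work", "year": 2003}),
    ("cat", "dan", {"relation": "romantic", "year": 2004}),
    ("ann", "eve", {"relation": "romantic", "year": 1995}),
]

close_ties = filter_edges(friendships, {"romantic", "family"})
print(sorted(reachable_from(close_ties, "ann")))   # ['ann', 'bob', 'eve']

recent_ties = filter_edges(friendships, {"romantic", "family"}, (2000, 2005))
print(sorted(reachable_from(recent_ties, "ann")))  # ['ann', 'bob']
```

Because the filter is applied per query, no combination of relationship types, time periods, or locations needs to be materialized in advance, which is the point of the query-based approach argued for above.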