Networks, such as social networks, can often be described by an interaction graph. An interaction graph comprises nodes and edges between nodes, wherein nodes represent entities and edges represent interactions between entities. For example, interactions may include entities coauthoring a paper, sharing a phone call, or exchanging an email.
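As an illustration only, an interaction graph of this kind can be held in memory as an adjacency map. The sketch below assumes interactions arrive as unordered pairs of entity names; the function name `build_interaction_graph` and the sample records are hypothetical and are not taken from any particular system.

```python
from collections import defaultdict

def build_interaction_graph(interactions):
    """Build an undirected interaction graph as an adjacency map.

    `interactions` is an iterable of (entity_a, entity_b) pairs, one per
    observed interaction (an assumed input format for this sketch).
    """
    graph = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions
        graph[a].add(b)  # edges are undirected, so record both directions
        graph[b].add(a)
    return dict(graph)

# Hypothetical records: two emails between alice and bob, one call bob-carol.
interactions = [("alice", "bob"), ("bob", "carol"), ("alice", "bob")]
graph = build_interaction_graph(interactions)
```

Repeated interactions between the same pair collapse into a single edge here; a weighted variant could instead count them.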
Given such a graph and an entity of interest in that graph, a goal is to find the natural grouping or community around that entity. A “community” is a group of entities that interact more with each other than with the rest of the entities in the graph. This notion tries to capture the way information flows in social networks. Notice that even if two entities in a community are not directly connected, information spreads from one to the other rather quickly when there are enough paths between them.
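One simple way to make the phrase "interact more with each other than with the rest" concrete is to count edges inside a candidate group against edges leaving it. The following is a minimal sketch under that assumption; the function name `interaction_counts` and the toy graph are hypothetical, and this count is only one of many possible measures.

```python
def interaction_counts(graph, group):
    """Return (internal, external) edge counts for a candidate group.

    `graph` maps each entity to the set of entities it interacts with
    (undirected). Internal edges have both endpoints in `group`;
    external edges have exactly one endpoint in `group`.
    """
    group = set(group)
    internal = 0
    external = 0
    for node in group:
        for neighbor in graph.get(node, ()):
            if neighbor in group:
                internal += 1
            else:
                external += 1
    # Each internal edge was seen from both of its endpoints.
    return internal // 2, external

# Toy graph: a, b, c form a triangle; c also interacts with d; d with e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
internal, external = interaction_counts(graph, {"a", "b", "c"})
```

In this toy graph the group {a, b, c} has three internal edges and one external edge, which matches the informal notion of a community.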
It is desirable that a community search exhibit the following characteristics:
(i) Focused. In many cases, such as tracking a crime group, it is the community around a particular individual or set of individuals that matters. Traditional approaches, rather than focusing on a specific individual, work on the entire graph. On a graph consisting of millions of entities, this is unnecessary work.
(ii) Scalable. An algorithm for focused community search should work in a scenario with millions or tens of millions of entities.
(iii) Robust. It is realized that data used to create an interaction graph is imperfect. First, some links will be false positives, representing interactions that have nothing to do with the community of interest. Second, some links representing relationships may not be present. This could be simply because some data is unobserved (e.g., not all papers are in the scientific literature digital library known as CiteSeer), or it could be because of the community structure (e.g., criminals hiding their interaction by dealing via a third party).
Unfortunately, existing approaches are unable to satisfy all three conditions.
One problem in a community search task is identifying communities. Informally, a community is a group of entities that belong together. Real-life communities are formed by people working together, sharing a hobby, living near each other, etc. Formalizing this intuition mathematically is difficult.
The simplest approach to community discovery around a starting point R is to return all nodes with direct links to R. This is the approach taken by Cortes et al., “Communities of Interest,” Proceedings IDA2001, 2001, and in Aiello et al., “Analysis of Communities of Interest in Data Networks,” Proceedings of PAM, 2005. While scalable and focused, these approaches are not robust. First, some neighbors of R result from essentially random interactions and do not reflect any sort of community at all. More seriously, much of the community might not be directly connected to R. Consider a graduate student in the data mining community who reviews papers, attends conferences, and so on. If the graduate student's only direct links in a citation graph are to his or her advisor and fellow graduate students, then the student will not show up as part of the community: a neighbors-only search starting at any researcher other than the advisor or those fellow students leaves the student out.
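The neighbors-only approach and its blind spot can be sketched in a few lines. This is an illustrative reading of the approach described above, not the code of the cited works; the function name `direct_neighbors` and the small citation graph are hypothetical.

```python
def direct_neighbors(graph, start):
    """Return the neighbors-only 'community' of `start`.

    `graph` maps each entity to the set of entities it is directly
    linked to (undirected).
    """
    return set(graph.get(start, ()))

# Hypothetical citation graph: the student is linked only to the advisor.
graph = {
    "student": {"advisor"},
    "advisor": {"student", "researcher"},
    "researcher": {"advisor", "colleague"},
    "colleague": {"researcher"},
}

community = direct_neighbors(graph, "researcher")
student_found = "student" in community
```

Starting from the researcher, the result contains the advisor and the colleague but not the student, even though the student is only two hops away.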
A natural extension of this approach is to look at entities within a certain distance of the start entity. This is called a distance-based community discovery approach. While this increases the number of relevant entities found, it also increases the number of irrelevant entities found, thus increasing recall at the cost of reduced precision. Furthermore, because social network graphs tend to have a small diameter, even a few hops can reach a large fraction of the graph, so this sort of expanding ring search is likely to be quite expensive.
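A distance-based search of this kind is commonly realized as a breadth-first search that stops at a hop limit. The sketch below assumes an unweighted, undirected adjacency map; the function name `within_distance` and the parameter `max_hops` are hypothetical names chosen for illustration.

```python
from collections import deque

def within_distance(graph, start, max_hops):
    """Return {entity: hop_count} for entities within `max_hops` of `start`.

    `graph` maps each entity to the set of entities it is directly
    linked to. The start entity itself is excluded from the result.
    """
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]
    return distances

# Hypothetical chain of interactions: r - a - b - c.
graph = {"r": {"a"}, "a": {"r", "b"}, "b": {"a", "c"}, "c": {"b"}}
ring = within_distance(graph, "r", 2)
```

Each increase in `max_hops` adds every entity on the next ring, relevant or not, which is the source of both the precision loss and the cost noted above.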
Accordingly, there is a need for community discovery techniques that overcome the above drawbacks, as well as other drawbacks, associated with existing community discovery approaches.