1. Field of the Invention
The illustrative embodiment of the present invention relates generally to group detection and more particularly to computer-implemented group detection.
2. Description of the Related Art
Group Detection (GD) is the act of finding clusters of associated entities given information about the connections between those entities. GD algorithms may be utilized to: (1) find groups; (2) create implicit links between individuals who are not explicitly linked; (3) identify couriers between groups; and (4) identify aliases or possible database errors. Group detection algorithms have a variety of applications in diverse industries and are not limited to the uses described in 1-4.
Group detection may be applied to a variety of domains. For example, GD may be utilized to identify team-membership given a dataset assembled from email traffic at a company. One might expect to see many emails exchanged between team members, and fewer emails exchanged between individuals who are on different teams. The high occurrence of emails between certain individuals implies team membership. Other applications include, for example: (1) finding cliques or social-groups given information about the communication habits of individuals; (2) finding related documents given information about document citation; and (3) finding athletic conferences given a teams' playoff schedule.
Manually looking for groups in a large dataset is nearly impossible. FIG. 3 shows a graph with 323 nodes and 4579 edges, which represents a simple dataset. With this relatively small dataset, the groups are very difficult to spot/identify with conventional methods, which are performed manually. Thus, a small number of group detection algorithms have been created/proposed. These few group detection algorithms that currently exist are based on probabilistic generative models. Probabilistic generative models assume that some parameterized random process generated the data, and these models try to learn the parameter values that best explain the data. With these models, analysts provide information such as the probability of a random link occurring between any two entities, and the models utilize this manually provided information to account for noise in the data. However, these algorithms are difficult to utilize when little to no information is known about the structure of the dataset. Furthermore, trial runs have shown that these algorithms perform poorly on datasets that lack noise.