Networks or graphs are useful in describing and quantifying relationships between entities in a broad variety of complex systems, such as the World Wide Web, the Internet, and social, biochemical and ecological systems. Studies suggest that networks often exhibit hierarchical organization, where vertices divide into groups that further subdivide into groups of groups, and so on. In many cases, these groups are found to correspond to known functional units, such as ecological niches in food webs, modules in biochemical networks, or communities in social networks. Network analysis has therefore been widely and successfully used in areas such as intelligence data analysis, social network analysis, Internet data processing, authorship networks, bioinformatics and medical data processing, and many others.
A hierarchical random graph (HRG) is a useful tool for clustering nodes in network graphs according to their connectivity with one another. The basic HRG algorithm was developed by Aaron Clauset, and employs Markov chain Monte Carlo (MCMC) simulation methods to compute a population of binary trees, called dendrograms. The general HRG concept is described in further detail in Clauset et al., “Structural Inference of Hierarchies in Networks,” Airoldi, E. M. et al. (eds.), ICML 2006 Workshop, Lecture Notes in Computer Science 4503; 1-13 (2007), and Clauset et al., “Hierarchical Structure and the Prediction of Missing Links in Networks,” 453 Nature 98-101 (May 2008), the contents of both of which are incorporated herein by reference.
In general terms, given a network graph G with n vertices, a dendrogram D is a binary tree with n leaves corresponding to the vertices of G, in which pairs of nodes are organized according to their connectivity in the network so that the tree closely emulates the structure of the original network. Each internal node (branch point) of the tree has exactly two children. The nodes of the network naturally cluster themselves in the tree: nodes that are very closely connected in the network are placed close to one another in the dendrogram. That is, such nodes share a very low-level common branch. Nodes that are less connected, however, share a higher-level branch. Nodes that are very far apart are connected only at the highest level of the dendrogram. A hierarchical random graph is the combination of a dendrogram with a set of connection probabilities, one associated with each internal node of the dendrogram.
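The structure described above can be illustrated with a minimal Python sketch. It assumes, following the Clauset et al. references, that the probability of a link between two vertices is the probability stored at their lowest common branch. The class name `Internal`, the function names, and the example tree and probability values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Internal:
    """An internal node (branch) of the dendrogram with exactly two children."""
    left: "Node"
    right: "Node"
    p: float  # connection probability attached to this branch (illustrative)


# A leaf is simply the label of a vertex of the network graph G.
Node = Union[Internal, str]


def leaves(node: Node) -> set:
    """Return the set of vertex labels found below a dendrogram node."""
    if isinstance(node, str):
        return {node}
    return leaves(node.left) | leaves(node.right)


def connection_probability(root: Node, u: str, v: str) -> float:
    """Return the probability stored at the lowest common branch of u and v."""
    node = root
    while isinstance(node, Internal):
        left, right = leaves(node.left), leaves(node.right)
        if u in left and v in left:
            node = node.left       # both vertices lie in the left subtree
        elif u in right and v in right:
            node = node.right      # both vertices lie in the right subtree
        elif (u in left and v in right) or (u in right and v in left):
            return node.p          # the two vertices split here
        else:
            break                  # at least one vertex is not in the tree
    raise ValueError("u and v must be two distinct leaves of the dendrogram")


# Hypothetical five-vertex dendrogram: b and c share the lowest branch,
# a joins them one level up, and the pair (d, e) joins only at the root.
root = Internal(
    Internal("a", Internal("b", "c", p=0.9), p=0.6),
    Internal("d", "e", p=0.8),
    p=0.1,
)

print(connection_probability(root, "b", "c"))
print(connection_probability(root, "a", "c"))
print(connection_probability(root, "a", "d"))
```

In this sketch, closely connected vertices meet at a low branch with a high probability, while vertices in different groups meet only at the root, where the probability is low.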
FIGS. 1A-1B are respectively a schematic diagram of a network graph and a schematic diagram of a corresponding dendrogram generated via the traditional HRG algorithm. Initially, nothing is known about the network except for the connectivity of the nodes. This is analogous to the network graph of FIG. 1A without any shading. At this initial stage, the network is disorganized and difficult to interpret. After processing the network into a dendrogram, the connectivity of the nodes becomes much clearer. It is possible to see the relationships between nodes based on the height of their common branch. For example, the dendrogram contains a group of three nodes 120. The two nodes to the right are connected at the lowest level, while the one node to the left is connected at the next level. These nodes are relatively far from another group of nodes 121 at the far right of the dendrogram; their common branch is at the top 122, indicating that they are not strongly related in the network.
After the dendrogram is generated, one can color code or shade the nodes in the network graph based on their closeness. The shading in FIG. 1A is the result of the clustering by the dendrogram. Thus, computing a dendrogram allows an individual to easily see the relationships in the network data, which might not be apparent from simple inspection of the network graph.
One drawback of the traditional HRG framework is that it is only applicable to simple networks in which the links between nodes exhibit an all-or-nothing behavior. That is, in the traditional HRG algorithm, either two nodes in the network are connected fully, or they are not connected at all. This limits the utility of the HRG algorithm to an extremely small subset of network science problems, such as those in ecology. For problems that require the analysis of networks in which nodes have different connection strengths, or networks whose connectivity changes over time (e.g., social networks), the application of traditional “all-or-nothing” HRG is insufficient.
In many networks, links between nodes must be expressed in terms of a weight, for example to express the quantity of goods flowing through a supply chain network or the frequency of communication in e-mail or cell phone networks. In addition, when dealing with dynamic networks in which the links between nodes change as a function of time and activity, one must be able to express the strength of the connections between nodes as a continuous variable.
There are two possible ways to apply the traditional HRG algorithm to networks in which the connections have variable strengths, such as weighted and dynamic networks. The simplest method is to consider all nonzero weights as generic connections. The actual weights of these connections must be handled internally by the algorithm, but do not affect the calculation of dendrograms during the MCMC process. The problem with this method is that it does not differentiate between very strong and very weak connections. That is, a connection with a strength of 0.99 would exhibit the same connectivity as a connection with a strength of 0.01, which would eliminate the ability of connections to compete against one another in the dendrogram population.
An alternative to thresholding the connections at zero is to use a variable threshold, in which all connections with a weight greater than the threshold are considered connected, while those whose connection strengths fall below the threshold are considered disconnected. While this is an improvement over the threshold-at-zero approach, it is still insufficient: every connection is still reduced to a binary value, so connections of different strengths that fall on the same side of the threshold remain indistinguishable.
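The two thresholding approaches can be compared with a short Python sketch. The function name `binarize`, the edge weights, and the threshold of 0.5 are hypothetical, and the weights are assumed to be non-negative values between 0 and 1.

```python
def binarize(weights, threshold=0.0):
    """Keep an edge only if its weight is strictly greater than the threshold.

    The result is an unweighted edge set: all information about how strong
    each surviving connection was is discarded.
    """
    return {edge for edge, weight in weights.items() if weight > threshold}


# Hypothetical weighted network (weights assumed to lie in [0, 1]).
weights = {
    ("a", "b"): 0.99,
    ("a", "c"): 0.01,
    ("b", "c"): 0.55,
    ("c", "d"): 0.45,
}

at_zero = binarize(weights)                 # threshold-at-zero approach
at_half = binarize(weights, threshold=0.5)  # variable-threshold approach

print(sorted(at_zero))
print(sorted(at_half))
```

With the threshold at zero, all four edges survive and the edges of strength 0.99 and 0.01 become equivalent. With the threshold at 0.5, the edges of strength 0.99 and 0.55 become equivalent, and the edges of strength 0.45 and 0.01 are both discarded, even though 0.45 is much closer to 0.55 than to 0.01.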
Another drawback of the original HRG framework is that it exclusively concerns networks having only one kind of relational attribute between network nodes. There are, however, many circumstances where multiple kinds of attributes between nodes are present. For example, the relation between two people in a social network may be revealed by both physical meetings and electronic communications (phone calls, emails, etc.). The original HRG framework can deal with such networks by ‘flattening’ the multiple attributes into a single quantity. One example of such flattening is to simply compute an average of multiple attributes, to come up with a single number (a weight). However, much information is lost by flattening the attributes into a single weight.
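The information lost by flattening can be seen in a small Python sketch. The attribute names, the values, and the helper `flatten` are hypothetical; each value is assumed to be a normalized strength for one kind of relation between a pair of people.

```python
from statistics import mean


def flatten(attributes):
    """Collapse several relational attributes into one weight by averaging."""
    return mean(attributes.values())


# Two hypothetical pairs of people with very different relationship profiles.
pair_1 = {"meetings": 0.9, "calls": 0.1, "emails": 0.2}  # mostly in person
pair_2 = {"meetings": 0.1, "calls": 0.9, "emails": 0.2}  # mostly by phone

# After flattening, the two pairs receive the same single weight.
print(flatten(pair_1) == flatten(pair_2))
```

Once averaged, the two pairs are indistinguishable, although one relationship is dominated by physical meetings and the other by phone calls.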
Accordingly, what is desired is a system and method for clustering nodes in network graphs that takes into account the strength of the connection between two nodes, as well as multiple attributes that may be present between the nodes. Such a system and method may be desirable to model and analyze multi-modal, relational, spatial-temporal, and multi-layered data to discover mixed communities from multi-layered relationships within the dataset.