Owing to the significant increase in communication between individuals via social media (Facebook, Twitter) or electronic formats (email, web, co-authorship) over the past two decades, network analysis has become an essential discipline.
It is now very common to represent networks wherein individuals or devices are modelled by nodes, and wherein text data that associate a pair of nodes, such as an email or social network message sent from a sender to a recipient, are modelled by textual edges. Edges can be directed, as in the case of an email from a sender to a recipient, or undirected, for instance when the text datum associating two nodes is a co-authored document, which associates its two authors without any direction. It is therefore of strong interest to be able to model and cluster those networks. Applications of network modelling exist in domains ranging from biology (analysis of gene regulation processes) to social sciences (analysis of political blogs) to historical sciences (representation of historical social networks).
Modelling of Networks of Binary Links Represented by Binary Edges
Statistical models for networks are known which make it possible to infer clusters based on the existence of links between the nodes. In this context, the nodes can be represented, in a graphical representation of the network, by points, and the links that exist between the nodes (text data associated with a pair of nodes, such as an e-mail between two mail addresses) can be represented by mere binary edges that link the points. For instance, an adjacency matrix A=(Aij), doubly indexed by the nodes of the network, can be associated with the network. The value Aij is set to 1 when a link exists between nodes i and j, with a corresponding binary edge between the points of nodes i and j being displayed on the graphical representation. Conversely, the value Aij is set to 0 when no link exists, which translates, on the graphical representation of the network, into the absence of an edge between i and j.
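As an illustration of the binary adjacency matrix described above, the following minimal sketch builds A=(Aij) from a list of links. The function name `adjacency_matrix`, the integer node indexing and the toy `emails` data are assumptions made for this example only.

```python
def adjacency_matrix(n_nodes, links, directed=True):
    """Build A = (A_ij), with A[i][j] = 1 if a link exists from i to j, else 0."""
    A = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in links:
        A[i][j] = 1          # repeated links still yield a single binary edge
        if not directed:
            A[j][i] = 1      # an undirected edge is stored symmetrically
    return A

# Hypothetical toy data: (sender, recipient) pairs among 4 nodes.
emails = [(0, 1), (1, 2), (0, 1), (3, 0)]
A = adjacency_matrix(4, emails, directed=True)
for row in A:
    print(row)
```

Note that the two emails from node 0 to node 1 collapse into a single entry equal to 1: a binary edge records only that a link exists, not how many messages were exchanged or what they contained.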
Since statistical methods for the analysis of networks emerged about fifteen years ago, with the seminal work of Hoff et al. (2002)1, said methods have proven to be efficient and flexible techniques for network clustering. Most of those methods look for specific structures, so-called communities, which exhibit a transitivity property such that nodes of the same community are more likely to be connected (Hofman and Wiggins, 2008)2. An especially popular approach for community discovery uses the stochastic block model (SBM), which is a flexible random graph model. In this model, it is assumed that each vertex (each node) belongs to a latent group, and that the probability of connection between a pair of vertices (existence of an edge) depends exclusively on their respective groups. Because no specific assumption is made on the connection probabilities, various types of structures of vertices can be taken into account. Indeed, the SBM makes it possible to reveal communities, i.e. groups of densely connected nodes wherein each node tends to communicate more with the other nodes of the community than with nodes outside the community. The stochastic block model also makes it possible to reveal other types of subnetworks, such as star-shaped structures, wherein one node is frequently linked to a plurality of other nodes that are not necessarily frequently linked to each other, or even disassortative networks, wherein nodes that are dissimilar tend to connect more than nodes that are similar. The use of a stochastic block model to model networks was initiated by Nowicki and Snijders (2001)3.
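The generative assumption of the stochastic block model described above can be sketched as follows. This is only a simulation of the model, not an inference method, and the names `sample_sbm`, `group_props` and `conn_probs`, as well as all numerical values, are hypothetical choices made for illustration.

```python
import random

def sample_sbm(n_nodes, group_props, conn_probs, seed=0):
    """Draw latent groups, then directed binary edges, under an SBM.

    group_props[q]   : probability that a node belongs to group q
    conn_probs[q][r] : probability of an edge from a node of group q
                       to a node of group r
    """
    rng = random.Random(seed)
    n_groups = len(group_props)
    # Each node is assigned to exactly one latent group.
    groups = rng.choices(range(n_groups), weights=group_props, k=n_nodes)
    A = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(n_nodes):
            # The edge probability depends only on the groups of i and j.
            if i != j and rng.random() < conn_probs[groups[i]][groups[j]]:
                A[i][j] = 1
    return groups, A

# Hypothetical connection matrices for two kinds of structure.
communities = [[0.8, 0.05], [0.05, 0.8]]     # dense within groups
disassortative = [[0.05, 0.8], [0.8, 0.05]]  # dense between groups

groups, A = sample_sbm(10, [0.5, 0.5], communities, seed=42)
print(groups)
```

Because the connection matrix is unconstrained, swapping `communities` for `disassortative` changes the type of structure generated without changing the model itself, which is the flexibility referred to above.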
However, considering only the network information may not be sufficient to obtain meaningful clusters. It is known in the prior art to take into account information beyond the mere existence of a link between two nodes, such as the date of the text data that corresponds to a link (temporal edges) or the type of link (categorical edges). It is also known to weight the edges by the number of links between the nodes and/or by the preeminence of certain links over others.
Still, using only network information, without analyzing the corresponding text content, may be misleading in some cases, even with the use of the categorical edges mentioned above, which are a refinement of binary edges. As a motivating example, FIG. 1, which will be described in greater detail in the detailed description hereinafter, shows a network representation of the type mentioned above, wherein nodes taken from an exemplary textual network scenario described below are clustered into 3 “communities” obtained via a method using a stochastic block model (SBM). However, one of the communities in this exemplary scenario can in fact be split into two separate groups, based on the topics of communication between nodes internal to these two groups. A method of cluster inference which does not take into account the topics of discussion between the nodes cannot recover this sub-structure of said community into two separate groups. In this scenario, it would be highly beneficial to obtain a clustering of network vertices that takes into account the content of the textual edges, with a semantic analysis being carried out to recover the topics of discussion and thereby refine the clustering of the nodes. More generally, when a network analysis method relies only on the detection of binary edges, or refinements of binary edges, the textual content of the text data linking nodes of the network is not exploited at all for finding meaningful clusters.
Semantic Analysis of Text of Documents
Independently of network analysis, statistical modelling of texts appeared at the end of the last century for the semantic analysis of texts, with an early model of latent semantic indexing (LSI) developed by Papadimitriou et al. (1998)4, which makes it possible to recover linguistic notions such as synonymy and polysemy from term frequency within the text of documents. A first generative model of documents, called probabilistic latent semantic indexing (pLSI), was proposed by Hofmann (1999)5, wherein each word is generated from a single latent group known as a “topic”, and different words in the same document can be generated from different topics with different proportions.
Another model, known as latent Dirichlet allocation (LDA), was subsequently developed by Blei et al. (2003)6 and has rapidly become the standard tool in statistical text analytics. The idea of LDA is that documents are represented as random mixtures over latent topics, wherein each topic is characterized by a distribution over words. LDA is therefore similar to pLSI, except that the topic distribution in LDA has a Dirichlet prior. A limitation of LDA is its inability to take into account possible correlations between topics, which is due to the use of the Dirichlet distribution to model the variability among the topic proportions.
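The generative view of LDA summarized above (a document as a random mixture over topics, each topic being a distribution over words) can be sketched as follows. This is a simplified illustration under stated assumptions, not an implementation of LDA inference: the document length is fixed rather than random, and the names `sample_dirichlet` and `generate_document`, the vocabulary and the topic distributions are all hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha, topics, vocab, rng):
    """Generate one document under the LDA generative assumptions.

    alpha  : Dirichlet parameters of the topic proportions (one per topic)
    topics : topics[k][v] = probability of word vocab[v] under topic k
    """
    # Topic proportions of this document, drawn from the Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Each word first picks a topic, then a word from that topic.
        z = rng.choices(range(len(topics)), weights=theta, k=1)[0]
        words.append(rng.choices(vocab, weights=topics[z], k=1)[0])
    return theta, words

# Hypothetical vocabulary and two topics (values chosen for illustration).
vocab = ["gene", "protein", "vote", "election"]
topics = [[0.5, 0.5, 0.0, 0.0],   # a "biology" topic
          [0.0, 0.0, 0.5, 0.5]]   # a "politics" topic

rng = random.Random(0)
theta, words = generate_document(8, [0.5, 0.5], topics, vocab, rng)
print(theta, words)
```

The components of a Dirichlet draw are constrained to sum to one and carry no free covariance parameters, which is one way to see why topic correlations cannot be modelled with this prior.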
Joint Analysis of Network Structure and Content of Textual Links
Returning to the problem of obtaining a clustering of network vertices that takes into account the content of the textual edges, a few recent works have focused on the joint modelling of texts and networks. Those works are mainly motivated by the desire to analyze social networks, such as Twitter or Facebook, or electronic communication networks. Some of these models are partially based on a latent Dirichlet allocation (LDA) generative model of textual communications, especially the author-topic (AT) model (Steyvers et al., 2004; Rosen-Zvi et al., 2004)7 and the author-recipient-topic (ART) model (McCallum et al., 2005)8. The AT model extends LDA to include authorship information, whereas the ART model includes both authorship and information about the recipients. However, said models remain generative models of documents and do not make it possible to recover a network structure or a clustering of edges bound by text data.
An attempt at a model for joint analysis of text content and networks was made by Pathak et al. (2008)9, who extended the aforementioned ART model by introducing the community-author-recipient-topic (CART) model. The CART model adds to the ART model the assumption that authors and recipients belong to latent communities, which allows CART to recover groups of nodes that are homogeneous with regard to both the network structure and the message content. The CART model allows the nodes to be part of multiple communities and each pair of actors to have a specific topic. Thus, though extremely flexible, CART is also a highly parametrized model, which comes with an increased computational complexity. In addition, the recommended inference procedure, based on Gibbs sampling, may also prohibit its application to large networks.
Another model, known as topic-link LDA (Liu et al., 2009)10, also performs topic modelling and author community discovery in a unified framework. Topic-link LDA extends LDA with a community layer; the link between two documents (and consequently between their authors) depends on both topic proportions and author latent features. The authors derived an inference algorithm of the well-known variational expectation-maximization (VEM) type, allowing topic-link LDA to eventually be applied to large networks. However, a major limitation of the topic-link LDA model is that it is only able to deal with undirected networks. Finally, a family of four topic-user-community models (TUCM) was described by Sachan et al. (2012)11. The TUCM models are designed such that they can find “topic-meaningful” communities in networks with different types of edges. However, inference is done here through Gibbs sampling, which is a possible limitation since this method can only be applied to very limited sets of network data.
Besides, a major drawback of the aforementioned methods for joint analysis of texts and networks is that they are not able to recover the whole range of structures defined above, that is, not only communities but also star-shaped structures and disassortative clusters. Further, to the knowledge of the inventors, a complete implementation of a computer method using topic-link LDA in order to process text data of a network and infer a network structure has not yet been proposed.
Therefore, a need exists for a method for clustering nodes of a textual network, carrying out both network analysis (using information of the existence of text data associating a pair of nodes of the network) and semantic analysis (using information of topics inferred from the text data), and especially taking into account the content of the text data in order to characterize the clusters. Once it is detected that a certain node displays a certain behavior in terms of topics of discussion with other nodes, the needed method should seek to assign said node to a cluster which is consistent with its discussion behavior. This method must be of sufficient flexibility and reasonable computational complexity, must work for both directed and undirected networks, and must be highly interpretable and be able to operate on large-scale networks.