The invention relates in general to the field of computer-implemented methods for identifying, managing and displaying a large set of relationships between entities. In particular, it relates to co-clustering methods.
Graphs are a popular data representation for modeling relationships, connections, etc., between entities. For example, bi-partite graphs have been the focus of a broad spectrum of studies spanning from document analysis to bioinformatics. A bi-partite graph paradigm may indeed be relied upon to represent various kinds of relationships, e.g., between parts of a complex computer-aided design (CAD) object, between real-world objects and their attributes, etc., or even to represent data acquisition patterns between sets of processor cores and sets of data. Analysis of such related data is therefore of great importance for many companies, which accumulate increasingly large amounts of interaction data.
One common approach involves the identification of groups of objects or entities that share common properties, have similar attribute values, etc. The availability of such information is advantageous in many respects: patterns can be detected, and improper relations can be repaired or even anticipated.
Studies have suggested that matrix-based representations are more suitable and offer “superior readability” compared to node-link representations, particularly when analyzing large numbers of subjects/variables. In some cases, one is interested in visualizing thousands of subjects and several dozen to hundreds of variables. A matrix representation can therefore advantageously be adopted for bi-partite graphs. Given a matrix data representation, the problem of simultaneous group discovery across two data dimensions can be mapped to a matrix co-clustering instance. The goal is to reveal the latent structure of a seemingly unordered matrix. This is achieved by discovering a permutation of matrix rows and columns, and a respective grouping, such that the resulting matrix is as homogeneous as possible. In a typical setting as contemplated herein, the rows represent the subjects (CAD objects or parts, cores, etc.) and the columns identify the variables (other entities to which the subject entities relate, attribute values, data accessed by a given processor, etc.).
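By way of illustration only, the row and column permutation described above can be sketched in Python as follows. The sketch assumes that cluster labels for rows and columns are already available; how such labels are obtained is not addressed here. The function name reorder_by_clusters and the toy matrix are hypothetical and introduced solely for this example.

```python
from typing import List, Sequence, Tuple


def reorder_by_clusters(
    matrix: Sequence[Sequence[int]],
    row_labels: Sequence[int],
    col_labels: Sequence[int],
) -> Tuple[List[List[int]], List[int], List[int]]:
    """Permute rows and columns so that members of a same cluster are adjacent.

    The labels are assumed to be given; this function only performs the
    reordering step, not the discovery of the clusters.
    """
    # Stable sort of the indices by cluster label yields the permutations.
    row_perm = sorted(range(len(row_labels)), key=lambda i: row_labels[i])
    col_perm = sorted(range(len(col_labels)), key=lambda j: col_labels[j])
    reordered = [[matrix[i][j] for j in col_perm] for i in row_perm]
    return reordered, row_perm, col_perm


# Hypothetical toy data: a seemingly unordered binary matrix hiding two blocks.
matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
row_labels = [0, 1, 0, 1]
col_labels = [0, 1, 0, 1]

reordered, row_perm, col_perm = reorder_by_clusters(matrix, row_labels, col_labels)
for row in reordered:
    print(row)
```

In this toy case the reordered matrix exhibits two homogeneous rectangular blocks along the diagonal, which is the kind of latent structure a co-clustering is meant to reveal.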
Presently, techniques for matrix co-clustering are predominantly based either on hierarchical clustering or on spectral clustering principles. As we discuss in more detail later on, both approaches exhibit limited scalability. The aim here is to provide a highly scalable approach that supports the analysis of thousands of graph nodes, and can easily drive interactive visual interfaces.
The principle of co-clustering was first introduced by Hartigan with the goal of ‘clustering cases and variables simultaneously’. Initial applications were for the analysis of voting data. Since then, several co-clustering algorithms have been proposed, broadly falling into two classes, based on: a) hierarchical clustering, and b) spectral clustering.
Agglomerative hierarchical clustering approaches are widely used in biological and medical sciences. In this setting, co-clustering also appears under the term ‘bi-clustering’. One application is for the analysis of gene expression profiles. Columns and rows of an expression profile matrix are sorted using the relative orders of the leaves of the corresponding dendrograms constructed for genes and for arrays. The reordering of the dendrogram leaf objects is called seriation. Hierarchical clustering approaches can lead to discovery of very compact clusters. However, this comes at a high runtime complexity, ranging from O(n²) to O(n² log² n), where n is the number of objects, depending on the agglomeration process. Therefore, their applicability is limited to data instances that typically do not exceed several hundred objects. Such approaches are deemed prohibitive, even for today's computers, if one considers interactive response times.
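For illustration, a leaf ordering of the kind used for seriation can be sketched with a naive agglomerative procedure. The sketch below is an assumption-laden toy, not any published bi-clustering method: it uses average linkage with Hamming distance, recomputes all pairwise distances at every merge (which makes it even slower than the complexities quoted above), and fixes the leaf order by simple concatenation, whereas a dendrogram generally admits many valid leaf orders. The names hamming and leaf_order are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def leaf_order(vectors: Sequence[Sequence[int]]) -> List[int]:
    """Naive average-linkage agglomeration; returns one dendrogram leaf order."""
    clusters: List[List[int]] = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:

        def avg_dist(pair):
            a, b = pair
            total = sum(
                hamming(vectors[i], vectors[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            return total / (len(clusters[a]) * len(clusters[b]))

        # Merge the two closest clusters; concatenation fixes the leaf order.
        a, b = min(combinations(range(len(clusters)), 2), key=avg_dist)
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0] if clusters else []


# Hypothetical toy data standing in for an expression profile matrix.
matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
columns = [list(col) for col in zip(*matrix)]

row_order = leaf_order(matrix)    # one dendrogram built over the rows
col_order = leaf_order(columns)   # a second one built over the columns
seriated = [[matrix[i][j] for j in col_order] for i in row_order]
for row in seriated:
    print(row)
```

Because every merge step inspects all remaining cluster pairs, the cost grows quickly with n, which gives an intuition for why such procedures become impractical for interactive use on large matrices.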
Spectral co-clustering approaches view the co-clustering problem as an instance of graph partitioning. Essentially, the problem is relegated to an eigenvector computation. Spectral clustering approaches are powerful for detecting non-linear cluster relationships (e.g., concentric circles). However, for some cases, including those contemplated here, one is interested in detecting rectangular clusters; hence, it can be realized that computationally simpler techniques may also discover the existence of rectangular co-clusters. The complexity of spectral approaches is in the order of O(n log² n). Recent works report a runtime of several seconds for a few thousand objects; as such, their usefulness is typically limited to small data instances (fewer than 10⁴ nodes).
In recent years, approaches have appeared that view co-clustering from a pure optimization perspective and make cluster assignments using an information-theoretic objective function. Under this formulation, the optimal co-clustering is the one that maximizes the mutual information between the clustered random variables.
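As a minimal sketch of such an objective, the following Python code computes the mutual information (in bits) between the row-cluster and column-cluster variables for a given assignment. It assumes, for the purpose of the example only, that the non-negative matrix is normalized into a joint distribution over (row, column) pairs; the function name and the toy data are hypothetical, and the search over assignments is not shown.

```python
import math
from typing import Dict, Sequence, Tuple


def coclustering_mutual_information(
    matrix: Sequence[Sequence[float]],
    row_labels: Sequence[int],
    col_labels: Sequence[int],
) -> float:
    """Mutual information between row-cluster and column-cluster variables."""
    total = sum(sum(row) for row in matrix)
    if total <= 0:
        raise ValueError("matrix must contain positive mass")

    # Aggregate the normalized matrix into a cluster-level joint distribution.
    joint: Dict[Tuple[int, int], float] = {}
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            key = (row_labels[i], col_labels[j])
            joint[key] = joint.get(key, 0.0) + value / total

    # Marginal distributions of the two clustered variables.
    row_marg: Dict[int, float] = {}
    col_marg: Dict[int, float] = {}
    for (r, c), p in joint.items():
        row_marg[r] = row_marg.get(r, 0.0) + p
        col_marg[c] = col_marg.get(c, 0.0) + p

    # Sum of p(r, c) * log2(p(r, c) / (p(r) * p(c))) over non-zero cells.
    return sum(
        p * math.log2(p / (row_marg[r] * col_marg[c]))
        for (r, c), p in joint.items()
        if p > 0
    )


# Hypothetical toy data: two hidden blocks in an interleaved ordering.
matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
good = coclustering_mutual_information(matrix, [0, 1, 0, 1], [0, 1, 0, 1])
bad = coclustering_mutual_information(matrix, [0, 0, 1, 1], [0, 0, 1, 1])
print(good, bad)
```

On this toy matrix, the assignment that matches the hidden blocks scores higher than the one that cuts across them, which is the behavior an information-theoretic objective relies on to rank candidate co-clusterings.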
In the field of visualization, several techniques have been proposed for visualizing bipartite graphs. Such approaches usually do not involve co-clustering.
Finally, there exist approaches that encapsulate hybrid visualization methods, using a combination of matrix and node-link techniques, so as to accommodate a more holistic graph exploration experience.