The advent of high-throughput genomics technologies has resulted in massive amounts of diverse genome-scale data. Gene expression data, measured by microarrays or next-generation sequencing platforms, are the most prevalent data type available for biological data analysis. The Gene Expression Omnibus stores thousands of datasets, including independent experimental series with similar patient cohorts and experimental designs. As technologies advance, other data types become available, and together they offer complementary information on the same disease or biological phenomenon. The Cancer Genome Atlas (TCGA) has already gathered genome, transcriptome, and epigenome information for thousands of patients across more than 20 cancers. The challenge is to interpret these massive amounts of high-dimensional, heterogeneous data to gain insights into biological processes.
Disease subtyping is often the first step toward a better understanding of a disease or biological phenomenon. The goal is to detect unknown groups of patients based on intrinsic features, without external information. The disease subtyping problem involves two fundamental issues: 1) how to determine the number of clusters and assign patients to each group, and 2) how to combine complementary information to determine the final partitioning. The former often involves clustering mRNA expression data, which have a small sample size but very high dimensionality. This is still an important problem, since gene expression is one of the most prevalent data types available. The latter involves the integration of multi-omics data, such as mRNA expression, DNA methylation, and miRNA, for class discovery. With rapidly advancing technologies, more and more data types are available for the same set of patients, creating an increasing need to combine multi-omics data.
In functional genomics, agglomerative hierarchical clustering (HC) is a frequently used approach for clustering genes or samples that show similar expression patterns. HC provides a structural view of the data, which makes it appealing in exploratory data analysis. However, classical HC imposes a tree structure on the data that might not reflect the underlying structure, and it is highly sensitive to the metric used to assess similarity among elements. Partitional clustering methods, such as k-means, global k-means, and fuzzy variants of k-means, have been applied to the same problem. These methods provide clear cluster boundaries and tighter clusters, but they lack the visual appeal of HC. Another group of methods is based on neural networks, such as self-organizing maps (SOM), the Self-Organizing Tree Algorithm (SOTA), and the Dynamically Growing Self-Organizing Tree (DGSOT). Neural networks can be modeled as a collection of nodes with weighted interconnections, which can be learned adaptively. A common drawback of both k-means-based and neural-network-based methods is the need to specify the number of clusters beforehand.
Resampling-based methods have been proposed to determine the number of clusters. They assess the stability of the clustering results with respect to resampling variability. Arguably the state-of-the-art approach in this area is Consensus Clustering (CC), a general, model-independent, resampling-based methodology for class discovery, cluster validation, and visualization. Using sub-sampling, CC calculates the pair-wise similarities (the frequency with which two elements are grouped together) and their empirical cumulative distribution function (CDF). The pair-wise similarities are then used for visualization and for estimating the number of clusters. This approach has been widely used and has produced good results. The main assumption of CC is that if the samples were drawn from K distinct sub-populations that truly exist, different sub-samples would show the greatest level of stability at the true K. Unfortunately, this assumption can lead CC to claim apparent structure when there is none, or to declare cluster stability when the stability is subtle.
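The pair-wise similarity and CDF computation described above can be illustrated with a minimal sketch. It is not the reference implementation of CC: it assumes plain k-means as the base clustering algorithm, an 80% sub-sampling rate, and 50 resamples, and the function names (`kmeans`, `consensus_matrix`, `empirical_cdf`) are hypothetical helpers introduced only for this illustration.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means on a list of tuples; returns one label per point."""
    centers = rng.sample(points, k)  # initialise centers at k sampled data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each non-empty center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def consensus_matrix(points, k, n_resamples=50, frac=0.8, seed=0):
    """Fraction of sub-samples in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]  # times i and j were co-clustered
    sampled = [[0] * n for _ in range(n)]   # times i and j were both sub-sampled
    for _ in range(n_resamples):
        idx = rng.sample(range(n), int(frac * n))
        labels = kmeans([points[i] for i in idx], k, rng)
        for (a, la), (b, lb) in combinations(list(zip(idx, labels)), 2):
            sampled[a][b] += 1
            sampled[b][a] += 1
            if la == lb:
                together[a][b] += 1
                together[b][a] += 1
    matrix = [[1.0] * n for _ in range(n)]  # a point always agrees with itself
    for i, j in combinations(range(n), 2):
        value = together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
        matrix[i][j] = matrix[j][i] = value
    return matrix


def empirical_cdf(matrix, grid):
    """Empirical CDF of the off-diagonal consensus values, evaluated on grid."""
    vals = [matrix[i][j] for i, j in combinations(range(len(matrix)), 2)]
    return [sum(v <= c for v in vals) / len(vals) for c in grid]


# Toy data (an assumption for illustration): two well-separated 1-D groups.
points = [(x / 10,) for x in range(5)] + [(10 + x / 10,) for x in range(5)]
matrix = consensus_matrix(points, k=2)
cdf = empirical_cdf(matrix, grid=[0.0, 0.5, 1.0])
```

For clearly separated groups such as this toy example, the consensus values concentrate at 0 and 1, so the CDF is flat in between. A CC-style analysis would repeat the computation for several candidate values of K and compare the resulting CDFs; that comparison step is omitted here.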
The goal of an integrative analysis is to identify subgroups of samples that are similar not only at one level (e.g., mRNA), but from a holistic perspective that takes into consideration phenomena at various other levels (e.g., DNA methylation, miRNA). One strategy is to analyze each data type independently before combining the results. A drawback of this approach is that it might lead to inconsistent conclusions that are hard to integrate. Another approach is to use machine learning techniques. However, these methods do not scale to the full spectrum of measurements, which makes them sensitive to the gene selection step. One recent approach, Similarity Network Fusion (SNF), creates a network of patients for each data type and then fuses the networks using a metric fusion technique developed for image processing applications. The fused network is then partitioned using spectral clustering. The unstable nature of spectral clustering and of the metric fusion technique makes the method sensitive to its parameters. In addition, this method is not designed to solve the clustering problem when only one data type is available.