1. Field
The present disclosure relates to the fields of bioinformatics and health care and, more particularly, to clustering of sub-networks of a network based on a user-customizable similarity coefficient for bioinformatics and health care applications.
2. Description of Related Art
Recent progress in medical science, bioinformatics and biotechnology has led to the accumulation of tremendous amounts of biological data, such as gene expression data. Analysis and interpretation of this massive body of data is a challenging task. Moreover, with the advent of microarrays and next-generation biotechnological methods, the use of large amounts of gene expression data has become ubiquitous in biological research. For example, gene expression data can be used to generate various biological networks, such as a gene interaction network or a protein interaction network. Various bioinformatics studies propose analyzing the gene expression data at the level of groups of functionally related genes, such as pathways or sub-networks.
However, generating optimized sub-networks for better and more accurate analysis remains a challenging task. Some conventional methods generate sub-networks using algorithms that grow seeds (initial sub-networks) by means of term enrichment tests and scoring functions. Some existing sub-network generation algorithms grow the seeds by merging previously generated small sub-networks based on pre-defined neighborhood criteria. There are situations in which a scoring function returns no gain, or in which genes around the seed fail to satisfy the neighborhood criteria. In such situations, the resulting sub-networks are very small. Generating very small sub-networks terminates the sub-network generation process, resulting in a plurality of sub-networks that may not have any significance with respect to a particular desired similarity between them.
Parallel progress in data mining research provides efficient and scalable methods, such as clustering and pattern analysis, for mining interesting patterns and knowledge in large databases. Data mining techniques such as clustering can provide effective analysis of the gene expression data for various bioinformatics and health care applications. Clustering divides data of interest into a small number of relatively homogeneous groups. Clustering can be an effective tool in the analysis of the gene expression data at the sub-network level.
Hierarchical clustering algorithms, which determine successive clusters from previously established seed clusters, are a popular clustering approach. Conventional hierarchical clustering algorithms use distance metrics as clustering criteria. Hierarchical clustering algorithms based on distance metrics are best suited to numeric data and provide reliable results mostly for such data.
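By way of illustration only, distance-based hierarchical clustering of numeric data may be sketched as follows. The sketch assumes an agglomerative (bottom-up) scheme with Euclidean distance and single linkage; these choices, the function names and the sample points are hypothetical and are not taken from any particular existing method.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two numeric points of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def single_linkage(c1, c2):
    """Distance between two clusters: the closest pair of members."""
    return min(euclidean(p, q) for p in c1 for q in c2)


def agglomerate(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Every point starts as its own seed cluster.
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical numeric data: two well-separated groups.
sample_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
sample_clusters = agglomerate(sample_points, 2)
```

Because the merge criterion is a numeric distance, such a scheme has no natural counterpart for Boolean or categorical attributes unless a distance is first defined for them.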
Another existing hierarchical clustering algorithm, for Boolean and categorical data, uses links instead of distance metrics as the clustering criterion. The links capture the neighborhood-related information of the data. The higher the number of links, the higher the similarity between the data being compared. A link refers only to a direct link (i.e., a direct relation) existing between two data items or data sets being compared. The existing method fails to consider indirect links between the data being compared and thus maintains a rigid approach to clustering. However, many bioinformatics, health care and non-biological applications can provide effective analysis if the indirect relations between the data analyzed are given considerable weighting. Moreover, the weighting to be given to indirect relationships may vary based on the end application. Thus, flexibility in defining clustering criteria to better suit the particular application is desirable.
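By way of illustration only, the distinction between direct and indirect links described above may be sketched as follows. The sketch assumes that a direct link is an edge joining a node of one sub-network to a node of the other, and that an indirect link is a pair of such nodes that are not joined by an edge but share at least one common neighbor. The function names, the `indirect_weight` parameter and the gene labels are hypothetical and do not reproduce any existing method.

```python
from itertools import product


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) edge pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def direct_links(graph, a, b):
    """Count pairs (u in a, v in b) joined directly by an edge."""
    return sum(1 for u, v in product(a, b) if v in graph.get(u, set()))


def indirect_links(graph, a, b):
    """Count pairs (u in a, v in b) with no edge but a shared neighbor."""
    count = 0
    for u, v in product(a, b):
        nu = graph.get(u, set())
        nv = graph.get(v, set())
        if u != v and v not in nu and nu & nv:
            count += 1
    return count


def similarity(graph, a, b, indirect_weight=0.5):
    """Direct link count plus a weighted indirect link count.

    indirect_weight is an assumed, application-chosen parameter;
    a value of 0 reduces the score to direct links only.
    """
    return direct_links(graph, a, b) + indirect_weight * indirect_links(graph, a, b)


# Hypothetical chain of gene interactions: g1 - g2 - g3 - g4 - g5
example_graph = build_graph([("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g5")])
sub_a = {"g1", "g2"}
sub_b = {"g3", "g4"}
print(direct_links(example_graph, sub_a, sub_b))
print(indirect_links(example_graph, sub_a, sub_b))
print(similarity(example_graph, sub_a, sub_b, indirect_weight=0.5))
```

In this sketch, changing `indirect_weight` is the only adjustment needed to give indirect relations more or less influence for a given end application.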