Herebelow, numerals in brackets—[ ]—are keyed to the list of references found towards the end of the instant disclosure.
Clustering large datasets is a challenging data mining task with many real-life applications. Much research has been devoted to the problem of finding subspace clusters [2, 3, 4, 7, 11]. In this general area, the concept of clustering has further been extended to focus on pattern-based similarity [16]. Several research efforts have since studied clustering based on pattern similarity [17, 13, 12], as opposed to traditional value-based similarity.
These efforts generally represent a step forward in bringing the techniques closer to the demands of real-life applications, but at the same time, they have also introduced new challenges. For instance, the clustering models in use [16, 17, 13, 12] are often too rigid to find objects that exhibit meaningful similarity, and the lack of an efficient algorithm makes these models impractical for large-scale data. Accordingly, a need has been recognized in connection with providing a clustering model which is intuitive, is capable of capturing subspace pattern similarity effectively, and is conducive to an efficient implementation.
The concept of subspace pattern similarity is presented by way of example in FIGS. 1(a)-1(c). Shown are three objects. Here, the X axis represents a set of conditions, and the Y axis represents object values under those conditions. In FIG. 1(a), the similarity among the three objects is not visibly clear, until they are studied under two subsets of conditions. In FIG. 1(b), one finds the same three objects form a shifting pattern in subspace {b, c, h, j, e}, and in FIG. 1(c), a scaling pattern in subspace {f,d,a,g,i}.
Accordingly, objects should preferably be considered similar to each other as long as they manifest a coherent pattern in a certain subspace, regardless of whether their coordinate values in that subspace are close. This also means that many traditional distance functions, such as the Euclidean distance, cannot effectively discover such similarity.
A need has been recognized in connection with addressing the problems discussed above in at least three specific areas: e-Commerce (target marketing); automatic computing (time-series data clustering by pattern similarity); and bioinformatics (large scale scientific data analysis).
First, recommendation systems and target marketing are important applications in the e-Commerce area. In these applications, sets of customers/clients with similar behavior need to be identified so that one can predict customers' interests and make proper recommendations. One may consider the following example. Three viewers rate four movies of a particular type (action, romance, etc.) as (1, 2, 3, 6), (2, 3, 4, 7), and (4, 5, 6, 9), where 1 is the lowest and 10 is the highest score. Although the ratings given by the individuals are not close in value, the second and third viewers' ratings are simply the first viewer's ratings shifted upward by 1 and 3, respectively. These three viewers thus have coherent opinions on the four movies, which can be of tremendous benefit if optimally handled and analyzed.
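By way of illustration only, the coherence among the three viewers in the example above can be checked with a short Python sketch. The helper names `euclidean` and `shift_offset` are hypothetical and are introduced here solely for illustration; the sketch assumes a pure shifting pattern over all four movies and is not intended to represent the clustering model of the disclosure.

```python
import math

# Ratings from the example above: three viewers, four movies each.
ratings = [(1, 2, 3, 6), (2, 3, 4, 7), (4, 5, 6, 9)]


def euclidean(a, b):
    """Traditional value-based distance between two rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shift_offset(a, b):
    """Return the constant offset (b - a) if b is a shifted copy of a, else None.

    Hypothetical helper: two objects form a shifting pattern when their
    element-wise differences are identical across all conditions.
    """
    diffs = {y - x for x, y in zip(a, b)}
    return diffs.pop() if len(diffs) == 1 else None


# Compare every pair of viewers under both notions of similarity.
for i in range(len(ratings)):
    for j in range(i + 1, len(ratings)):
        d = euclidean(ratings[i], ratings[j])
        s = shift_offset(ratings[i], ratings[j])
        print(f"viewers {i + 1} and {j + 1}: distance={d}, shift={s}")
```

For these ratings, the pairwise Euclidean distances are 2.0, 6.0, and 4.0, so a value-based measure treats the viewers as increasingly dissimilar, whereas each pair differs by a single constant offset (1, 3, and 2, respectively), which is the coherence that pattern-based similarity is meant to capture.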
Next, scientific data sets usually involve many numerical columns. One such example is gene expression data. DNA micro-arrays are an important breakthrough in experimental molecular biology, for they provide a powerful tool for exploring gene expression on a genome-wide scale. By quantifying the relative abundance of thousands of mRNA transcripts simultaneously, researchers can discover new functional relationships among a group of genes [6, 9, 10].
Investigations show that, more often than not, several genes contribute to one disease. This motivates researchers to identify genes whose expression levels rise and fall coherently under a subset of conditions, that is, genes that exhibit fluctuations of a similar shape when conditions change [6, 9, 10]. Table 1 (all tables appear in the Appendix hereto) shows that three genes, VPS8, CYS3, and EFB1, respond to certain environmental changes coherently.
More generally, with the DNA micro-array as an example, it can be argued that the following queries are of interest in scientific data analysis.