1. Ensemble Embedding
The analysis and classification of high-dimensional biomedical data has been significantly facilitated by dimensionality reduction techniques, which allow classifier schemes to overcome issues such as the curse of dimensionality: the problem that arises when the number of variables (features) is disproportionately large compared to the number of training instances (objects) (Bellman R: Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961). Dimensionality reduction (DR) involves the projection of data originally represented in an N-dimensional (N-D) space into a lower n-dimensional (n-D) space (known as an embedding) such that n<<N. DR techniques are broadly categorized as linear or non-linear, based on the type of projection method used.
Linear DR techniques use simple linear projections and consequently linear cost functions. An example of a linear DR scheme is Principal Component Analysis (PCA) (Jolliffe I: Principal Component Analysis. Springer, 2002), which projects data objects onto the axes of maximum variance. However, maximizing the variance within the data best preserves class discrimination only when distinct separable clusters are present within the data, as shown in Lin, T. et al. (IEEE Transactions on Pattern Analysis and Machine Intelligence 30(5):796-809, 2008). In contrast, non-linear DR involves a non-linear mapping of the data into a reduced dimensional space. Typically these methods attempt to project data so that relative local adjacencies between high-dimensional data objects, rather than some global measure such as variance, are best preserved during data reduction from N- to n-D space (Lee, G. et al., IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(3):1-17, 2008). This tends to better retain class-discriminatory information and may also account for any non-linear structures that exist in the data (such as manifolds), as illustrated in Saul, L. et al. (Journal of Machine Learning Research, 4:119-155, 2003). Examples of these techniques include locally linear embedding (LLE) (Saul, L. and Roweis, S., Journal of Machine Learning Research, 4:119-155, 2003), graph embedding (GE) (Shi, J. and Malik, J., IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8):888-905, 2000), and isometric mapping (ISOMAP) (Tenenbaum, J. et al., Science 290(5500):2319-2323, 2000). Recent work has shown that in several scenarios, classification accuracy may be improved via the use of non-linear DR schemes (rather than linear DR) for gene-expression data (Lee, G. et al., IEEE/ACM Transactions on Computational Biology and Bioinformatics, 5(3):1-17, 2008; Dawson K, et al., BMC Bioinformatics, 6:195, 2005) as well as medical imagery (Madabhushi, A. et al., In Proc. 8th Int'l Conf. Medical Image Computing and Computer-Assisted Intervention (MICCAI), Volume 8(1), 729-737, 2005; Varini, C. et al., Biomedical Signal Processing and Control 1:56-63, 2006).
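To make the linear case concrete, PCA's projection onto the axes of maximum variance can be sketched in a few lines of NumPy. This is a minimal illustration on toy data, not the implementation used in any of the cited works:

```python
import numpy as np

def pca_embed(X, n_dims=2):
    """Project N-D data onto its n_dims axes of maximum variance (PCA).

    Rows of X are objects, columns are features.
    """
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_dims]]
    return Xc @ top                         # the n-D embedding

# Toy data: 100 objects in 10-D, with most variance along the first axis
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10)) * np.array([5] + [1] * 9)
Y = pca_embed(X, n_dims=2)
print(Y.shape)  # (100, 2)
```

The first embedding coordinate captures the dominant variance direction; as the text notes, however, high variance need not coincide with class-discriminatory structure.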
However, typical DR techniques such as Principal Component Analysis (PCA), graph embedding (GE), or locally linear embedding (LLE) may not guarantee an optimal result, for one or both of the following reasons: (1) Noise in the original N-D space tends to adversely affect class discrimination, even if robust features are used (as shown in Quinlan J: The effect of noise on concept learning. In Machine Learning: An Artificial Intelligence Approach. Edited by Michalski R S, Carbonell J G, Mitchell T M, Morgan Kaufmann, 149-166, 1986), and a single DR projection may fail to account for such artifacts (as shown in Balasubramanian, M. et al., Science, 295(5552):7a, 2002; Chang, H. and Yeung, D.: Robust locally linear embedding. Pattern Recognition 39(6):1053-1065, 2006); (2) Sensitivity to the choice of parameters specified during projection; e.g., it was shown in Shao, C. et al. (Dianzi Xuebao (Acta Electronica Sinica) 34(8):1497-1501, 2006) that varying the neighborhood parameter in ISOMAP can lead to significantly different embeddings.
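The parameter sensitivity noted in point (2) is easy to reproduce. The sketch below (illustrative only: a toy 3-D curve, a symmetrized k-nearest-neighbor graph, and a bare-bones Floyd-Warshall shortest-path step) shows that ISOMAP's geodesic distance estimates, and hence its embeddings, change with the neighborhood parameter:

```python
import numpy as np

def geodesic_distances(X, k):
    """Approximate geodesic distances via a k-nearest-neighbor graph,
    as in ISOMAP's first two steps (illustrative sketch only)."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # Euclidean distances
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]   # the k nearest neighbors of point i
        G[i, nbrs] = D[i, nbrs]
        G[nbrs, i] = D[i, nbrs]            # keep the graph symmetric
    for m in range(n):                     # Floyd-Warshall shortest paths
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G

# Points on a noisy 1-D curve embedded in 3-D
rng = np.random.default_rng(1)
t = np.linspace(0, 3 * np.pi, 40)
X = np.c_[np.cos(t), np.sin(t), t] + 0.05 * rng.normal(size=(40, 3))

G3, G10 = geodesic_distances(X, k=3), geodesic_distances(X, k=10)
# Different neighborhood parameters yield different geodesic estimates,
# and therefore different final ISOMAP embeddings
print(np.allclose(G3, G10))  # False
```

Larger neighborhoods admit "shortcut" edges across the curve, so the two distance matrices (and any embeddings derived from them) disagree.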
1.1 Classifier and Clustering Ensembles
Researchers have attempted to address problems of classifier sensitivity to noise and choice of parameters via the development of classifier ensemble schemes, such as Boosting (Freund Y, Schapire R: A decision-theoretic generalization of on-line learning and an application to boosting. In Proc. 2nd European Conf. Computational Learning Theory, Springer-Verlag, 23-37, 1995) and Bagging (Breiman L: Bagging predictors. Machine Learning 24(2): 123-140, 1996). These classifier ensembles guarantee a lower error rate than any of the individual members (known as “weak” classifiers), assuming that the individual weak classifiers are all uncorrelated (Dietterich T: Ensemble Methods in Machine Learning. In Proc. 1st Int'l Workshop on Multiple Classifier Systems, Springer-Verlag, 1-15, 2000). Similarly, a consensus-based algorithm has been presented (Fred, A. and Jain, A., IEEE Transactions on Pattern Analysis and Machine Intelligence 27(6):835-850, 2005) to find a stable unsupervised clustering of data using unstable methods such as k-means (MacQueen J: Some Methods for Classification and Analysis of Multivariate Observations. In Proc. Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 281-297, 1967). Multiple “uncorrelated” clusterings of the data were generated and used to construct a co-association matrix based on the cluster membership of all the points in each clustering; naturally occurring partitions in the data were then identified. This idea was extended further by Fern and Brodley (Fern, X. and Brodley, C.: Random Projection for High Dimensional Data Clustering: A Cluster Ensemble Approach. In Proc. 20th Int'l Conf. Machine Learning, 186-193, 2003), who considered a combination of clusterings based on simple linear transformations of high-dimensional data.
Therefore, ensemble techniques (1) make use of uncorrelated, or relatively independent, analyses (such as classifications or projections) of the data, and (2) combine multiple analyses (such as classifications or projections) to enable a more stable result.
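The co-association idea described above can be sketched as follows. The bare-bones k-means and the two-blob toy data are illustrative assumptions, not the implementation of the cited work:

```python
import numpy as np

def kmeans_labels(X, k, rng, iters=20):
    """Bare-bones k-means with random initialization, returning labels."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated Gaussian blobs in 2-D
rng = np.random.default_rng(2)
X = np.r_[rng.normal(0, 0.3, (25, 2)), rng.normal(3, 0.3, (25, 2))]

# Combine many (unstable) clusterings into a co-association matrix:
# entry (i, j) counts how often objects i and j share a cluster
runs = 15
C = np.zeros((50, 50))
for _ in range(runs):
    labels = kmeans_labels(X, k=2, rng=rng)
    C += labels[:, None] == labels[None, :]
C /= runs

# Objects in the same natural partition co-cluster consistently
print(C[:25, :25].mean() > C[:25, 25:].mean())  # True
```

Thresholding or re-clustering the co-association matrix then recovers the naturally occurring partitions, even though any single k-means run may be unstable.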
1.2 Improved Dimensionality Reduction (DR) Schemes to Overcome Parameter Sensitivity
As shown by Tenenbaum, J. et al. (Science 290(5500):2319-2323, 2000), linear DR methods such as classical multi-dimensional scaling (Venna J and Kaski S: Local multidimensional scaling. Neural Networks 19(6):889-899, 2006) are unable to account for non-linear proximities and structures when calculating an embedding that best preserves pairwise distances between data objects. This led to the development of non-linear DR methods such as LLE (Saul L and Roweis S, Journal of Machine Learning Research, 4:119-155, 2003) and ISOMAP (Tenenbaum J et al., Science 290(5500):2319-2323, 2000), which make use of local neighborhoods to better calculate such proximities. However, these DR methods are known to suffer from certain shortcomings (e.g., sensitivity to noise and/or changes in parameters), and a number of techniques have recently been proposed to overcome them. Samko, O. et al. (Pattern Recognition Letters 27(9):968-979, 2006) and Kouropteva, O. et al. (In Proc. 1st Int'l Conf. Fuzzy Systems and Knowledge Discovery, 359-363, 2002) proposed methods to choose the optimal neighborhood parameter for ISOMAP and LLE, respectively. This was done by first constructing multiple embeddings based on an intelligently selected subset of parameter values, and then choosing the embedding with the minimum residual variance. Attempts have been made to overcome problems due to noisy data by selecting data objects known to be most representative of their local neighborhood (landmarks) in ISOMAP (de Silva V, Tenenbaum J: Global Versus Local Methods in Nonlinear Dimensionality Reduction. In Proc. Adv. Neural Information Processing Systems (NIPS), Volume 15, MIT Press, 705-712, 2003), or by estimating neighborhoods in LLE via selection of data objects that are unlikely to be outliers (noise) (Chang H and Yeung D: Robust locally linear embedding. Pattern Recognition, 39(6):1053-1065, 2006).
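The minimum-residual-variance selection step used by Samko et al. and Kouropteva et al. can be sketched generically as follows. Here, PCA embeddings of varying target dimensionality stand in for the ISOMAP/LLE embeddings built over a neighborhood-parameter subset; the data and candidate set are toy assumptions:

```python
import numpy as np

def residual_variance(D_high, Y):
    """1 - R^2 between high-D distances and embedding distances."""
    D_low = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    r = np.corrcoef(D_high.ravel(), D_low.ravel())[0, 1]
    return 1.0 - r ** 2

def pca_embed(X, n_dims):
    """Linear DR stand-in used to generate the candidate embeddings."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_dims].T

# Candidate embeddings over a parameter subset (here, target dimensionality);
# the candidate with minimum residual variance is retained
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 8))
D_high = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
candidates = {n: pca_embed(X, n) for n in (1, 2, 4)}
best = min(candidates, key=lambda n: residual_variance(D_high, candidates[n]))
print(best)  # 4 -- more retained dimensions preserve pairwise distances better
```

The same selection rule applies unchanged when the candidates are ISOMAP or LLE embeddings indexed by the neighborhood parameter rather than by dimensionality.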
Similarly, graph embedding (GE) has also been explored with respect to issues such as the scale of analysis and determining accurate groups in the data (Zelnik-Manor L, Perona P: Self-tuning spectral clustering. In Proc. Adv. Neural Information Processing Systems (NIPS), Volume 17, MIT Press, 1601-1608, 2004).
However, all of these methods require an exhaustive search of the parameter space in order to best solve the specific problem being addressed. Alternatively, one may utilize class information within the supervised variants (Geng X et al., IEEE Transactions on Systems, Man, and Cybernetics: Part B, Cybernetics 35(6):1098-1107, 2005; de Ridder, D. et al., In Proc. Artificial Neural Networks and Neural Information Processing, 333-341, 2003) of ISOMAP and LLE, which attempt to construct weighted neighborhood graphs that explicitly preserve class information while embedding the data.
1.3 Learning in the Context of Dimensionality Reduction
The application of classification theory to dimensionality reduction (DR) has begun to be explored recently. Athitsos et al. presented a nearest neighbor retrieval method known as BoostMap (Athitsos V et al., IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:89-104, 2008), in which distances from different reference objects are combined via boosting. The problem of selecting and weighting the most relevant distances to reference objects was posed in terms of classification in order to utilize the AdaBoost algorithm (Freund Y and Schapire R: A decision-theoretic generalization of on-line learning and an application to boosting. In Proc. 2nd European Conf. Computational Learning Theory, Springer-Verlag, 23-37, 1995), and BoostMap was shown to improve the accuracy and speed of overall nearest neighbor discovery compared to traditional methods. DR has also previously been formulated in terms of maximizing entropy (Lawrence N: Spectral Dimensionality Reduction via Maximum Entropy. In Proc. 14th Int'l Conf. Artificial Intelligence and Statistics (AISTATS), Volume 15, 51-59, 2011) or via a simultaneous dimensionality reduction and regression methodology involving Bayesian mixture modeling (Mao, K. et al.: Supervised Dimension Reduction Using Bayesian Mixture Modeling. In Proc. 13th Int'l Conf. Artificial Intelligence and Statistics (AISTATS), Volume 9, 501-508, 2010). The goal in such methods is to probabilistically estimate the relationships between points based on objective functions that are dependent on the data labels (Lawrence N: Spectral Dimensionality Reduction via Maximum Entropy. In Proc. 14th Int'l Conf. Artificial Intelligence and Statistics (AISTATS), Volume 15, 51-59, 2011). These methods have been demonstrated in the context of applying PCA to non-linear datasets (Mao, K. et al.: Supervised Dimension Reduction Using Bayesian Mixture Modeling. In Proc. 13th Int'l Conf. Artificial Intelligence and Statistics (AISTATS), Volume 9, 501-508, 2010). More recently, investigators using multi-view learning algorithms (Blum A and Mitchell T: Combining labeled and unlabeled data with co-training. In Proc. 11th Annual Conf. Computational Learning Theory, 92-100, 1998; Hou C et al.: Multiple view semi-supervised dimensionality reduction. Pattern Recognition 43(3):720-730, 2009) have attempted to improve the learning ability of a system by considering several disjoint subsets of features (views) of the data. In Hou et al., given that a hidden pattern exists in a dataset, different views of the data are each embedded and transformed such that known domain information (encoded via pairwise link constraints) is preserved within a common frame of reference. The authors then solve for a consensus pattern, which is considered the best approximation of the underlying hidden pattern. A similar idea was examined in (Wachinger C et al.: Manifold Learning for Image-Based Breathing Gating with Application to 4D Ultrasound. In Proc. 13th Int'l Conf. Medical Image Computing and Computer-Assisted Intervention (MICCAI), Volume 6362, 26-33, 2010; Wachinger C et al.: Manifold Learning for Multi-Modal Image Registration. In Proc. 11th British Machine Vision Conference (BMVC), 82.1-82.12, 2010), where 1D projections of image data were co-registered in order to better perform operations such as image-based breathing gating and multi-modal registration. Such algorithms involve explicit transformations of embedding data to a target frame of reference, and are semi-supervised in that they encode specific link constraints in the data.
2. Enhanced Multi-Protocol Analysis Via Intelligent Supervised Embedding (EMPrAvISE)
Quantitative integration of multi-channel (modalities, protocols) information allows for the construction of sophisticated meta-classifiers for identification of disease presence (Lee, G. et al., Proc. ISBI, 2009, 77-80; Viswanath, S. et al., SPIE Medical Imaging: Computer-Aided Diagnosis, 2009, 7260: 726031). Such multi-channel meta-classifiers have been shown to perform significantly better than any individual data channel (Lee, G. et al., Proc. ISBI, 2009, 77-80). Intuitively, this is because the different channels of information each capture complementary sets of information. For example, the detection accuracy and qualitative characterization of prostate cancer (CaP) in vivo has been shown to improve significantly when multiple magnetic resonance imaging (MRI) protocols are considered in combination, as compared to using individual imaging protocols. These protocols include: (1) T2-weighted MRI (T2w), capturing high-resolution anatomical information; (2) Dynamic Contrast Enhanced MRI (DCE), characterizing micro-vascular function via uptake and washout of a paramagnetic contrast agent; and (3) Diffusion Weighted Imaging MRI (DWI), capturing water diffusion restriction via an Apparent Diffusion Coefficient (ADC) map. DCE and DWI MRI represent functional information, which complements the structural information from T2w MRI (Kitajima, K. et al., Magn Reson Imaging, 2010, 31(3), 625-631).
Several significant challenges arise in quantitatively integrating multi-parametric (T2w, DCE, DWI) MRI to construct a meta-classifier to detect prostate cancer (CaP). First, the issue of data alignment needs to be addressed, in order to bring the multiple channels of information (T2w, DCE, and DWI MRI) into the same spatial frame of reference, as explained for example by Viswanath et al. (Viswanath, S. et al., “Integrating structural and functional imaging for computer assisted detection of prostate cancer on multi-protocol in vivo 3 Tesla MRI,” in [SPIE Medical Imaging: Computer-Aided Diagnosis], 2009, 7260: 726031). This can be done via image registration techniques, described for example in (Madabhushi, A. et al., “Combined Feature Ensemble Mutual Information Image Registration,” U.S. Patent Publication number US 2010/0177944 A1). Such image registration techniques account for differences in resolution amongst the different protocols. Post-alignment, the second challenge, knowledge representation, requires quantitative characterization of disease-pertinent information. Towards this end, textural and functional image feature extraction schemes previously developed in the context of multi-parametric MRI may be employed, such as described by Viswanath et al. (Viswanath, S. et al., “Integrating structural and functional imaging for computer assisted detection of prostate cancer on multi-protocol in vivo 3 Tesla MRI,” in [SPIE Medical Imaging: Computer-Aided Diagnosis], 2009, 7260: 726031) and Madabhushi et al. (Madabhushi, A. et al., “Automated Detection of Prostatic Adenocarcinoma from High-Resolution Ex Vivo MRI,” IEEE Transactions on Medical Imaging, 2005, 24(12), 1611-1625).
The final step, data fusion, involves some combination of the extracted quantitative descriptors to construct the integrated meta-classifier. Dimensionality reduction (DR), as described by Shi et al. (Shi, J. and Malik, J., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888-905), has been shown to be useful for such quantitative fusion, as further described by Viswanath et al. (Viswanath, S. et al., “A Comprehensive Segmentation, Registration, and Cancer Detection Scheme on 3 Tesla In Vivo Prostate DCE-MRI,” in [Proc. MICCAI], 2008, 662-669). DR allows construction of a lower-dimensional embedding space, which accounts for differences in scale between the different protocols while avoiding the curse of dimensionality. While the image descriptors are divorced from their physical meaning in the embedding space (embedding features are not readily interpretable), relevant class-discriminatory information is largely preserved, as described by Lee et al. (Lee, G. et al., IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2008, 5(3): 1-17). This makes DR suitable for multi-parametric classification.
Multi-modal data fusion strategies may be categorized as combination of data (COD) (where the information from each channel is combined prior to classification), and combination of interpretations (COI) (where independent classifications based on the individual channels are combined), as shown in FIG. 1. A COI approach has typically been shown to be sub-optimal, as inter-protocol dependencies are not accounted for, as described by Lee et al. (Lee, G. et al., Proc. ISBI, 2009, 77-80).
Thus, a number of COD strategies with the express purpose of building integrated quantitative meta-classifiers have recently been presented, including DR-based (Lee, G. et al., Proc. ISBI, 2009, 77-80), kernel-based (Lanckriet, G. et al., Pac Symp Biocomput, 2004, 300-311), and feature-based (Verma, R. et al., Academic Radiology, 2008, 15(8): 966-977) approaches.
Multi-kernel learning (MKL) schemes, such as described by Lanckriet et al. (Lanckriet, G. R. et al., “Kernel-based data fusion and its application to protein function prediction in yeast,” in [Pac Symp Biocomput], 2004, 300-311), represent and fuse multi-modal data based on the choice of kernel. One of the challenges with MKL schemes is identifying an appropriate kernel for a particular problem, followed by learning the associated weights. The most common approach for quantitative multi-parametric image data integration has involved concatenation of multi-parametric features, followed by classification in the concatenated feature space, as described by Verma et al. (Verma, R. et al., “Multiparametric Tissue Characterization of Brain Neoplasms and Their Recurrence Using Pattern Classification of MR Images,” Academic Radiology, 2008, 15(8): 966-977).
Chan et al. (Chan et al., Medical Physics, 2003, 30(6): 2390-2398) used a concatenation approach in combining texture features from multi-parametric (T2w, line-scan diffusion, T2-mapping) 1.5 T in vivo prostate MRI to generate a statistical probability map for CaP presence via a Support Vector Machine (SVM) classifier. A Markov Random Field-based algorithm, as described by Liu et al. (Liu, X. et al., IEEE Transactions on Medical Imaging, 2009, 28(6): 906-915), as well as variants of the SVM algorithm, as described by Artan et al. (Artan, Y. et al., IEEE Transactions on Image Processing, 2010, 19(9): 2444-2455) and Ozer et al. (Ozer, S. et al., Medical Physics, 2010, 37(4): 1873-1883), were utilized to segment CaP regions on multi-parametric MRI via concatenation of quantitative descriptors such as T2w intensity, pharmacokinetic parameters (from DCE), and ADC maps (from DWI).
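A minimal sketch of the concatenation (COD) strategy follows. The per-channel feature matrices are synthetic stand-ins for T2w, DCE, and DWI descriptors, and a nearest-centroid rule replaces the SVM and MRF classifiers used in the cited studies:

```python
import numpy as np

# Hypothetical per-channel feature matrices (e.g., T2w texture, DCE kinetics,
# DWI/ADC statistics) for the same 40 voxels; names are illustrative only
rng = np.random.default_rng(4)
labels = np.array([0] * 20 + [1] * 20)
t2w = rng.normal(labels[:, None], 1.0, (40, 5))   # weakly separable channels
dce = rng.normal(labels[:, None], 1.0, (40, 3))
dwi = rng.normal(labels[:, None], 1.0, (40, 2))

# COD strategy: concatenate channel features, then classify in the joint space
X = np.hstack([t2w, dce, dwi])                    # 40 x 10 concatenated space

def nearest_centroid(X_train, y_train, X_test):
    """Toy stand-in for the SVM classifiers used in the cited work."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)

pred = nearest_centroid(X, labels, X)             # resubstitution, for brevity
print((pred == labels).mean())
```

Although each channel alone is weakly separable, the concatenated space pools their evidence; note, however, that simple concatenation ignores the scale differences between channels that the DR-based fusion discussed below is designed to handle.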
Lee et al. (Lee, G. et al., Proc. ISBI, 2009, 77-80) proposed data representation and subsequent fusion of the different modalities in a “meta-space” constructed using DR methods such as Graph Embedding (GE), as described by Shi et al. (Shi, J. and Malik, J., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888-905). However, DR analysis of a high-dimensional feature space may not necessarily yield optimal results for multi-parametric representation and fusion, due to (a) noise in the original N-D space, which may adversely affect the embedding projection, or (b) sensitivity to the choice of parameters specified during DR. For example, GE is known to suffer from issues relating to the scale of analysis as well as to the choice of parameters used in the method, as described by Zelnik-Manor et al. (Zelnik-Manor, L. et al., “Self-tuning spectral clustering,” in [Advances in Neural Information Processing Systems], 2004, 17: 1601-1608, MIT Press). Varying these parameters can result in significantly different-appearing embeddings, with no way of determining which embedding is optimal for the purposes of multi-parametric data integration and classification. There is hence a clear need for a DR scheme that is less sensitive to the choice of parameters, while simultaneously providing a quantitative framework for multi-parametric data fusion and subsequent classification.
Researchers have attempted to address problems of sensitivity to noise and choice of parameters in the context of automated classification schemes via the development of classifier ensembles, as described by Freund et al. (Freund, Y. et al., “A decision-theoretic generalization of on-line learning and an application to boosting,” in [Proc. 2nd European Conf. Computational Learning Theory], 1995, 23-37, Springer-Verlag) and Breiman (Breiman, L., Machine Learning, 1996, 24(2): 123-140). These algorithms combine multiple “weak” classifiers to construct a “strong” classifier, which has an overall probability of error lower than that of any of the individual weak classifiers. Related work, which applies ensemble theory in the context of DR, has been presented by Hou et al. (Hou, C. et al., Pattern Recognition, 2009, 43(3): 720-730), involving a semi-supervised ensemble of DR representations within a multi-view learning framework for web data mining. Similarly, Athitsos et al. (Athitsos, V. et al., IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30(1): 89-104) employed an ensemble algorithm for nearest neighbor discovery via DR within a content retrieval system.
Significance of Ensemble Embedding
The described invention provides a dimensionality reduction (DR) scheme, known as ensemble embedding, that involves first generating and then combining multiple uncorrelated, independent (or base) n-D embeddings. These base embeddings may be obtained by applying either linear or non-linear DR techniques to a large N-D feature space. Techniques to generate multiple base embeddings are analogous to those for constructing classifier ensembles. In the latter, base classifiers with significant variance can be generated by varying the parameter associated with the classification method (e.g., k in kNN classifiers (Cover T and Hart P, IEEE Transactions on Information Theory, 13:21-27, 1967)) or by varying the training data (e.g., combining decision trees via Bagging (Breiman L: Bagging predictors. Machine Learning, 24(2):123-140, 1996)). Previously, a consensus method for LLE was examined (Tiwari P et al.: Consensus-locally linear embedding (C-LLE): application to prostate cancer detection on magnetic resonance spectroscopy. In Proc. 11th Int'l Conf. Medical Image Computing and Computer-Assisted Intervention (MICCAI), Volume 5242(2), 330-338, 2008) with the underlying hypothesis that varying the neighborhood parameter (κ) effectively generates multiple uncorrelated, independent embeddings for the purposes of constructing an ensemble embedding. The combination of such base embeddings for magnetic resonance spectroscopy data was found to result in a low-dimensional data representation that enabled improved discrimination of cancerous and benign spectra compared to any single application of LLE.
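One plausible instantiation of this hypothesis is sketched below: base embeddings are generated by varying the neighborhood parameter κ of a simple graph-embedding method, and combined by re-embedding the median of their pairwise distance matrices. The Laplacian eigenmap base learner and the median combination rule are illustrative assumptions, not the claimed method:

```python
import numpy as np

def laplacian_eigenmap(X, k, n_dims=2):
    """Simple graph-embedding base learner: k-NN adjacency, unnormalized
    Laplacian, smallest non-trivial eigenvectors (illustrative sketch)."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]
        W[i, nbrs] = W[nbrs, i] = 1.0      # symmetric binary adjacency
    L = np.diag(W.sum(1)) - W              # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:n_dims + 1]           # skip the constant eigenvector

def classical_mds(D, n_dims=2):
    """Embed a distance matrix via classical multi-dimensional scaling."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n    # double-centering matrix
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_dims]
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 6))

# Base embeddings obtained by varying the neighborhood parameter kappa
bases = [laplacian_eigenmap(X, k) for k in (4, 6, 8, 10)]

# Combine: take the median pairwise distance across base embeddings,
# then re-embed the consensus distance matrix
dists = [np.linalg.norm(Y[:, None] - Y[None, :], axis=-1) for Y in bases]
Y_ensemble = classical_mds(np.median(dists, axis=0))
print(Y_ensemble.shape)  # (40, 2)
```

The median step damps the influence of any single poorly parameterized base embedding, which is the intuition behind combining uncorrelated embeddings rather than trusting one choice of κ.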
The described invention considers an approach inspired by random forests (Ho T, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832-844, 1998), which in turn are a modification of the Bagging algorithm (Breiman L: Bagging predictors. Machine Learning 24(2):123-140, 1996): variations within the feature data are used to generate multiple embeddings, which are then combined via the ensemble embedding scheme of the present invention. Additionally, unlike most current DR approaches, which require tuning of associated parameters for optimal performance on different datasets, ensemble embedding offers a methodology that is not significantly sensitive to parameter choice or dataset type.
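The random-forest-inspired variation step can be sketched as follows: each base embedding is computed from a random subset of the feature space, and the resulting embeddings are related but not identical, which is precisely the variation an ensemble combination then exploits. PCA as the base DR method and the subset sizes are illustrative choices, not the claimed configuration:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 20))                 # 50 objects, 20 features

def pca_embed(X, n_dims=2):
    """Base DR learner (any linear or non-linear method could be used)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_dims].T

# Bagging / random-forest-style variation: each base embedding sees only a
# random subset of the feature space
dist_mats = []
for _ in range(10):
    cols = rng.choice(20, size=10, replace=False)
    Y = pca_embed(X[:, cols])
    D = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
    dist_mats.append(D.ravel())

# The base embeddings' distance structures are correlated but not identical
corr = np.corrcoef(dist_mats)
print(np.all(corr[np.triu_indices(10, 1)] < 0.999))  # True
```

Because no two feature subsets induce the same embedding, combining them (e.g., via a consensus distance matrix) can average out noise tied to any single subset.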
The described invention provides a method and system of classifying a digital image containing multi-parametric data derived from a biological sample, by representing and fusing the multi-parametric data via a multi-protocol analysis using an intelligent supervised embedding (EMPrAvISE) scheme that uses a DR method referred to herein as “ensemble embedding.” The method and system construct a single stable embedding by generating and combining multiple uncorrelated, independent embeddings derived from the multi-parametric feature space; this embedding better preserves class-discriminatory information than any of the individual embeddings used in its construction.
The described invention (1) provides a framework for multi-parametric data analysis, (2) intelligently selects embeddings for combination via quantifying the nature of the embeddings, and (3) utilizes a supervised classification scheme to ensure that class-discriminatory information is specifically preserved within the final representation.
It inherently accounts for (1) differences in dimensionalities between individual protocols (via DR), (2) noise and parameter sensitivity issues with DR-based representation (via the use of an ensemble of embeddings), and (3) inter-protocol dependencies in the data (via intelligent ensemble embedding construction).