Disclosed herein is a novel tuning mechanism for Gaussian or Radial Basis Function (RBF) kernels in which each attribute (or feature) is characterized by its own Parzen window sigma. The kernel trick is frequently used in machine learning to transform the input domain into a feature domain where linear methods are then used to find an optimal solution to a regression or classification problem. Support Vector Machines (SVM), Kernel Principal Component Regression (K-PCR), Kernel Ridge Regression (K-RR), and Kernel Partial Least Squares (K-PLS) are examples of techniques that apply kernels for machine learning and data mining. Many different kernels are possible, but the RBF (Gaussian) kernel is one of the most popular. Equation (1) represents a single element in the RBF kernel,
$$k(i,j) = e^{-\frac{\|x_i - x_j\|^2}{2\sigma^2}} \tag{1}$$
where $x_i$ and $x_j$ denote two data samples. Traditionally, most machine learning approaches use a single value σ in the RBF kernel (as indicated in the equation above), which then needs to be tuned on a validation or tuning data set. Here, each attribute is associated with a different σ value, which is then tuned on a validation data set with the aim of achieving better prediction performance than that achieved by RBF kernels with a single σ. The expression for a single RBF kernel entry becomes,
$$k(i,j) = \prod_{l=1}^{m} e^{-\frac{\|x_i^l - x_j^l\|^2}{2\sigma_l^2}} \tag{2}$$
where m is the number of attributes in the sample data, $x_i^l$ is the l-th attribute of sample $x_i$, and $\sigma_l$ is the σ value associated with that attribute. There are several advantages of using an automated tuning algorithm for a vector of σ values rather than selecting a single scalar variable:
    Manual tuning of multiple σ values is a tedious procedure;
    The same automated procedure applies to most machine learning methods that use an RBF kernel;
    The values of the optimized σ can be used as a gauge for variable selection (Specht, 1990).
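The two kernel entries defined in Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the disclosed implementation; the function names (`rbf_single_sigma`, `rbf_multi_sigma`, `kernel_matrix`) are hypothetical, and samples are assumed to be plain lists of numbers of equal length.

```python
import math


def rbf_single_sigma(x_i, x_j, sigma):
    """Equation (1): one kernel entry with a single shared sigma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def rbf_multi_sigma(x_i, x_j, sigmas):
    """Equation (2): one kernel entry with one sigma per attribute.

    The product of exponentials over the m attributes is computed as the
    exponential of the summed exponents, which is mathematically identical.
    """
    if not (len(x_i) == len(x_j) == len(sigmas)):
        raise ValueError("samples and sigmas must have the same length m")
    exponent = sum(
        (a - b) ** 2 / (2.0 * s ** 2) for a, b, s in zip(x_i, x_j, sigmas)
    )
    return math.exp(-exponent)


def kernel_matrix(samples, sigmas):
    """Full kernel matrix built from the per-attribute-sigma entries."""
    return [[rbf_multi_sigma(xi, xj, sigmas) for xj in samples] for xi in samples]


# Illustrative data only (not taken from the disclosure).
x_a = [1.0, 2.0, 3.0]
x_b = [2.0, 0.0, 3.5]
K = kernel_matrix([x_a, x_b], [1.0, 2.0, 3.0])
```

When all entries of `sigmas` are equal, Equation (2) reduces to Equation (1). A very large σ for one attribute drives that attribute's factor toward 1, so the attribute effectively drops out of the kernel, which is why the tuned σ values can serve as a gauge for variable selection.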
Automated tuning of kernel parameters is an important problem that arises in many scientific applications, such as image classification (Guo, 2008; Fillion, 2010) and time series forecasting (He, 2008; Rubio, 2010). A number of researchers have proposed algorithms for solving it, especially in the context of SVMs. Related work includes Grandvalet et al. (Grandvalet, 2002), who introduced an algorithm for automatic relevance determination of input variables in SVMs. Relevance is measured by scale factors defining the input space metric. The metric is automatically tuned by minimizing the standard SVM empirical risk, where the scale factors are added to the usual set of parameters defining the classifier. Cristianini et al. (Cristianini, 1998) applied an iterative optimization scheme to estimate a single kernel width hyper-parameter in SVM classifiers. In their procedure, model selection and learning are not separate; instead, kernels are dynamically adjusted during the learning process to find the kernel parameter that provides the best possible upper bound on the generalization error. Chapelle et al. (Chapelle, 2002) extended the single kernel width hyper-parameter to multiple sigma parameters for the same problem in SVMs in order to perform adaptive scaling and variable selection. This method has been extended to the Gaussian Automatic Relevance Determination kernel via optimization of kernel polarization (Wang, 2010). A further extension is multi-class feature selection applied to text classification (Chapelle, 2008). Chapelle et al.'s method has the advantage that the gradients are computed analytically, as opposed to the empirical approximation used here. The algorithm proposed here is very similar to the one proposed by Chapelle et al.
However, the approach here differs in that it uses a Levenberg-Marquardt-like optimization approach, in which a λ parameter gradually changes the algorithm from a first-order method to a second-order method. In addition, we use a Q2 error metric, which is more robust on unbalanced data sets, and a leave-several-out validation option for improved computing time. Finally, we apply the algorithm to K-PLS rather than SVMs.
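The role of the λ parameter can be illustrated with a generic damped update in the Levenberg-Marquardt style. The sketch below is not the disclosed algorithm: it assumes a diagonal curvature estimate, uses finite differences for the empirical gradient, and minimizes a hypothetical `toy_error` function that merely stands in for a validation error over two σ values. All function names and default settings are assumptions made for illustration.

```python
def central_gradient(f, x, h=1e-5):
    """Empirical (central-difference) gradient of f at x."""
    grad = []
    for k in range(len(x)):
        up, down = list(x), list(x)
        up[k] += h
        down[k] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def central_curvature(f, x, h=1e-4):
    """Empirical second derivative along each coordinate (diagonal only)."""
    fx = f(x)
    curv = []
    for k in range(len(x)):
        up, down = list(x), list(x)
        up[k] += h
        down[k] -= h
        curv.append((f(up) - 2.0 * fx + f(down)) / (h * h))
    return curv


def lm_like_minimize(f, x0, lam=10.0, n_iter=60, shrink=0.5, grow=2.0):
    """Damped update: step_k = -g_k / (c_k + lam).

    Large lam gives a short gradient-descent-like (first-order) step;
    as lam shrinks, the step approaches a Newton (second-order) step.
    """
    x, fx = list(x0), f(x0)
    for _ in range(n_iter):
        g = central_gradient(f, x)
        c = central_curvature(f, x)
        # Negative curvature is clipped so the denominator stays positive.
        step = [-gk / (max(ck, 0.0) + lam) for gk, ck in zip(g, c)]
        cand = [xk + sk for xk, sk in zip(x, step)]
        f_cand = f(cand)
        if f_cand < fx:      # improvement: accept and trust curvature more
            x, fx = cand, f_cand
            lam *= shrink
        else:                # no improvement: reject and damp more strongly
            lam *= grow
    return x, fx, lam


def toy_error(sigmas):
    """Hypothetical stand-in for a validation error over two sigma values."""
    return (sigmas[0] - 1.5) ** 2 + 4.0 * (sigmas[1] - 0.5) ** 2


best, err, final_lam = lm_like_minimize(toy_error, [3.0, 3.0])
```

Starting with a large λ keeps the early steps cautious while the curvature estimate is unreliable; shrinking λ after each successful step lets the later iterations take nearly full second-order steps.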
Partial Least Squares (PLS) (H. Wold, 1966) was introduced by the Swedish statistician Herman Wold for econometric modeling of multivariate time series. After it was applied to chemometrics in the early eighties (S. Wold, 2001), PLS became one of the most popular and powerful tools in chemometrics and drug design. PLS can be viewed as a “better” Principal Components Analysis (PCA) regression method, where the data are first transformed into a different and non-orthogonal basis and only the most important PLS components (or latent variables) are considered for building a regression model (similar to PCA). The difference between PLS and PCA is that the new set of basis vectors in PLS is not a set of successive orthogonal directions that explain the largest variance in the data, but is actually a set of conjugate gradient vectors to the correlation matrix that form a Krylov space (Ilse, 1998). Krylov space methods are widely used iterative methods in numerical linear algebra for solving large systems of linear equations while avoiding matrix-matrix operations. PLS regression is one of the most powerful data mining tools for large data sets with many variables with high collinearity. The NIPALS implementation of PLS (H. Wold, 1975) is elegant and fast.
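For orientation, a textbook NIPALS-style PLS regression for a single response (often called PLS1) can be sketched as follows. The source only names NIPALS and does not give its steps, so this is a standard variant written under that assumption; the function names `pls1_nipals` and `pls1_predict` and the small data set are hypothetical.

```python
import math


def pls1_nipals(X, y, n_components):
    """Fit PLS1 by extracting latent variables one at a time with deflation."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[k] for row in X) / n for k in range(m)]
    y_mean = sum(y) / n
    E = [[row[k] - x_mean[k] for k in range(m)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    W, P, q = [], [], []
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance with the residual y.
        w = [sum(E[i][k] * f[i] for i in range(n)) for k in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][k] * w[k] for k in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][k] * t[i] for i in range(n)) / tt for k in range(m)]
        qa = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y by the part explained by this latent variable.
        E = [[E[i][k] - t[i] * p[k] for k in range(m)] for i in range(n)]
        f = [f[i] - qa * t[i] for i in range(n)]
        W.append(w)
        P.append(p)
        q.append(qa)
    return W, P, q, x_mean, y_mean


def pls1_predict(model, x):
    """Predict one sample by replaying the deflation steps on it."""
    W, P, q, x_mean, y_mean = model
    e = [xk - mk for xk, mk in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, qa in zip(W, P, q):
        t = sum(ek * wk for ek, wk in zip(e, w))
        y_hat += qa * t
        e = [ek - t * pk for ek, pk in zip(e, p)]
    return y_hat


# Illustrative data generated from y = 2*x1 - x2 + 1 (not from the disclosure).
X_demo = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y_demo = [1.0, 3.0, 0.0, 2.0, 4.0]
model = pls1_nipals(X_demo, y_demo, n_components=2)
```

With as many latent variables as attributes, PLS1 reproduces the ordinary least squares fit; in practice fewer components are kept, which is what makes the method attractive for collinear data.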
Linear Kernel Partial Least Squares (K-PLS) was first described in (Lindgren, 1993) and applied to spectral analysis in the late 1990s (Liu, 1999). Going beyond linear K-PLS, Rosipal introduced K-PLS in 2001 (Rosipal, 2001) as a nonlinear extension of PLS. This nonlinear extension makes K-PLS a powerful machine learning tool for classification as well as regression. K-PLS can also be formulated as a paradigm closely related to Support Vector Machines (SVM) (Vapnik, 1998; Boser, 1992; Bennett, 2003). In addition, the statistical consistency of K-PLS was recently proved from a theoretical perspective (Blanchard, 2010).
Since K-PLS was introduced in 2001, researchers in chemometrics have gradually switched from PLS to K-PLS as a standard tool for data mining (Embrechts, 2007; Tian, 2009). Meanwhile, K-PLS has attracted the interest of other researchers for different industrial applications such as face recognition (Štruc, 2009) and financial forecasting (Huang, 2010). In domains where the signal is acquired through sensors (electrocardiograms, echocardiograms, angiograms, etc.), machine learning has become a crucial tool for signal analysis. PLS combined with different signal preprocessing techniques has been applied in various research projects. Partial least squares logistic regression was applied to electroencephalograms for early detection of patients with probable Alzheimer's disease (Lehmann, 2007). Chen et al. (Chen, 2009) combined partial least squares with Fourier transform near infrared reflectance spectroscopy to analyze the main catechin contents in green tea. In this disclosure, sigma tuning of the Gaussian kernel is applied to magnetocardiograms for the diagnosis of ischemic heart disease. The sigma tuning procedure is implemented for a K-PLS model. The justification for using K-PLS is that there is generally no significant difference in performance between K-PLS and other kernel-based learning methods such as SVMs (Han, 2006).
For background, the following references are referred to at various points in this application:
Bennett, K. & Embrechts, M. (2003). An Optimization Perspective on Kernel Partial Least Squares Regression. In J. Suykens, G. Horvath, C. M. S. Basu, & J. Vandewalle (Eds.), Advances in Learning Theory: Methods, Models and Applications, volume 190 of NATO Science III: Computer & Systems Sciences (pp. 227-250). Amsterdam: IOS Press.
Bi, J., Bennett, K., Embrechts, M., Breneman, C., & Song, M. (2003). Dimensionality Reduction via Sparse Support Vector Machines. Journal of Machine Learning Research, 3, 1229-1243.
Blanchard, G., & Krämer, N. (2010). Kernel Partial Least Squares is Universally Consistent. Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Sardinia, Italy.
Blum, A. & Langley, P. (1997). Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, 1-2, 245-271.
Boser, B., Guyon, I., & Vapnik, V. (1992). A Training Algorithm for Optimal Margin Classifiers. 5th Annual ACM Workshop on COLT, Pittsburgh, Pa., ACM Press.
Bradley, A. (1997). The Use of the Area under the ROC Curve in Evaluation of Machine Learning Algorithms. Pattern Recognition, 30(7), 1145-1159.
Chang, C. & Lin, C. LIBSVM: A Library for Support Vector Machines. Accessed 5 Sep. 2004, from http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chapelle, O. & Vapnik, V. (2002). Choosing Multiple Parameters for Support Vector Machines. Machine Learning, 46(1-3), 131-159.
Chapelle, O. & Keerthi, S. (2008). Multi-Class Feature Selection with Support Vector Machines. Proc. of American Statistical Association.
Chen, Q., Zhao, J., Chaitep, S., & Guo, Z. (2009). Simultaneous analysis of main catechins contents in green tea (Camellia sinensis (L.)) by Fourier transform near infrared reflectance (FT-NIR) spectroscopy. Food Chemistry, 113(4), 1272-1277.
Cristianini, N. & Campbell, C. (1998). Dynamically Adapting Kernels in Support Vector Machines. Neural Information Processing Systems.
Cristianini, N. & Shawe-Taylor, J. (2000). Support Vector Machines and Other Kernel based Learning Methods. Cambridge University Press.
Embrechts, M., Bress, R., & Kewley, R. (2005). Feature Selection via Sensitivity Analysis with Direct Kernel PLS. In I. Guyon and S. Gunn (Eds.), Feature Extraction. New York, N.Y.: Springer-Verlag.
Embrechts, M., Szymanski, B., & Sternickel, K. (2004). Introduction to Scientific Data Mining: Direct Kernel Methods and Applications. In S. Ovaska (Ed.), Computationally Intelligent Hybrid Systems: The Fusion of Soft and Hard Computing (pp. 317-362). New York, N.Y.: John Wiley.
Embrechts, M. & Ekins, S. (2007). Classification of metabolites with kernel-partial least squares (K-PLS). Drug Metabolism and Disposition, 35(3), 325-327.
Fawcett, T. (2003). ROC Graphs: Notes and Practical Considerations for Data Mining Researchers. Technical Report HPL-2003-4, Hewlett Packard, Palo Alto, Calif.
Fawcett, T. & Provost, F. (2001). Robust Classification for Imprecise Environments. Machine Learning Journal, 42(3), 203-231.
Fillion, C. & Sharma, G. (2010). Detecting Content Adaptive Scaling of Images for Forensic Applications. In N. Memon, J. Dittmann, A. Alattar and E. Delp III (Eds.), Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 7541.
Golbraikh, A. & Tropsha, A. (2002). Beware of q2! Journal of Molecular Graphics and Modeling, 20, 267-276.
Grandvalet, Y. & Canu, S. (2002). Adaptive Scaling for Feature Selection in SVMs. Neural Information Processing Systems.
Guo, B., Gunn, S., Damper, R. I., & Nelson, J. (2008). Customizing Kernel Functions for SVM-Based Hyperspectral Image Classification. IEEE Transactions on Image Processing, 17(4), 622-629.
Guyon, I. & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157-1182.
Ham, F. & Kostanic, I. (2001). Principles of Neurocomputing for Science and Engineering. McGraw Hill.
Han, L., Embrechts, M., Szymanski, B., Sternickel, K., & Ross, A. (2006). Random Forests Feature Selection with K-PLS: Detecting Ischemia from Magnetocardiograms. European Symposium on Artificial Neural Networks, Bruges, Belgium.
Hastie, T., Tibshirani, R., & Friedman, J. (2003). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, N.Y.: Springer.
He, W., Wang, Z., & Jiang, H. (2008). Model optimizing and feature selecting for support vector regression in time series forecasting. Neurocomputing, 72(1-3), 600-611.
Huang, S., & Wu, T. (2010). Integrating recurrent SOM with wavelet-based kernel partial least squares regressions for financial forecasting. Expert Systems with Applications, 37(8), 5698-5705.
Ilse, C. & Meyer, C. (1998). The Idea behind Krylov Methods. American Mathematical Monthly, 105, 889-899.
Lehmann, C., Koenig, T., Jelic, V., Prichep, L., John, R., Wahlund, L., Dodge, Y., & Dierks, T. (2007). Application and comparison of classification algorithms for recognition of Alzheimer's disease in electrical brain activity (EEG). Journal of Neuroscience Methods, 161(2), 342-350.
Lindgren, F., Geladi, P., & Wold, S. (1993). The Kernel Algorithm for PLS. Journal of Chemometrics, 7, 45-49.
Liu, S. & Wang, W. (1999). A study on the Applicability on Multicomponent Calibration Methods in Chemometrics. Chemometrics and Intelligent Laboratory Systems, 45, 131-145.
Masters, T. (1995). Advanced Algorithms for Neural Networks: A C++ Sourcebook. New York, N.Y.: John Wiley & Sons.
Newman, D., Hettich, S., Blake, C., & Merz, C. (1998). UCI Repository of Machine Learning Databases.
Rosipal, R. & Trejo, L. (2001). Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Spaces. Journal of Machine Learning Research, 2, 97-128.
Rousseauw, J., du Plessis, J., Benade, A., Jordaan, P., Kotze, J., Jooste, P., & Ferreira, J. (1983). Coronary risk factor screening in three rural communities. South African Medical Journal, 64, 430-436.
Rubio, G., Herrera, L., Pomares, H., Rojas, I., & Guillen, A. (2010). Design of Specific-to-problem kernels and use of kernel weighted K-nearest neighbors for time series modeling. Neurocomputing, 73(10-12), 1965-1975.
Specht, D. F. (1990). Probabilistic Neural Networks. Neural Networks, 3, 109-118.
Štruc, V., & Pavešić, N. (2009). Gabor-Based Kernel Partial-Least Squares Discrimination for Face Recognition. Informatica, 20, 115-138.
Suykens, J., Gestel, T., Brabanter, J., Moor, B., & Vandewalle, J. (2003). Least Squares Support Vector Machines. World Scientific Publishing Company.
Swets, J., Dawes, R., & Monahan, J. (2000, October). Better Decisions through Science. Scientific American, 82-87.
Tian, H., Tian, X., Deng, X., & Wang, P. (2009). Soft Sensor for Polypropylene Melt Index Based on Adaptive Kernel Partial Least Squares. Control and Instruments in Chemical Industry.
Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons.
Wang, T., Huang, H., Tian, S., & Xu, J. (2010). Feature selection for SVM via optimization of kernel polarization with Gaussian ARD kernels. Expert Systems with Applications, 37(9), 6663-6668.
Wold, H. (1966). Estimation of Principal Components and related Models by Iterative Least Squares. In P. Krishnaiah (Ed.), Multivariate Analysis (pp. 391-420). New York N.Y.: Academic Press.
Wold, H. (1975). Path with Latent Variables: The NIPALS Approach. In H. M. Blalock (Ed.), Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building (pp. 307-357). New York N.Y.: Academic Press.
Wold, S., Sjöström, M., & Eriksson, L. (2001). PLS-Regression: A Basic Tool of Chemometrics. Chemometrics and Intelligent Laboratory Systems, 58, 109-130.
Additional recommended background reading selections include the following:
Bennett, K. & Embrechts, M. (2003). An Optimization Perspective on Kernel Partial Least Squares Regression. In J. Suykens, G. Horvath, C. M. S. Basu, & J. Vandewalle (Eds.), Advances in Learning Theory: Methods, Models and Applications, volume 190 of NATO Science III: Computer & Systems Sciences (pp. 227-250). Amsterdam: IOS Press.
Chapelle, O. & Vapnik, V. (2002). Choosing Multiple Parameters for Support Vector Machines. Machine Learning, 46(1-3), 131-159.
Cristianini, N. & Shawe-Taylor, J. (2000). Support Vector Machines and Other Kernel based Learning Methods. Cambridge University Press.
Embrechts, M., Szymanski, B., & Sternickel, K. (2004). Introduction to Scientific Data Mining: Direct Kernel Methods and Applications. In S. Ovaska (Ed.), Computationally Intelligent Hybrid Systems: The Fusion of Soft and Hard Computing (pp. 317-362). New York, N.Y.: John Wiley.
Embrechts, M., Bress, R., & Kewley, R. (2005). Feature Selection via Sensitivity Analysis with Direct Kernel PLS. In I. Guyon and S. Gunn (Eds.), Feature Extraction. New York, N.Y.: Springer-Verlag.
Han, L., Embrechts, M., Chen, Y., & Zhang, X. (2006). Kernel Partial Least Squares for Terahertz Radiation Spectral Source Identification. IEEE World Congress on Computational Intelligence.
Embrechts, M., Szymanski, B., Sternickel, K., Naenna, T., & Bragaspathi, R. (2003). Use of Machine Learning for Classification of Magnetocardiograms. Proceeding of IEEE Conference on System, Man and Cybernetics, Washington D.C.
Kim, K., Kwon, H., Lee, Y. H., Kim, T. E., Kim, J. M., Park, Y. K., Moon, J. Y., Ko, Y. G. & Chung, N. (2005). Clinical Parameter Assessment in Magnetocardiography by Using the Support Vector Machine. IJBEM, Vol. 7, No. 1.
Rosipal, R. & Trejo, L. (2001). Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Spaces. Journal of Machine Learning Research, 2, 97-128.
Schölkopf, B. & Smola, A. (2002). Learning with Kernels. MIT Press.
Shawe-Taylor, J. & Cristianini, N. (2004). Kernel Methods for Pattern Analysis. Cambridge University Press.
Szymanski, B., Han, L., Embrechts, M., Ross, A., Sternickel, K., & Zhu, L. (2006). Using Efficient SUPANOVA Kernel for Heart Disease Diagnosis. Proceeding of ANNIE 2006, Intelligent Engineering Systems Through Artificial Neural Networks, St. Louis, Mo., ASME, New York, N.Y.
Vapnik, V. (1998). Statistical Learning Theory. John Wiley & Sons.
Wold, H. (1975). Path with Latent Variables: The NIPALS Approach. In H. M. Blalock (Ed.), Quantitative Sociology: International Perspectives on Mathematical and Statistical Model Building (pp. 307-357). New York N.Y.: Academic Press.
Wold, H. (1966). Estimation of Principal Components and related Models by Iterative Least Squares. In P. Krishnaiah (Ed.), Multivariate Analysis (pp. 391-420). New York N.Y.: Academic Press.