1. Field
The subject invention relates to systems and methods for classifying optical spectroscopy image data of biological tissues such as cancer and other pathologies for medical diagnostics.
2. Related Art
Current screening for early detection of breast cancer, for example, is based on either abnormalities visible in a mammogram or lumps detected by patients or doctors using palpation. Before any treatment is initiated, the diagnosis must be confirmed. This is usually accomplished by performing a biopsy, which is an invasive procedure, and then determining the histology of the tumor. A less invasive alternative is the use of fine-needle aspiration cytology (FNA). FNA is infrequently used, however, due to a significant false-negative rate. Approximately 50,000 diagnostic lumpectomies are performed annually in the U.S. Of those, only about 12,000 turn out to be malignant when histology is performed by a pathologist. If it had been known in advance that the remaining 38,000 lesions were benign, the potentially disfiguring surgery could have been avoided, as many benign lesions resolve spontaneously in time, without intervention.
Cervical cancer is the third most common cancer in women worldwide and is the leading cause of cancer mortality for women in developing countries. When precancerous lesions are detected early they are easily treatable by ablation or excision. At more advanced stages, cervical cancer often requires hysterectomy, chemotherapy, radiation therapy, or combined chemo-radiation therapy. Current screening for this type of cancer is accomplished first by a Papanicolaou (Pap) smear, with sensitivity and specificity values ranging from 11% to 99% and 14% to 97%, respectively, and with results usually available in two weeks. The second stage of the screening process, after an abnormal Pap smear, is a colposcopy. This test has an excellent sensitivity (>90%) but poor specificity (<50%), even in the hands of an experienced practitioner. Because of the poor specificity, a biopsy is required to confirm the diagnosis. Currently, women often wait up to eight weeks to be treated as part of the standard care in the diagnosis and treatment of cervical cancer after an abnormal Pap smear.
Barrett's esophagus is a pre-cancerous condition that is an important risk factor in developing esophageal adenocarcinoma, the most common form of esophageal cancer. It is associated with chronic gastroesophageal reflux disease and is increasing in incidence in western countries. The development of malignancy is thought to be a progression from nondysplastic Barrett's mucosa, through low-grade dysplasia (LGD) to high-grade dysplasia (HGD), to carcinoma. Consequently, it is critical to identify the patients with Barrett's esophagus who are most at risk of developing cancer. Patches of dysplasia within a section of Barrett's mucosa cannot be detected visually using conventional white light endoscopy. Diagnosis requires multiple random biopsies and subsequent histological examination. As many as 20-30 “random” biopsies may be taken in one session. This is a time-consuming (and expensive) procedure, which entails some degree of risk for the patient. For each conventional biopsy, the biopsy tool must be withdrawn from the endoscope and the specimen removed before the tool can be reinserted for the next biopsy. Because biopsies are taken at random from within a section of Barrett's esophagus, detection of pre-cancerous changes is relatively poor.
In recent years, several spectroscopy techniques have been proposed as potential methods for distinguishing between different tissue pathologies. The motivation of these techniques is to reduce, or eliminate, the need for surgical removal of biopsy tissue samples. Instead, some form of spectral analysis of the tissue is applied to measurements obtained with an optical probe placed on or near the surface of the tissue in question. A diagnosis of the tissue is then attempted based on these measurements, in situ, noninvasively and in real time. Additionally, there is the potential for reduced health care cost and patient distress as a consequence of the reduced need for histology and the need for the surgical environment required to take the biopsy samples. Some of these proposed spectroscopic techniques include Raman spectroscopy, autofluorescence spectroscopy, fluorescence spectroscopy, reflectance spectroscopy, and elastic-scattering spectroscopy.
Screening and/or detection of cancer at an early stage is of significant importance, as many cases of the disease can be treated successfully at early stages. In recent years, these optical spectroscopy methods have received increased attention for this purpose because they possess some desirable properties: they are noninvasive, they operate in situ, and results can be obtained almost in real time. These methods provide data sensitive to changes of the underlying tissue (e.g., structural or biochemical changes), which can be exploited for the development of diagnostic algorithms. Various statistical pattern recognition and machine learning methods have been used to develop these diagnostic algorithms.
For example, a maximum a posteriori (MAP) classifier was used to distinguish between squamous intraepithelial lesions (SILs) and normal squamous epithelia, and to distinguish between high-grade squamous intraepithelial lesions (HGSILs) and low-grade squamous intraepithelial lesions (LGSILs), using fluorescence spectroscopy applied to cervical tissue. Posterior probabilities were computed after fitting the training data to a gamma function. A sensitivity and specificity of 82% and 68%, respectively, for the first case and of 79% and 78% for the second case were reported.
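The sensitivity and specificity figures quoted throughout this section follow the usual definitions: the fraction of truly positive samples labeled positive, and the fraction of truly negative samples labeled negative. The following is a minimal sketch of that computation; the function name and the label encoding (1 for diseased, 0 for normal) are assumptions made for illustration and do not come from any of the cited studies.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    Assumes y_true contains at least one positive and one negative sample;
    otherwise one of the divisions below would be by zero.
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # diseased sample correctly flagged
            else:
                fn += 1  # diseased sample missed
        else:
            if pred == positive:
                fp += 1  # normal sample wrongly flagged
            else:
                tn += 1  # normal sample correctly cleared
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical labels: 4 diseased and 5 normal samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```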
Linear discriminant analysis has also been used. A classification accuracy of 72% was reported for distinguishing malignant melanoma from benign nevi in the skin using reflectance spectra. Elastic-scattering spectroscopy was used to detect dysplasia in the esophagus. Sensitivity of 77% and specificity of 77% were obtained in detecting “high risk” biopsies. The same spectroscopy technique has been employed to detect cancer in the sentinel lymph node for breast cancer, with a resulting sensitivity of 75% and specificity of 89%.
Fisher's linear discriminant has also been used. This method obtains the linear function yielding the maximum ratio of between-class scatter to within-class scatter. Raman spectroscopy was used to distinguish between normal tissue, low-grade dysplasia, and high-grade dysplasia/carcinoma in situ using rat models. A specificity of 93% and sensitivity of 78% were obtained for detecting low-grade dysplasia, and a sensitivity and specificity of 100% were obtained for detecting high-grade dysplasia/carcinoma in situ. Fluorescence spectroscopy was applied in order to detect cancer in the oral cavity. A sensitivity of 73% and a specificity of 92% were obtained after selecting features using recursive feature elimination (RFE).
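For reference, the ratio that Fisher's linear discriminant maximizes is commonly written as follows for the two-class case. The symbols w (projection vector), m_1 and m_2 (class means), C_c (the set of training samples in class c), S_B (between-class scatter) and S_W (within-class scatter) are notation assumed here for illustration and are not taken from the cited work.

$$J(w) = \frac{w^{T} S_B\, w}{w^{T} S_W\, w}, \qquad S_B = (m_1 - m_2)(m_1 - m_2)^{T}, \qquad S_W = \sum_{c=1}^{2} \sum_{x \in C_c} (x - m_c)(x - m_c)^{T}$$

When S_W is invertible, the maximizing direction is proportional to $S_W^{-1}(m_1 - m_2)$.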
Reflectance and fluorescence spectroscopy, respectively, have been used to differentiate normal and precancerous (neoplastic) cervical tissue using a Mahalanobis distance classifier. A sensitivity of 72% and specificity of 81% were reported when discriminating between squamous normal tissue and high-grade squamous intraepithelial lesions, while a sensitivity of 72% and a specificity of 83% were obtained when discriminating columnar normal tissue from high-grade squamous intraepithelial lesions. An average sensitivity and specificity of 78% and 81% respectively were obtained when the pairwise analysis between squamous normal tissue, columnar normal tissue, low-grade squamous intraepithelial lesions and high-grade squamous intraepithelial lesions was done.
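A Mahalanobis distance classifier of the general kind mentioned above assigns a sample to the class whose mean is nearest once distances are scaled by the inverse covariance. The sketch below is an illustration only: it assumes a single shared covariance matrix whose inverse is supplied directly, and the class names, means and feature values are hypothetical rather than taken from the cited studies.

```python
def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance d^T * cov_inv * d, with d = x - mean."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    n = len(d)
    return sum(d[i] * cov_inv[i][j] * d[j] for i in range(n) for j in range(n))


def classify(x, class_means, cov_inv):
    """Return the label of the class mean closest to x in Mahalanobis distance."""
    return min(class_means,
               key=lambda label: mahalanobis_sq(x, class_means[label], cov_inv))


# Hypothetical two-feature example; covariance diag(1, 4), so its inverse is diag(1, 0.25).
class_means = {"normal": [0.0, 0.0], "lesion": [4.0, 0.0]}
cov_inv = [[1.0, 0.0], [0.0, 0.25]]

label_a = classify([1.0, 2.0], class_means, cov_inv)
label_b = classify([3.0, 0.0], class_means, cov_inv)
```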
Another method being applied to spectroscopy data is artificial neural networks (ANN). These are typically known for being able to handle nonlinear problems. As an example of their use, an ANN classifier was used for distinguishing malignant melanoma from benign nevi in the skin using reflectance spectra, with a classification accuracy of 86.7% being reported. ANN yielded sensitivities of 69% and 58%, and specificities of 85% and 93%, for breast tissue and sentinel nodes, respectively, using data from elastic-scattering spectroscopy measurements.
In recent years, support vector machines (SVMs) have received increased attention in these types of applications. This is in part due to the fact that SVMs exhibit good generalization capability and are able to yield nonlinear decision boundaries through the implicit mapping of the data to a higher dimensional space by the use of kernel functions. Linear SVMs were used to classify nonmalignant and malignant tissue from the breast measured with fluorescence spectroscopy, obtaining a sensitivity of 70% and specificity of 92%. SVMs with linear and radial basis function (RBF) kernels have also been used. Sensitivities of 94% and 95%, and specificities of 97% and 99%, respectively, were obtained for distinguishing normal tissue from nasopharyngeal carcinoma using autofluorescence spectra. Fluorescence spectroscopy has been applied in order to detect cancer in the oral cavity. Sensitivities of 88%, 90%, and 93% and specificities of 94%, 95%, and 97% were obtained for linear, polynomial, and RBF SVMs, respectively, after selecting features using recursive feature elimination (RFE).
As support vector machines (SVMs) have garnered increased attention for classification problems, several error-rejection rules have been presented for this type of classifier. For points near the optimal hyperplane the classifier may not be very confident in the class labels assigned. In one prior approach, a rejection scheme was proposed in which samples whose distance to the separating hyperplane is below some threshold are rejected. A similar approach was used where the distance of tested data points to the optimal separating hyperplane was thresholded in order to reject a user-defined percentage of misclassified patterns, allowing for the reduction of the expected risk. A ROC-based reject rule has been proposed for SVMs. The ROC curve for the SVM classifier is obtained by varying the decision threshold, which is assumed to be zero, from −∞ to ∞. The true positive and false positive rates are obtained from the class-conditional densities produced by the outputs of the SVMs. A distance reject threshold has also been presented for SVM classifiers. The SVM output is the distance of that particular input pattern to the optimal separating hyperplane.
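The distance-based rejection idea described above can be sketched for a linear SVM as follows. This is an illustrative sketch, not the implementation used in the cited work: the weight vector w, bias b and threshold are hypothetical values, and a kernel SVM would replace the dot product with its kernel expansion.

```python
import math


def signed_distance(x, w, b):
    """Signed distance of x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def classify_with_reject(x, w, b, threshold):
    """Return +1 or -1, or None when x lies closer to the hyperplane than threshold."""
    d = signed_distance(x, w, b)
    if abs(d) < threshold:
        return None  # too close to the decision boundary: reject
    return 1 if d > 0 else -1


# Hypothetical trained parameters.
w = [3.0, 4.0]
b = -5.0

far_positive = classify_with_reject([3.0, 4.0], w, b, threshold=1.0)
near_boundary = classify_with_reject([1.0, 1.0], w, b, threshold=1.0)
far_negative = classify_with_reject([-1.0, -1.0], w, b, threshold=1.0)
```

Raising the threshold rejects more samples near the boundary, which is where the classifier is least confident.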
The foundation of these error rejection rules can be traced to the work presented by Chow. These error rejection rules are disadvantageous because they assume that the probability distribution for each class is known. In most pattern recognition applications common parametric forms rarely fit the densities encountered in practice.
Chow first explored error rejection in the context of Bayes decision theory. Within this framework, a feature vector x is said to belong to class wk if
$$P(w_k/x) = \max_{i} P(w_i/x), \quad i = 1, \ldots, N \qquad (1)$$
where P(wi/x) is the a posteriori probability and N is the total number of classes. This rule divides the feature space into N regions D1 . . . DN and classifies x as wk if it lies in the region Dk. Furthermore, it is optimal in the sense that it minimizes the probability of error, also called the Bayes error,
$$P_E = \sum_{i=1}^{N} \int_{D_i} \sum_{\substack{j=1 \\ j \neq i}}^{N} p(x/w_j)\, P(w_j)\, dx. \qquad (2)$$
Chow introduces the reject option in order to obtain a probability of error lower than the Bayes error. This is accomplished by refraining from classifying patterns that are likely to be misclassified. Chow's rule states that a feature vector x is classified as belonging to class wk if
$$\max_{i=1,\ldots,N} P(w_i/x) = P(w_k/x) \geq t \qquad (3)$$
and rejected if
$$\max_{i=1,\ldots,N} P(w_i/x) = P(w_k/x) < t \qquad (4)$$
where t is the rejection threshold. Thus, the introduction of the reject option divides the feature space into N+1 decision regions D0, D1, . . . , DN and classifies x as wk if it lies in the region Dk and rejects it if it lies in D0. It is optimal since P(wk/x) is the conditional probability of correctly classifying the pattern x. Note that both the probability of error (2) and the probability of rejection
$$P_R = \int_{D_0} \sum_{i=1}^{N} p(x/w_i)\, P(w_i)\, dx \qquad (5)$$
are now functions of the threshold t. Chow states that since (2) and (5) are monotonic functions of the threshold t, the performance of the recognition system is completely described by the curve resulting from (2) versus (5). In this error-reject tradeoff curve PE decreases and PR increases as the threshold t increases. In particular, PE equals the Bayes error and PR=0 for t=0, and PE=0 for t=1. A similar relationship was presented between false acceptance rates and false rejection rates as a function of the rejection threshold in the application of biometric verification systems.
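Chow's rule and the resulting error-reject tradeoff can be sketched in a few lines. The sketch assumes the posterior probabilities are already available for each sample, and it measures empirical error and reject rates on a small labeled set rather than evaluating the integrals for PE and PR; the function names and the sample values are hypothetical.

```python
def chow_decide(posteriors, t):
    """Return the index of the largest posterior if it is >= t, else None (reject)."""
    k = max(range(len(posteriors)), key=lambda i: posteriors[i])
    return k if posteriors[k] >= t else None


def error_reject_rates(samples, t):
    """samples: list of (posteriors, true_label). Returns (error_rate, reject_rate)."""
    errors = rejects = 0
    for posteriors, label in samples:
        decision = chow_decide(posteriors, t)
        if decision is None:
            rejects += 1
        elif decision != label:
            errors += 1
    n = len(samples)
    return errors / n, rejects / n


# Hypothetical two-class posteriors with true labels.
samples = [
    ([0.90, 0.10], 0),
    ([0.60, 0.40], 1),  # misclassified unless rejected
    ([0.55, 0.45], 0),
    ([0.20, 0.80], 1),
]

no_reject = error_reject_rates(samples, 0.0)
with_reject = error_reject_rates(samples, 0.7)
```

In this toy set, raising t from 0 to 0.7 removes the one error at the cost of rejecting half of the samples, which is the tradeoff the curve of (2) versus (5) describes.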
Another rejection scheme was proposed to improve reliability in neural networks. It defines two classification rules and finds a threshold for each. Let Ok be the output node corresponding to class k; if the input sample corresponds to the kth class, then the output node Ok=1 while all others equal zero. The first rule states that an input pattern belongs to class k if
$$\max_{i=1,\ldots,N} O_i = O_k \geq \sigma \qquad (6)$$
where σ is the rejection threshold, similar to Chow's rule. The second rule states that
$$O_k - O_j < \delta \qquad (7)$$
where Oj is the output node with the second highest value and δ is the rejection threshold. Thus, if the difference between the two highest output values is less than some threshold the input pattern is not classified. The two thresholds are then obtained by maximizing a performance function that depends on the error and rejection rates as well as their respective costs.
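The two thresholds can be illustrated with a short sketch. How the two rules are combined is an assumption here: the sketch rejects a pattern if either the top output falls below σ or the gap between the two highest outputs falls below δ. The function name and output values are hypothetical, and the threshold optimization step is not shown.

```python
def nn_decide(outputs, sigma, delta):
    """Return the winning class index, or None if either rejection test fires.

    Requires at least two output nodes.
    """
    order = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    k, j = order[0], order[1]  # highest and second-highest output nodes
    if outputs[k] < sigma:
        return None  # rule (6) not satisfied: top output too weak
    if outputs[k] - outputs[j] < delta:
        return None  # rule (7): two highest outputs too close
    return k


confident = nn_decide([0.90, 0.10, 0.05], sigma=0.5, delta=0.3)
weak = nn_decide([0.40, 0.30, 0.30], sigma=0.5, delta=0.3)
ambiguous = nn_decide([0.60, 0.50, 0.10], sigma=0.5, delta=0.3)
```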
The optimality of Chow's rule has also been investigated. It has been argued that Chow's rule is optimal only if the posterior probabilities of the data classes are exactly known; however, this is generally not the case, and the posterior probabilities have to be estimated from the training data. As a result, sub-optimal results are obtained when this rule is applied to the estimated probabilities, since the decision regions are shifted with respect to where they would be in the optimal case. The use of multiple class-dependent thresholds has been proposed as a solution. In this approach Chow's rule is modified so that a pattern x is classified as belonging to class wk if
$$\max_{i=1,\ldots,N} \hat{P}(w_i/x) = \hat{P}(w_k/x) \geq t_k \qquad (8)$$
and rejected if
$$\max_{i=1,\ldots,N} \hat{P}(w_i/x) = \hat{P}(w_k/x) < t_k. \qquad (9)$$
Here $\hat{P}(w_i/x)$ is the estimated posterior probability. The thresholds are determined by maximizing the accuracy probability subject to maintaining the reject probability below a user-defined value. Both the accuracy and reject probabilities are functions of the class thresholds.
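The class-dependent variant differs from Chow's rule only in that the winning class selects which threshold is applied. The sketch below shows the decision step alone and assumes the thresholds have already been chosen; the constrained search that determines them is not shown, and the names and values are hypothetical.

```python
def decide_class_thresholds(est_posteriors, thresholds):
    """Accept the top class k only if its estimated posterior reaches thresholds[k]."""
    k = max(range(len(est_posteriors)), key=lambda i: est_posteriors[i])
    return k if est_posteriors[k] >= thresholds[k] else None


# Hypothetical thresholds: class 1 must be predicted with more confidence than class 0.
thresholds = [0.6, 0.9]

accepted_0 = decide_class_thresholds([0.70, 0.30], thresholds)
rejected_1 = decide_class_thresholds([0.20, 0.80], thresholds)
accepted_1 = decide_class_thresholds([0.05, 0.95], thresholds)
```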
Another rejection rule was based on analysis of the Receiver Operating Characteristic (ROC) curve. The two classes are called Positive (P) and Negative (N) and the decision rule is defined as:
    assign the sample to N if x < tN
    assign the sample to P if x > tP
    reject the sample if tN ≤ x ≤ tP
where tN and tP (tN ≤ tP) are the rejection thresholds. The optimal thresholds maximize a performance function defined by the false negative, true negative, false positive, true positive and rejection rates, their respective costs, and the prior probabilities. The solution yields a set of parallel straight lines whose slopes are determined by the costs and prior probabilities. The optimal values for the thresholds are then found by searching for the points on the ROC curve, constructed by graphing the true positive rate versus the false positive rate, that intersect these lines and have minimum value.
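The three-way decision rule above translates directly into code. In this sketch x is taken to be a scalar classifier output, and the threshold values are hypothetical; the cost-based search for the optimal tN and tP on the ROC curve is not shown.

```python
def roc_reject_decide(x, t_n, t_p):
    """Return "N", "P", or "reject" for scalar output x given thresholds t_n <= t_p."""
    if t_n > t_p:
        raise ValueError("t_n must not exceed t_p")
    if x < t_n:
        return "N"
    if x > t_p:
        return "P"
    return "reject"  # t_n <= x <= t_p


# Hypothetical thresholds around a decision boundary at zero.
decisions = [roc_reject_decide(x, t_n=-0.5, t_p=0.5) for x in (-1.2, -0.5, 0.1, 0.9)]
```

Setting tN equal to tP collapses the reject band to a single point, so that the rule behaves like an ordinary two-class threshold.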
Another rejection rule that has been considered deals with incomplete knowledge about classes. In this work two rejection thresholds were defined. The first, the ambiguity reject threshold, aims, like Chow's rule, to reject samples with a high risk of misclassification in order to decrease the classification error. The second, denoted the distance reject threshold, aims to decrease the probability of erroneously classifying an input pattern x into one of the N classes when it is “far” from the known classes (i.e., outliers). The assumption is that not all patterns come from one of the N previously defined classes. This same rejection rule was applied to neural network classifiers. An optimum class-selective rejection rule has also been presented. This approach is another extension of the rejection rule: when an input pattern cannot be reliably classified as one of the N defined classes, instead of being rejected it is assigned to the subset of classes to which the pattern most likely belongs. Thus the feature space is divided into 2^N−1 decision regions instead of N+1 regions as in Chow's rule.
Even though these methods cover a wide range of applications using different spectroscopy methods and several types of classifiers, the sensitivities and specificities obtained do not vary much across the cases presented. With few exceptions, the average sensitivity and specificity fluctuate between 70% and 85%.
Thus, what is needed is an improved method for classifying optical spectroscopy data. These improved methods can be used to improve diagnosis and, therefore, treatment of cancer and other pathologies.