The present disclosure relates to digital pathology and, in particular, to the fully automated cancer diagnosis of an entire sample of tissue on a histological slide.
While recent studies in molecular biology have provided great advances for diagnostic molecular pathology, traditional histological diagnosis is still the most powerful method for diagnosing diseases. Although still mostly performed by pathologists using optical microscopes, histological diagnosis is currently undergoing the “digital revolution” that has already occurred in the fields of radiology and cytology. This revolution was sparked by the advent of high-resolution whole slide imaging (WSI) scanners, and applications in remote diagnosis, teaching, and archival systems are already taking advantage of the convenience afforded by digital files over glass slides. Next in line are automated or assisted diagnosis systems, in which a computer analyzes imaged tissue sections to bring increased accuracy and speed to the clinical workflow.
However, automated analysis of hematoxylin and eosin (H&E) stained tissue sections by computer image analysis is extremely difficult for two main reasons. First, at high-power magnification, segmenting cells from the structures in which they are embedded is hard, making cell-based diagnosis very challenging. Second, many tumors manifest themselves as subtle changes in the structural fabric of the tissue, making it necessary to develop additional structural analysis algorithms at low-to-medium magnification. These two types of analysis, which take place at different magnifications, must be combined to produce an accurate diagnosis. These difficulties are compounded by the presence of various histological conditions such as necrosis, hyperplasia, and inflammation. Furthermore, structural abnormalities of tissues and benign tumors may complicate the task. For these reasons, and despite a large amount of research, automated analysis of histological H&E tissue sections has so far had limited impact on the clinical workflow. Among the more mature systems, we note the prostate cancer detection of Madabushi et al.
Machine learning has recently become the method of choice for automated analysis of complex images. Although the majority of computer-assisted diagnosis (CAD) systems use supervised learning, a key aspect of whole-tissue classification makes this approach inefficient: a negatively labeled tissue shows no sign of malignancy anywhere on its area, whereas a positively labeled tissue shows malignancy on only parts of the tissue. This problem has generally been addressed by having pathologists manually trace the tumor areas, thus providing definite positive labels. Unfortunately, this approach is labor-intensive and cannot be scaled to large training sets, which, in turn, are essential to capture the wide range of conditions encountered in a typical clinical setting. Furthermore, pathologists are often loath to assign a label to small regions without taking into account a larger contextual area. Yet the key to attaining adequate performance is the ability of a classifier to be trained on a large scale with real day-to-day data samples.
A solution to this problem is provided by the multiple-instance learning (MIL) framework. Typical supervised learning algorithms deal with instances represented by a single feature vector of fixed dimensionality, to which a label is assigned. In MIL, the input is instead a set (or bag) of multiple vectors with a single label for the set. A positive label means that at least one instance in the set is positive, while a negative label means that all instances in the set are negative. Applied to histology, a tissue sample is segmented into a set of regions of interest (ROIs), and the tissue-level diagnosis serves as the label of the set. For positive tissues, one or more ROIs will contain evidence of cancer, while for negative tissues, no ROI will contain any sign of cancer. MIL has been successfully used in a wide range of applications, from drug activity prediction, where it was first formalized by Dietterich, to content-based image retrieval and face detection. Previous uses of MIL in histological sample analysis include Dundar et al., where it was used to train support vector machine (SVM) classifiers to differentiate between atypical ductal hyperplasia and ductal carcinoma in situ in a small dataset of breast biopsy samples. More recently, the work of Xu et al. has shown the advantages of MIL for classification of histological tissues, albeit on a very small dataset of colon tissues.
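The MIL labeling rule described above can be illustrated with a minimal sketch. The function names, the per-ROI scores, and the 0.5 decision threshold are hypothetical and chosen only for illustration; taking the maximum of the per-ROI scores is one common way to turn instance scores into a set-level score and is not necessarily the aggregation used by any of the cited works.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[int]) -> int:
    """MIL assumption: a bag is positive iff at least one instance is positive."""
    return int(any(label == 1 for label in instance_labels))


def bag_score(instance_scores: Sequence[float]) -> float:
    """Aggregate per-ROI scores by taking the maximum (assumed aggregation rule)."""
    if not instance_scores:
        raise ValueError("a bag must contain at least one instance")
    return max(instance_scores)


# Hypothetical per-ROI cancer scores for two tissue samples.
negative_tissue = [0.05, 0.10, 0.02]  # no ROI shows evidence of cancer
positive_tissue = [0.04, 0.91, 0.08]  # a single ROI shows evidence of cancer

THRESHOLD = 0.5  # illustrative decision threshold
neg_pred = int(bag_score(negative_tissue) >= THRESHOLD)
pos_pred = int(bag_score(positive_tissue) >= THRESHOLD)
```

In this sketch a single high-scoring ROI is enough to label the whole tissue positive, which mirrors the fact that a positively labeled tissue shows malignancy on only part of its area.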
Classifying ROIs is generally performed in two steps: feature extraction followed by classification. Extracting features may become computationally expensive as the number and complexity of the features increase (for example, examining an entire breast biopsy tissue at high magnification for patterns of cancerous nuclei would take several hours). On the other hand, there is a vast amount of redundancy in histology tissue images. While these images often run in the gigapixel range, patterns of interest tend to repeat themselves over the tissue. Also, some patterns may only be visible at high magnification, while others are readily visible at low magnification. Some patterns exhibit wide variations in shape and size (for example, gland formations), while others have a distinct shape and size (for example, a nucleolus). It is therefore advantageous to exploit this redundancy and the a priori knowledge about patterns of interest to reduce the amount of computation needed to classify a tissue image.
One of the most common approaches to reducing the computational cost incurred by feature acquisition is feature selection. This technique aims at reducing the number of features to a small subset without incurring a loss in classification accuracy. The main difference between feature selection and our proposed approach is that feature selection is typically done only once, at training time. Once the subset of features has been selected, the same subset is always used from then on, regardless of the situation. Our approach instead uses knowledge gained at training time to intelligently decide which feature to acquire at test time. In doing so, our system can adapt to the peculiarities of the given tissue being analyzed.
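To make the static nature of conventional feature selection concrete, the following sketch selects a feature subset once from training data and then applies that same subset to every test sample. The ranking criterion (absolute difference of class means), the function names, and the toy data are assumptions made for illustration only; they do not represent the method of any cited work or of the present disclosure.

```python
from typing import List, Sequence


def select_features(X: Sequence[Sequence[float]], y: Sequence[int], k: int) -> List[int]:
    """Pick the k feature indices whose class means differ the most (illustrative criterion)."""
    n_features = len(X[0])
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]

    def mean(rows: Sequence[Sequence[float]], j: int) -> float:
        return sum(r[j] for r in rows) / len(rows)

    scores = [abs(mean(pos, j) - mean(neg, j)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def project(sample: Sequence[float], selected: Sequence[int]) -> List[float]:
    """Keep only the features chosen at training time."""
    return [sample[j] for j in selected]


# Toy training set: features 0 and 2 separate the classes, feature 1 does not.
X_train = [[0.9, 0.5, 0.1], [0.8, 0.4, 0.2], [0.1, 0.5, 0.9], [0.2, 0.6, 0.8]]
y_train = [1, 1, 0, 0]

selected = select_features(X_train, y_train, k=2)  # computed once, at training time
test_sample = project([0.3, 0.7, 0.6], selected)   # same subset for every test sample
```

The subset stored in `selected` never changes after training, so every test sample incurs the same acquisition cost regardless of its content.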
Another common approach is to build a cascade of classifiers. One or more features are grouped into a set, and the sets are arranged into a cascade of classifiers that are trained jointly but can be evaluated sequentially. These classifiers are tuned to produce very few false positives, and the cascade is interrupted as soon as one classifier returns a positive answer. Others have used a “control” algorithm based on a utility function, but they explicitly compute its expected value, which is only practical in cases where the features are low-dimensional and discrete. Neither of these approaches addresses the classification of histological tissues. Yet other approaches do address the classification of histological tissues with a cascade of classifiers.
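The sequential evaluation of a cascade, as described above, might be sketched as follows. Each stage pairs a feature extractor with a classifier, and evaluation stops at the first positive answer so that later, costlier features are never computed. The stage definitions, thresholds, and ROI values are hypothetical, and the joint training of the stages is not shown.

```python
from typing import Callable, List, Sequence, Tuple

Extractor = Callable[[Sequence[float]], List[float]]
Classifier = Callable[[List[float]], bool]
Stage = Tuple[Extractor, Classifier]


def run_cascade(roi: Sequence[float], stages: Sequence[Stage]) -> Tuple[bool, int]:
    """Evaluate stages in order and stop at the first positive answer.

    Returns (label, number_of_stages_evaluated).
    """
    for index, (extract, classify) in enumerate(stages, start=1):
        features = extract(roi)  # features are computed only if this stage is reached
        if classify(features):
            return True, index
    return False, len(stages)


# Hypothetical stages: an inexpensive feature first, a second feature afterwards.
stages: List[Stage] = [
    (lambda roi: [sum(roi) / len(roi)], lambda f: f[0] > 0.8),  # mean intensity
    (lambda roi: [max(roi) - min(roi)], lambda f: f[0] > 0.6),  # intensity range
]

result_a = run_cascade([0.9, 0.95, 0.85], stages)  # positive at stage 1
result_b = run_cascade([0.1, 0.9, 0.2], stages)    # positive at stage 2
result_c = run_cascade([0.3, 0.4, 0.35], stages)   # negative after all stages
```

Because each stage is tuned for very few false positives, an early positive answer can be trusted, and the number of stages evaluated gives a rough measure of the computation spent on each ROI.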